Integrating `StateMachine`s (gg20) into existing network stack and its difficulties.

drewstone commented 2 years ago

I'm looking to extend an already existing gossip based protocol w/ a DKG and trying to execute the state machine for gg20's Keygen first. It seems difficult with the existing architecture to do so given how tightly coupled things seem, starting with how most methods and struct fields are private.

The scenario is as follows.

I have handlers for finality messages of a given blockchain and on handling these messages I start/execute a DKG keygen and signing protocol.
I also have a gossip_engine that I can gossip messages from. This is all handled synchronously and the async logic is abstracted away.
Lastly, I have handlers for messages gossiped over a particular topic which I can once again continue broadcasting data from.

The goal and assumptions

I would like to execute the state machine directly when I either start a new DKG or receive messages from the gossip network.
I assume that the system can work by interacting with the proceed API directly (even though it is currently private).

Recommendations (which may be naive but want to share nonetheless)

Given the state of the StateMachine (such as Keygen) it would be great to have a method that returns the next message to be broadcasted over the network. This way I can always call something like get_active_message and continue to gossip this without worrying much.
Publicise the fields of the Keygen and other StateMachines as well as the methods for StateMachine such as proceed.

Let me know if there's prior art or ideas I should consider to think through this properly. Cheers

drewstone commented 2 years ago

Some other difficulties I'm facing relate to the Simulations. They take all parties up front and I'm not sure how to analyse the architecture as is when working with an environment where the parties are in fact distributed. How does one execute these protocols within a distributed context if the StateMachine is privately executable? Are there examples for this somewhere? It seems all methods to proceed the system are private and executed by a Simulation in all the examples I can find.

nskybytskyi commented 2 years ago

@drewstone disclaimer: I am not a developer of this library, but I used it a bit

there is .message_queue() that returns a queue of messages that a state machine wants to send. You can loop over it if it is nonempty. Once all messages are sent you would need to .handle_incoming() received ones;
afaik, state machine was designed not to call .proceed() itself, because it may cause some unpleasing latency. You can call .wants_to_proceed() and .proceed() methods when you are ready to face it;
the more public API your library has the more often you have to bump the version 🙃 (if you obey semver, ofc). I guess it might be possible reasoning behind keeping things private. Anyway, that is something easy to fix with a fork;

I guess you can also keep track of #132, it may answer some of your questions soon.

survived commented 2 years ago

Thanks, @Sky-Nik, you impressively covered the question! @drewstone Indeed, what you need in order to carry out the protocol is methods of StateMachine trait only. Workflow looks like:

If .message_queue() is not empty, empty it by sending queued messages
If state machine .wants_to_proceed(), then call .proceed() (and be aware that this method may take a while to complete)
State machine needs to handle every message you receive from other parties with .handle_incoming()
If protocol .is_finished(), retrieve result with .pick_output()

However, I would recommend to work with StateMachine methods directly only if you needed an incredibly precise control on protocol execution. Otherwise, just use AsyncProtocol that takes asynchronous channels of incoming and outgoing messages and carries out the protocol (see docs examples).

Btw, we work on major revision of round_based crate. It will follow completely different (I hope, simpler) design, specifically, StateMachine trait will be removed. It will be released in a few weeks, and GG20 will be updated to use it. We'll add usage examples and update README (#132)

dev0x1 commented 2 years ago

Thanks, @Sky-Nik @survived for the heads up, I am working with @drewstone. I have used StaeMachine APIs for GG20 KeyGen and implementation follow the steps mentioned above and I am able to generate the key for our use case.
However, I am seeing some strange behavior that state's message_queue keeps on increasing by one message in each round. For t=2, n=3 settings, message_queue length is 5 at final round,

{Keygen at round=2 msgs1=[None] msgs2=[0/2] msgs3=[0/2] msgs4=[0/2] queue=[len=2]}
.
.
.
  {Keygen at round=[Final] msgs1=[None] msgs2=[None] msgs3=[None] msgs4=[None] queue=[len=5]}

The items in message_queue are the protocol messages till that round, for example after final round, queue contains [KeyGenBroadcastMessage1, KeyGenDecommitMessage1, VerifiableSS, VerifiableSS, DLogProof].

So whenever a party wants to send messages to other parties, queue_messages gives previous round messages to send, and receiving party gets nx(previous_round_messages) bogus messages handling which with handle_incoming() fails with Critical error: ReceivedOutOfOrderMessage which I am ignoring as of now.

I am wondering if I have to do anything else to consume queue_messages. For now, I get an into_iter() on queue_messages and use move while map to encode the messages. Any help is much appreciated. Thanks

survived commented 2 years ago

message_queue keeps on increasing by one message in each round.

Hi @dev0x1, do you delete messege from the queue after you sent it? State machine only adds messages to the queue, it doesn't delete old messages. "Out of order message" error might be caused by double sending the same message

dev0x1 commented 2 years ago

Thanks @survived for your quick response. I was just over-cautious to only use StateMachine's public APIs. Now I am clearing the queue after sending the messages and everything works fine. Maybe, in case StaeMachine is going to be around for some time, we should document this somewhere or we provide an API in StateMachine to clear the queue.

survived commented 2 years ago

You're right, it's not obvious from documentation that messages need to be deleted from the queue. I updated .message_queue method documentation

tmpfs commented 2 years ago

I hit a snag updating the webassembly demo to gg2020 using the state machine directly so I wrote a test case to interact with the state machine without any networking. I thought I would add it here in case anyone wants to see a working example using the state machine without AsyncProtocol or Simulation.

@survived, is it worth adding this test to the repository as an example for people that need to interact with the state machine(s) directly?

Also, (and maybe this should be a different issue), I noticed that during key generation that wants_to_proceed() isn't matching my expectations. If you uncomment this line:

https://github.com/LavaMoat/gg2020-test/blob/main/src/lib.rs#L54

The test will fail. Am I misunderstanding the expected behavior or is this a bug?

survived commented 2 years ago

The test will fail. Am I misunderstanding the expected behavior or is this a bug?

.wants_to_proceed() indicates whether StateMachine wants to do some expensive computation. handle_incoming actually may jump to the next round if this operation is not expensive. The only intention of having .proceed() is to be able to handle heavy computations in different way (e.g. AsyncProtocol calls spawn_blocking)

You correctly observed that round 1 of DKG is not expensive and thus doesn't require calling .proceed()

survived commented 2 years ago

for people that need to interact with the state machine(s) directly?

@tmpfs, could you describe your case, why do you need to work with it directly? Asking because next version of round-based will have state machine removed, ie the only way to run the protocol will be by providing Stream and Sink for receiving and sending messages. I thought it's going to be easier to work with, but maybe I'm missing some use-cases?

tmpfs commented 2 years ago

@survived, because in webassembly with wasm-bindgen it makes more sense to do the networking on the Javascript side. Whilst it is possible to make network connections in js_sys it is complicated and would be especially awkward in the multi-threaded situation with multiple workers / instantiated WASM modules.

I would really appreciate it if we could keep the StateMachine design to facilitate this use case 🙏

survived commented 2 years ago

@tmpfs does web assembly prevents you from using async code? I'm not saying you have to implement delivery logic within Stream/Sink, you can just use UnboundedReceiver/UnboundedSender and feed them messages from js-code. You won't need to manually drive StateMachine, just supply these channels and everything works just fine.

drewstone commented 2 years ago

For background @survived, we are not using this library where we provide the Stream/Sink. Maybe we lack experience seeing how exactly it fits for our use case but we're integrating into an existing gossip based system.

We aren't using async/await directly here either. I'm not sure removing the StateMachine will make it easier for us to use this library, but we'll see once these changes you mention land I suppose.

tmpfs commented 2 years ago

@tmpfs does web assembly prevents you from using async code? I'm not saying you have to implement delivery logic within Stream/Sink, you can just use UnboundedReceiver/UnboundedSender and feed them messages from js-code. You won't need to manually drive StateMachine, just supply these channels and everything works just fine.

@survived, whilst I know that it is possible to use futures in webassembly I don't know exactly what it will look like as I have never tried.

However, I have some concerns about this approach:

1) I would need to embed an async runtime (tokio, async-std etc) into the web assembly module increasing the size of the WASM file. 2) I then need to convert the futures to promises to return values to the Javascript side (maybe using wasm-bindgen-futures). 3) I think adding the polling of the async runtime to what is effectively CPU-bound synchronous code adds unnecessary overhead and complexity.

Creating the Stream / Sink abstraction is a worthwhile effort but please don't sacrifice the synchronous use case. Can you build the Stream / Sink API as an optional and recommended way to use the API but build it on top of StateMachine so we can still call StateMachine manually?

survived commented 2 years ago

@tmpfs, unfortunately, StateMachine implementation is difficult to maintain. Any protocol implementation (e.g. gg20) requires a lot of non-trivial boilerplate, and it's simply difficult to deliver bugfixes, analyse the code, etc. New version of library allows to keep the code clean and clear, that's the whole point. But cost of it is tying to async.

Regarding your concerns:

Indeed, binary size is increased. Eventually, we may provide support for something like smol which should address your concern.
You can use Runtime::block_on to build a blocking API
Some overhead is added, not sure if it's significant.

To sum up, I believe it should work for your case just at additional cost of increased binary size and some (I believe, minor) computational overhead. In return, implementation of gg20 will get simpler, cleaner, easier to maintain, which is important security-wise.

tmpfs commented 2 years ago

Thanks for the explanation and rationale @survived! I wonder is this new version aspirational or is there some code that is work in progress I could take a look at?

If there is a branch I would really like to take a look and see what the shape of the new API might look like. Thanks 🙏

survived commented 2 years ago

There's round-based2 branch in the repo, but it's poorly documented at this point, though most of api is established. You can look at round-based-ing, lib that integrates round-based and ING's implementation of GG18: there're two structs Keygen and Signing, they can carry out DKG and threshold signing. API for multi-party-ecdsa is going to be similar.

Feel free to ask questions

ZenGo-X / multi-party-ecdsa

Integrating `StateMachine`s (gg20) into existing network stack and its difficulties. #139