Problem to be solved
The DKG protocol requires all peers to shut down gracefully before writing the output files to disk.
The problem is that this cannot be made atomic across the whole cluster. In some networking edge cases, most peers see that everyone shut down, while one or two do not. Shutdown response messages can get lost in transit over broken connections, so some nodes shut down successfully while others cannot reconnect and therefore error out.
This results in those peers not getting DKG output files, while others do.
Proposed solution
Note that this is a classic Two Generals problem: "How do I know you received my ack to your shutdown message?".
Note that this is exacerbated by relay connections, which break every 2 minutes and need to be re-established. If we upgrade relay connections to direct hole-punched connections, they are more stable and should break less often. So hole punching mitigates this issue, since it results in direct connections... unless those connections also break every 2 minutes; I still need to confirm that. If they do break every 2 minutes, then it doesn't help.
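For context, a minimal sketch of how hole punching could be enabled, assuming the networking layer is go-libp2p; the specific options and whether the resulting direct connections are long-lived are assumptions, which is exactly what still needs confirming:

```go
// Minimal sketch, assuming go-libp2p is the networking layer.
package main

import (
	"fmt"
	"log"

	"github.com/libp2p/go-libp2p"
)

func main() {
	h, err := libp2p.New(
		libp2p.EnableRelay(),        // keep relayed connections as a fallback
		libp2p.EnableHolePunching(), // try to upgrade relayed connections to direct ones
		libp2p.NATPortMap(),         // also attempt UPnP/NAT-PMP port mapping
	)
	if err != nil {
		log.Fatal(err)
	}
	defer h.Close()

	fmt.Println("host ID:", h.ID())
}
```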
One solution to this is to not require ALL nodes to shut down gracefully:
If an error occurs during shutdown,
but the node has sent shutdown messages to, and received them from, at least a quorum of other peers,
then we still write the output files to disk,
but we also print a WARNING that something went wrong and the operator should confirm with the group that everyone got their results (see the sketch below).
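A minimal sketch of that decision rule, assuming the DKG code is Go; the names (peersAcked, quorum, writeOutputFiles) are illustrative placeholders, not the actual functions in the codebase:

```go
package dkg

import (
	"fmt"
	"log"
)

// decideOutput writes the DKG output files if at least a quorum of peers
// exchanged shutdown messages with us, even when full graceful shutdown failed.
func decideOutput(shutdownErr error, peersAcked, quorum int, writeOutputFiles func() error) error {
	if shutdownErr == nil {
		// Happy path: everyone shut down gracefully.
		return writeOutputFiles()
	}
	if peersAcked >= quorum {
		// Partial success: we exchanged shutdown messages with a quorum,
		// so the rest of the cluster almost certainly completed DKG.
		// Write the outputs, but warn the operator to cross-check.
		if err := writeOutputFiles(); err != nil {
			return err
		}
		log.Printf("WARNING: graceful shutdown failed (%v); output files were still written "+
			"because %d peers (quorum=%d) acknowledged shutdown. Please confirm with the group "+
			"that everyone got their results.", shutdownErr, peersAcked, quorum)
		return nil
	}
	return fmt.Errorf("shutdown failed before reaching quorum (%d < %d): %w", peersAcked, quorum, shutdownErr)
}
```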
Out of Scope
We still follow the general approach of "wait for all peers to connect (which might take hours)", but once all peers are connected, we start DKG in "fail-fast" mode, so any error SHOULD result in all peers exiting. We still do not want to add retry logic during DKG, since that complicates the code a lot.
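For illustration, the fail-fast behaviour could look like the sketch below, assuming the DKG steps run as concurrent goroutines: the first error cancels the shared context so every step exits instead of retrying. The step functions are placeholders, not the real DKG steps.

```go
package dkg

import (
	"context"

	"golang.org/x/sync/errgroup"
)

// runDKG runs all protocol steps and cancels the shared context on the first
// error, so all steps (and the peer connections they use) stop promptly.
func runDKG(ctx context.Context, steps []func(context.Context) error) error {
	g, ctx := errgroup.WithContext(ctx)
	for _, step := range steps {
		step := step // capture loop variable (needed before Go 1.22)
		g.Go(func() error {
			return step(ctx) // any error cancels ctx for all other steps
		})
	}
	return g.Wait() // returns the first error; the caller should exit on non-nil
}
```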