Problem to be solved
The DKG protocol requires all peers to shut down gracefully before writing the output files to disk.
The problem is that this cannot be made atomic across the whole cluster. In some networking edge cases, most peers see that everyone shut down, while one or two do not. Shutdown response messages can get lost in transit over broken connections, so some nodes shut down successfully while others cannot reconnect and therefore error out.
This results in those peers not getting DKG output files, while others do.
Proposed solution
Note that this is a classic Two Generals problem: "How do I know you received my ack to your shutdown message?".
Note that this is exacerbated by relay connections, which break every 2 minutes and need to be re-established. If we upgrade relay connections to direct hole-punched connections, they are more stable and should break less often. So hole punching mitigates this issue, since it results in direct connections... unless those connections also break every 2 minutes; I still need to confirm that. If they do break every 2 minutes, then it doesn't help.
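For context, a minimal sketch of how hole punching could be enabled, assuming the networking layer is go-libp2p; the specific options and whether the resulting direct connections are long-lived are assumptions, which is exactly what still needs confirming:

```go
// Minimal sketch, assuming go-libp2p is the networking layer.
package main

import (
	"fmt"
	"log"

	"github.com/libp2p/go-libp2p"
)

func main() {
	h, err := libp2p.New(
		libp2p.EnableRelay(),        // keep relayed connections as a fallback
		libp2p.EnableHolePunching(), // try to upgrade relayed connections to direct ones
		libp2p.NATPortMap(),         // also attempt UPnP/NAT-PMP port mapping
	)
	if err != nil {
		log.Fatal(err)
	}
	defer h.Close()

	fmt.Println("host ID:", h.ID())
}
```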
One solution to this is to not require ALL nodes to shut down gracefully:
If an error occurs during shutdown,
but the node has sent shutdown messages to, and received them from, at least a quorum of other peers,
then we still write the output files to disk,
but we also print a WARNING that something went wrong and the operator should confirm with the group that everyone got their results (see the sketch below).
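A minimal sketch of that decision rule, assuming the DKG code is Go; the names (peersAcked, quorum, writeOutputFiles) are illustrative placeholders, not the actual functions in the codebase:

```go
package dkg

import (
	"fmt"
	"log"
)

// decideOutput writes the DKG output files if at least a quorum of peers
// exchanged shutdown messages with us, even when full graceful shutdown failed.
func decideOutput(shutdownErr error, peersAcked, quorum int, writeOutputFiles func() error) error {
	if shutdownErr == nil {
		// Happy path: everyone shut down gracefully.
		return writeOutputFiles()
	}
	if peersAcked >= quorum {
		// Partial success: we exchanged shutdown messages with a quorum,
		// so the rest of the cluster almost certainly completed DKG.
		// Write the outputs, but warn the operator to cross-check.
		if err := writeOutputFiles(); err != nil {
			return err
		}
		log.Printf("WARNING: graceful shutdown failed (%v); output files were still written "+
			"because %d peers (quorum=%d) acknowledged shutdown. Please confirm with the group "+
			"that everyone got their results.", shutdownErr, peersAcked, quorum)
		return nil
	}
	return fmt.Errorf("shutdown failed before reaching quorum (%d < %d): %w", peersAcked, quorum, shutdownErr)
}
```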
Out of Scope
We still follow the general approach of "wait for all peers to connect (which might take hours)", but once all peers are connected, we start DKG in "fail-fast" mode, so any error SHOULD result in all peers exiting. We still do not want to add retry logic during DKG, since that complicates the code a lot.
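For illustration, the fail-fast behaviour could look like the sketch below, assuming the DKG steps run as concurrent goroutines: the first error cancels the shared context so every step exits instead of retrying. The step functions are placeholders, not the real DKG steps.

```go
package dkg

import (
	"context"

	"golang.org/x/sync/errgroup"
)

// runDKG runs all protocol steps and cancels the shared context on the first
// error, so all steps (and the peer connections they use) stop promptly.
func runDKG(ctx context.Context, steps []func(context.Context) error) error {
	g, ctx := errgroup.WithContext(ctx)
	for _, step := range steps {
		step := step // capture loop variable (needed before Go 1.22)
		g.Go(func() error {
			return step(ctx) // any error cancels ctx for all other steps
		})
	}
	return g.Wait() // returns the first error; the caller should exit on non-nil
}
```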