ObolNetwork / charon

Charon (pronounced 'kharon') is a Proof of Stake Ethereum Distributed Validator Client
https://docs.obol.tech/
Other
197 stars 88 forks source link

DKG network issues on shutdown #887

Closed corverroos closed 1 year ago

corverroos commented 2 years ago

Problem to be solved

The DKG protocol requires all peers to shutdown gracefully before writing the output files to disk.

The problem is that this cannot be atomic across the whole cluster. In some networking edge cases, some peers would see that everyone shutdown, while one or two might not. Shutdown response messages might get lost during transit due to broken connections, resulting in some nodes successfully shutting down and others not being able to reconnect and will therefore error.

This results in those peers not getting DKG output files, while others do.

Proposed solution

Note this is a classic Two Generals problem: "How do I know you received my ack to your shutdown msg?".

Note that this is exacerbated by relay connections that break every 2min and need to be re-established. If we upgrade relay connections to direct hole-punched connections, then they are more stable, and they should break less. So hole punching mitigates this issue, since it results in direct connections.... unless those connections also break every 2min... I still need to confirm that. If they do break every 2min, then it doesn't help.

One solution to this is to not require ALL nodes to shutdown gracefully:

Out of Scope

We still follow the general approach of "wait for all peers to connect (which might take hours)", but once all connected, we start DKG with "Fail-Fast", so any error results in exit should SHOULD result in all peers exiting. We still do not want to add retry logic during DKG since this complicated the code a lot.

corverroos commented 2 years ago

image seems like a race condition between relay circuit expiry and peers finishing DKG:

boulder225 commented 1 year ago

To check if we can still reproduce it, or not @gsora @dB2510 @xenowits

corverroos commented 1 year ago

The explicit sync steps probably solves this issue.