ObolNetwork / charon

Charon (pronounced 'kharon') is a Proof of Stake Ethereum Distributed Validator Client
https://docs.obol.tech/

Make DKG program fault tolerant during 'connecting to peers' phase. #1475

Closed OisinKyne closed 1 year ago

OisinKyne commented 1 year ago

Problem to be solved

We originally made the DKG connection brittle: it fails fast whenever a connection is dropped. This makes asynchronous DKGs difficult.

Proposed solution

Let's not abort if we lose a peer, and instead try to recover the connection later. At the end of the DKG, we should keep the process running for one extra second while we check whether all peers are still connected. If it appears they are not, we should print a big warning saying something to the effect of: "There may have been a participant that had connection issues during the DKG. Please confirm with every operator that their private key shares were created in .charon/validator_keys."
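
The confirmation that the proposed warning asks each operator for could be scripted. The following is a minimal sketch, not part of Charon: `check_key_shares` is a hypothetical helper, and the `keystore-*.json` file naming is an assumption that should be adjusted to whatever the DKG actually writes into `.charon/validator_keys`.

```shell
#!/usr/bin/env bash
# Hypothetical helper: warn if no key share files are found after a DKG.
# ASSUMPTION: key shares are named keystore-*.json; adjust the glob if not.
check_key_shares() {
  local dir="${1:-.charon/validator_keys}"
  local count=0
  local f
  for f in "$dir"/keystore-*.json; do
    # With no matches the glob stays literal, so confirm the file exists.
    if [ -e "$f" ]; then
      count=$((count + 1))
    fi
  done
  if [ "$count" -eq 0 ]; then
    echo "WARNING: no key shares found in $dir" >&2
    return 1
  fi
  echo "Found $count key share file(s) in $dir"
}
```

Each operator could run `check_key_shares` from the directory where the DKG was executed and report the result to the other participants.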

In Scope

Updating our tests to assert that the DKG command can handle a number of random process exits.

Out of Scope

If the DKG fails, we shouldn't automatically retry it. We should only reconnect automatically in the 'waiting to connect to peers' phase, not in the middle of the ceremony phase. There we can error outright, preferably with a helpful error message.

corverroos commented 1 year ago

Note there is a simple workaround for this: just wrap the DKG command in a while-not loop.

while ! docker run --rm -v "$(pwd):/opt/charon" obolnetwork/charon:v0.12.0 dkg; do echo Retrying DKG; done
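
The loop above retries forever and without pause. A bounded variant might look like the sketch below. The `retry` function, its attempt limit, and its delay are illustrative assumptions rather than anything Charon provides; the docker invocation in the usage comment is the one from the workaround above.

```shell
#!/usr/bin/env bash
# retry MAX DELAY CMD...: run CMD until it succeeds, at most MAX times,
# sleeping DELAY seconds between attempts. Hypothetical helper.
retry() {
  local max="$1" delay="$2"
  shift 2
  local attempt=1
  until "$@"; do
    if [ "$attempt" -ge "$max" ]; then
      echo "Giving up after $attempt attempts" >&2
      return 1
    fi
    echo "Attempt $attempt failed, retrying in ${delay}s" >&2
    attempt=$((attempt + 1))
    sleep "$delay"
  done
}

# Usage (same command as the workaround above):
#   retry 5 10 docker run --rm -v "$(pwd):/opt/charon" obolnetwork/charon:v0.12.0 dkg
```

Capping the attempts means a ceremony that can never succeed ends with a non-zero exit status instead of looping indefinitely.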
corverroos commented 1 year ago

Suggest:

corverroos commented 1 year ago

Suggest test cases: