Make DKG program fault tolerant during 'connecting to peers' phase.

ObolNetwork / charon

Charon (pronounced 'kharon') is a Proof of Stake Ethereum Distributed Validator Client

https://docs.obol.tech/

Make DKG program fault tolerant during 'connecting to peers' phase. #1475

Closed OisinKyne closed 1 year ago

OisinKyne commented 1 year ago

Problem to be solved

We originally made the DKG connection brittle: it fails fast when a connection is dropped. This makes asynchronous DKGs difficult.

Proposed solution

Let's not abort if we lose a peer, and instead try to recover the connection later. At the end of the DKG, we should keep the process running for one extra second while we check whether all peers are still connected. If it appears they are not, we should print a big warning saying something to the effect of: "There may have been a participant that had connection issues during the DKG, please confirm with every operator that their private key shares were created in .charon/validator_keys."
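The end-of-ceremony check described above could look roughly like the following Go sketch. It is only an illustration: the `peerChecker` interface, `fakePeer`, `awaitAllConnected`, `finalCheck` and the 100ms polling interval are hypothetical names and values, not charon's actual API. It also assumes the check may return early as soon as every peer reports connected, rather than always waiting the full second.

```go
package main

import (
	"fmt"
	"time"
)

// peerChecker is a hypothetical interface, not charon's actual client type.
type peerChecker interface {
	IsConnected() bool
}

// fakePeer is a stand-in peer used only for this illustration.
type fakePeer bool

func (p fakePeer) IsConnected() bool { return bool(p) }

const (
	gracePeriod  = time.Second            // the "one extra second" from the proposal
	pollInterval = 100 * time.Millisecond // assumed polling cadence
	warning      = "There may have been a participant that had connection issues during the DKG, " +
		"please confirm with every operator that their private key shares were created in .charon/validator_keys."
)

// awaitAllConnected polls the peers until all of them report connected
// or the grace period has elapsed, whichever comes first.
func awaitAllConnected(peers []peerChecker, grace, interval time.Duration) bool {
	deadline := time.Now().Add(grace)
	for {
		all := true
		for _, p := range peers {
			if !p.IsConnected() {
				all = false
				break
			}
		}
		if all {
			return true
		}
		if time.Now().After(deadline) {
			return false
		}
		time.Sleep(interval)
	}
}

// finalCheck applies the assumed defaults above.
func finalCheck(peers []peerChecker) bool {
	return awaitAllConnected(peers, gracePeriod, pollInterval)
}

func main() {
	peers := []peerChecker{fakePeer(true), fakePeer(false)}
	if !finalCheck(peers) {
		fmt.Println("WARNING: " + warning)
	}
}
```

Note that the check only warns; it does not change the exit status, since by this point the local node has already finished its part of the ceremony.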

In Scope

Updating our tests to assert the DKG command can handle a whole number of random process exits.

Out of Scope

If the DKG failed, we shouldn't auto-retry it. We should only auto-reconnect in the 'waiting to connect to peers' phase, not in the middle of the ceremony phase. There we can error outright, preferably with a helpful error message.

corverroos commented 1 year ago

Note there is a simple workaround for this: just wrap the DKG command in a while-not loop.

while ! docker run --rm -v "$(pwd):/opt/charon" obolnetwork/charon:v0.12.0 dkg; do echo Retrying DKG; done

corverroos commented 1 year ago

Suggest:

Refactor client.AwaitConnected into client.IsConnected
Once all clients are connected then call new method client.DisableReconnect
Refactor c.sendMsgs to return (relayBroke bool, connBroke bool, otherErr error).
Refactor c.Run into two states based on internal c.reconnect bool state:
- If c.reconnect=true, reconnect on either relayBroke or connBroke with retry=true.
- If c.reconnect=false, only reconnect on relayBroke with retry=false.
- Call c.clearConnected when c.sendMsgs returns
Add defer s.clearConnected(pID) to server
Move "(dis)connected to peer" logging from server to client:
- Log "connected" when c.connect succeeded and it wasn't due to relayBroke.
- Log "disconnected" before calling c.connect and it is due to connBroke.
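The two states suggested for c.Run can be summarised as a small decision function. The Go sketch below is only an illustration of that rule: `decision` and `nextStep` are hypothetical names, and the real client would fold this logic into its run loop around c.sendMsgs and c.connect. It assumes that any case not listed in the suggestion (for example a broken connection once reconnects are disabled) means no reconnect, so the error is surfaced instead.

```go
package main

import "fmt"

// decision describes what the run loop does after sendMsgs returns.
// It is a hypothetical type used only for this sketch.
type decision struct {
	Reconnect bool // call connect again
	Retry     bool // value passed as the retry flag when reconnecting
}

// nextStep encodes the two states from the suggestion:
//   - reconnect enabled:  reconnect on relayBroke or connBroke, retry=true
//   - reconnect disabled: reconnect only on relayBroke, retry=false
//
// Every other combination results in no reconnect (assumed: surface the error).
func nextStep(reconnectEnabled, relayBroke, connBroke bool) decision {
	switch {
	case reconnectEnabled && (relayBroke || connBroke):
		return decision{Reconnect: true, Retry: true}
	case !reconnectEnabled && relayBroke:
		return decision{Reconnect: true, Retry: false}
	default:
		return decision{}
	}
}

func main() {
	// Print the full decision table for inspection.
	for _, enabled := range []bool{true, false} {
		for _, relay := range []bool{false, true} {
			for _, conn := range []bool{false, true} {
				fmt.Printf("reconnect=%v relayBroke=%v connBroke=%v -> %+v\n",
					enabled, relay, conn, nextStep(enabled, relay, conn))
			}
		}
	}
}
```

Keeping the rule in a pure function like this would make it easy to unit test separately from the networking code.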

corverroos commented 1 year ago

Suggest test cases:

2 of 3 peers connected, peer 2 drops, then reconnects, then peer 3 connects and $$$
2 of 3 peers connected, peers 1 and 2 drop, then both reconnect, then peer 2 drops and reconnects, then peer 3 connects and $$$
etc
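These scenarios can be written down as event sequences and replayed against a simple connection tracker. The Go sketch below is a pure simulation, not charon's test harness: `event` and `replay` are hypothetical names, and "all peers connected" stands in for the point at which the ceremony could start.

```go
package main

import "fmt"

// event is a hypothetical record of one peer connecting (Up=true) or dropping.
type event struct {
	Peer int
	Up   bool
}

// replay applies the events in order and returns the index of the first event
// after which all n peers (numbered 1..n) are connected, or -1 if that never happens.
func replay(n int, events []event) int {
	connected := make(map[int]bool, n)
	for i, ev := range events {
		connected[ev.Peer] = ev.Up
		count := 0
		for p := 1; p <= n; p++ {
			if connected[p] {
				count++
			}
		}
		if count == n {
			return i
		}
	}
	return -1
}

func main() {
	// Scenario 1: peers 1 and 2 connected, 2 drops and reconnects, then 3 connects.
	s1 := []event{{1, true}, {2, true}, {2, false}, {2, true}, {3, true}}

	// Scenario 2: peers 1 and 2 connected, both drop and reconnect,
	// 2 drops and reconnects again, then 3 connects.
	s2 := []event{
		{1, true}, {2, true}, {1, false}, {2, false},
		{1, true}, {2, true}, {2, false}, {2, true}, {3, true},
	}

	fmt.Println(replay(3, s1), replay(3, s2))
}
```

In both scenarios the quorum is only reached on the final event, which is the behaviour a real test would want to assert against the DKG command.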