Closed: OisinKyne closed this issue 1 year ago
Note there is a simple workaround for this: just wrap the DKG command in a while-not loop.

```sh
while ! docker run --rm -v "$(pwd):/opt/charon" obolnetwork/charon:v0.12.0 dkg; do echo "Retrying DKG"; done
```
Suggest:

- Refactor `client.AwaitConnected` into `client.IsConnected`.
- Remove `client.DisableReconnect`.
- Refactor `c.sendMsgs` to return `(relayBroke bool, connBroke bool, otherErr error)`.
- Refactor `c.Run` into two states based on internal `c.reconnect` bool state:
  - `c.reconnect=true`: reconnect on either `relayBroke` or `connBroke`, with `retry=true`.
  - `c.reconnect=false`: only reconnect on `relayBroke`, with `retry=false`.
- Call `c.clearConnected` when `c.sendMsgs` returns; add `defer s.clearConnected(pID)` to the server.
- Only mark the client connected once `c.connect` succeeded and the disconnect wasn't due to `relayBroke`.
- Backoff before `c.connect` when the disconnect is due to `connBroke`.

Suggest test cases:
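The two-state reconnect policy suggested above can be sketched as a small decision function. This is an illustrative sketch only: `decide` is a hypothetical helper, and `reconnect`/`relayBroke`/`connBroke` mirror the suggested `c.reconnect` field and `c.sendMsgs` return values, not charon's actual code.

```go
package main

import "fmt"

// decide captures the suggested two-state reconnect policy for c.Run:
//   - reconnect=true:  reconnect on either relayBroke or connBroke, with retry=true.
//   - reconnect=false: reconnect only on relayBroke, with retry=false.
func decide(reconnect, relayBroke, connBroke bool) (doReconnect, retry bool) {
	if reconnect {
		return relayBroke || connBroke, true
	}
	return relayBroke, false
}

func main() {
	fmt.Println(decide(true, false, true))  // connBroke while in reconnect state: reconnect with retry
	fmt.Println(decide(false, true, false)) // relayBroke: always reconnect, but without retry
	fmt.Println(decide(false, false, true)) // connBroke outside reconnect state: give up
}
```

Keeping the policy in one pure function like this would also make the suggested test cases trivial to table-drive.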
Problem to be solved
We originally made the DKG connection brittle, failing fast when a connection was dropped. This is making asynchronous DKGs difficult.
Proposed solution
Let's not abort if we lose a peer; instead, try to recover the connection later. At the end of the DKG, we should keep the process running for one extra second while we check whether all peers are still connected. If it appears they are not, we should print a big warning to the effect of: "A participant may have had connection issues during the DKG; please confirm with every operator that their private key shares were created in `.charon/validator_keys`."
In Scope
Updating our tests to assert that the DKG command can handle a number of random process exits.
Out of Scope
If the DKG failed, we shouldn't auto-retry it. We should only auto-reconnect during the 'waiting to connect to peers' phase, not in the middle of the ceremony phase; there we can outright error, preferably with a helpful error message.