Fix timeouts growing in Zug when network is stalled

fizyk20 commented 3 weeks ago

If more than 1/3 of the validators by weight go offline, the network stalls. This is expected, since the network requires at least 2/3 of the validators to be correct (online and adhering to the protocol) in order to make progress.
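As a rough illustration of the threshold described above, here is a minimal sketch of the stall condition. The function name and the use of plain `u64` weights are assumptions made for this example; this is not the node's actual API.

```rust
/// Hypothetical helper: returns true when strictly more than 1/3 of the
/// total validator weight is offline, i.e. the condition under which the
/// network is expected to stall.
pub fn is_stalled(offline_weight: u64, total_weight: u64) -> bool {
    // Compare 3 * offline > total in u128 to avoid overflow and
    // the rounding that integer division would introduce.
    (offline_weight as u128) * 3 > total_weight as u128
}
```

With a total weight of 90, an offline weight of 30 is exactly 1/3 and does not count as stalled, while 31 does.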

Right now, if the network stalls while using Zug, the remaining validators time out waiting for a proposal to be accepted and increase their timeouts to adjust to what they perceive as network delays. However, after timing out they set another timer and the cycle repeats, so the timeouts grow without bound.

This PR makes validators time out at most once per round. This way, if the network stalls, they increase their timeout once and then wait for the round to end (either by becoming skippable or by having an accepted proposal). The round will end once enough validators are back online, but while the network is stalled the timeout no longer increases further, which fixes #4927 while preserving the algorithm's assumptions.
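The "at most once per round" rule can be sketched as follows. This is an illustrative model only: the type, method names, and the doubling factor are assumptions for the example and are not taken from the Zug implementation.

```rust
/// Hypothetical per-round timeout state (names are illustrative only).
#[derive(Debug, Clone, PartialEq)]
pub struct RoundTimeout {
    /// Current proposal timeout in milliseconds.
    pub timeout_ms: u64,
    /// Whether this validator has already timed out in the current round.
    timed_out_this_round: bool,
}

impl RoundTimeout {
    pub fn new(initial_timeout_ms: u64) -> Self {
        RoundTimeout {
            timeout_ms: initial_timeout_ms,
            timed_out_this_round: false,
        }
    }

    /// Called when the proposal timer fires.
    /// Returns true if the timeout was increased by this call.
    pub fn on_timer_expired(&mut self) -> bool {
        if self.timed_out_this_round {
            // Already timed out once in this round: do not grow again.
            return false;
        }
        self.timed_out_this_round = true;
        // Assumed growth factor of 2; the real factor may differ.
        self.timeout_ms = self.timeout_ms.saturating_mul(2);
        true
    }

    /// Called when the round ends, either because it became skippable
    /// or because a proposal was accepted.
    pub fn on_round_finished(&mut self) {
        self.timed_out_this_round = false;
    }
}

/// Simulates `timer_fires` timer expirations within a single stalled round
/// and returns the resulting timeout.
pub fn timeout_after_stalled_round(initial_ms: u64, timer_fires: u32) -> u64 {
    let mut state = RoundTimeout::new(initial_ms);
    for _ in 0..timer_fires {
        state.on_timer_expired();
    }
    state.timeout_ms
}
```

In this model, ten timer expirations during one stalled round raise a 1000 ms timeout to 2000 ms and no further. Without the per-round flag, the same ten expirations would each double it.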

EdHastingsCasperAssociation commented 3 weeks ago

bors r+

casperlabs-bors-ng[bot] commented 3 weeks ago

Build succeeded:

continuous-integration/drone/push

casper-network / casper-node

Fix timeouts growing in Zug when network is stalled #4945