Closed DrZoltanFazekas closed 3 months ago
the 20 sec minimum delay observed is due to the formula used to compute the exponential_backoff_timeout
at https://github.com/Zilliqa/zq2/blob/main/zilliqa/src/consensus.rs#L453
since self.view.get_view().saturating_sub(self.high_qc.view)
is 2 when we're waiting for a new proposal (the last proposal we received in view self.view.get_view() - 1
contained self.high_qc
whose view was self.view.get_view() - 2
) and thus we get 5 sec 2^2 = 5 sec 4 = 20 sec
here an example where the gap due to a missing view amounts to 80 sec:
{"timestamp":"2024-07-01T14:32:36.866176Z","level":"TRACE","fields":{"message":"Not proceeding with view change. Current view: 2142 - time since last: 79762, timeout requires: 80000"},"target":"zilliqa::consensus","line_number":400}
the value 80000 is the exponential_backoff_timeout
computed at https://github.com/Zilliqa/zq2/blob/main/zilliqa/src/consensus.rs#L453C43-L453C98 as consensus_timeout_ms * 2u64.pow(view_difference as u32)
here another example where the actual gap was 60 sec:
2120 2113 14:29:37 0x9a21038511f75492bac85e3d6eae5b37c3fbbdca 0 []
####################### view 2121 + 2122 missing - 60 sec
2123 2113 14:30:37 0x3e422c617eb2880f7ac7be512755e0e103bb916b 0 []
and the respective log items
waiting for the 20000 ms timeout:
{"message":"Not proceeding with view change. Current view: 2121 - time since last: 19569, timeout requires: 20000"},"target":"zilliqa::consensus","line_number":400}
and the 40000 ms timeout:
{"timestamp":"2024-07-01T14:30:36.810235Z","level":"TRACE","fields":{"message":"Not proceeding with view change. Current view: 2122 - time since last: 39635, timeout requires: 40000"},"target":"zilliqa::consensus","line_number":400}
As we can see, the last view found in a block was 2120 i.e. the timeout of view 2121 took 20 sec and the timeout of 2122 took 40 sec, i.e. they took altogether 60 sec.
Once https://github.com/Zilliqa/zq2/issues/928 is implemented we can check if the offline validator was indeed chosen as the leader in 2 consequent views.
An example where the gap was 80 sec:
2138 2128 14:30:55 0xcca93a2e1169caf02e515203d1539cc3f390890a 0 [] <- highQC
2139 2129 14:30:56 0x3907d3d5397af065c5c8b8c4389278de28222885 0 []
####################### view 2140 missing - 20 sec
2141 2129 14:31:17 0x3e422c617eb2880f7ac7be512755e0e103bb916b 0 [] no new highQC
####################### view 2142 missing - 80 sec
2143 2129 14:32:37 0xcca93a2e1169caf02e515203d1539cc3f390890a 0 []
As we can see, in view 2142 the highQC's view is still 2138 i.e. the exponential_backoff_timeout
equals 5 2 ^ (2142 - 2138) = 5 16 = 80 sec
The timeouts observed above seem to be correct except for the 4x factor, which can be eliminated by modifying
https://github.com/Zilliqa/zq2/blob/main/zilliqa/src/consensus.rs#L453
to
let exponential_backoff_timeout = consensus_timeout_ms * 2u64.pow((view_difference as u32).saturating_sub(2));
Fixed by #1125
We tested shutting down one of the 4 validators. In views when the offline validator was the leader, it didn't propose a block, but the rest of the validators behaved according to the protocol: they sent new view messages to the next leader and the next leader proposed a block. However, the gap caused by the missing proposal in the failed views was much longer than expected the 5 sec timeout.
Here's a log of how long the gaps took: