Zilliqa / zq2

Zilliqa 2.0 code base
Apache License 2.0
9 stars 0 forks source link

View timeouts take too long #1106

Closed DrZoltanFazekas closed 3 months ago

DrZoltanFazekas commented 3 months ago

We tested shutting down one of the 4 validators. In views when the offline validator was the leader, it didn't propose a block, but the rest of the validators behaved according to the protocol: they sent new view messages to the next leader and the next leader proposed a block. However, the gap caused by the missing proposal in the failed views was much longer than expected the 5 sec timeout.

Here's a log of how long the gaps took:

view  block          miner                                      gas txns
-------------------------------------------------------------------------
56479 56461 06:48:36 0xcca93a2e1169caf02e515203d1539cc3f390890a 0 []
####################### view 56480 missing - 20 sec
56481 56461 06:48:56 0x9a21038511f75492bac85e3d6eae5b37c3fbbdca 0 []
####################### view 56482 missing - 80 sec
56483 56461 06:50:17 0xcca93a2e1169caf02e515203d1539cc3f390890a 0 []
56484 56462 06:50:18 0x3e422c617eb2880f7ac7be512755e0e103bb916b 0 []
56485 56463 06:50:19 0x3e422c617eb2880f7ac7be512755e0e103bb916b 0 []
56486 56464 06:50:20 0x9a21038511f75492bac85e3d6eae5b37c3fbbdca 0 []
56487 56465 06:50:21 0x9a21038511f75492bac85e3d6eae5b37c3fbbdca 0 []
56488 56466 06:50:23 0x3e422c617eb2880f7ac7be512755e0e103bb916b 0 []
56489 56467 06:50:24 0x9a21038511f75492bac85e3d6eae5b37c3fbbdca 0 []
56490 56468 06:50:25 0xcca93a2e1169caf02e515203d1539cc3f390890a 0 []
56491 56469 06:50:26 0x3e422c617eb2880f7ac7be512755e0e103bb916b 0 []
56492 56470 06:50:27 0xcca93a2e1169caf02e515203d1539cc3f390890a 0 []
56493 56471 06:50:29 0x9a21038511f75492bac85e3d6eae5b37c3fbbdca 0 []
####################### view 56495 missing - 60 sec
56496 56471 06:51:29 0xcca93a2e1169caf02e515203d1539cc3f390890a 0 []
####################### view 56497 missing - 160 sec
56498 56471 06:54:10 0x9a21038511f75492bac85e3d6eae5b37c3fbbdca 0 []
56499 56472 06:54:11 0xcca93a2e1169caf02e515203d1539cc3f390890a 0 []
####################### view 56500 missing - 20 sec
56501 56472 06:54:31 0x9a21038511f75492bac85e3d6eae5b37c3fbbdca 0 []
56502 56473 06:54:33 0x3e422c617eb2880f7ac7be512755e0e103bb916b 0 []
56503 56474 06:54:34 0x3e422c617eb2880f7ac7be512755e0e103bb916b 0 []
56504 56475 06:54:35 0x9a21038511f75492bac85e3d6eae5b37c3fbbdca 0 []
####################### view 56505 missing - 20 sec
56506 56475 06:54:55 0x3e422c617eb2880f7ac7be512755e0e103bb916b 0 []
...
56529 56496 06:55:41 0x3e422c617eb2880f7ac7be512755e0e103bb916b 0 []
####################### view 56531 missing - 60 sec
56532 56496 06:56:41 0x3e422c617eb2880f7ac7be512755e0e103bb916b 0 []
####################### view 56533 missing - 160 sec
56534 56496 06:59:22 0xcca93a2e1169caf02e515203d1539cc3f390890a 0 []
DrZoltanFazekas commented 3 months ago

the 20 sec minimum delay observed is due to the formula used to compute the exponential_backoff_timeout at https://github.com/Zilliqa/zq2/blob/main/zilliqa/src/consensus.rs#L453 since self.view.get_view().saturating_sub(self.high_qc.view) is 2 when we're waiting for a new proposal (the last proposal we received in view self.view.get_view() - 1 contained self.high_qc whose view was self.view.get_view() - 2) and thus we get 5 sec 2^2 = 5 sec 4 = 20 sec

DrZoltanFazekas commented 3 months ago

here an example where the gap due to a missing view amounts to 80 sec: {"timestamp":"2024-07-01T14:32:36.866176Z","level":"TRACE","fields":{"message":"Not proceeding with view change. Current view: 2142 - time since last: 79762, timeout requires: 80000"},"target":"zilliqa::consensus","line_number":400}

the value 80000 is the exponential_backoff_timeout computed at https://github.com/Zilliqa/zq2/blob/main/zilliqa/src/consensus.rs#L453C43-L453C98 as consensus_timeout_ms * 2u64.pow(view_difference as u32)

here another example where the actual gap was 60 sec:

2120 2113 14:29:37 0x9a21038511f75492bac85e3d6eae5b37c3fbbdca 0 []
####################### view 2121 + 2122 missing - 60 sec
2123 2113 14:30:37 0x3e422c617eb2880f7ac7be512755e0e103bb916b 0 []

and the respective log items waiting for the 20000 ms timeout: {"message":"Not proceeding with view change. Current view: 2121 - time since last: 19569, timeout requires: 20000"},"target":"zilliqa::consensus","line_number":400} and the 40000 ms timeout: {"timestamp":"2024-07-01T14:30:36.810235Z","level":"TRACE","fields":{"message":"Not proceeding with view change. Current view: 2122 - time since last: 39635, timeout requires: 40000"},"target":"zilliqa::consensus","line_number":400}

As we can see, the last view found in a block was 2120 i.e. the timeout of view 2121 took 20 sec and the timeout of 2122 took 40 sec, i.e. they took altogether 60 sec.

Once https://github.com/Zilliqa/zq2/issues/928 is implemented we can check if the offline validator was indeed chosen as the leader in 2 consequent views.

DrZoltanFazekas commented 3 months ago

An example where the gap was 80 sec:

2138 2128 14:30:55 0xcca93a2e1169caf02e515203d1539cc3f390890a 0 [] <- highQC
2139 2129 14:30:56 0x3907d3d5397af065c5c8b8c4389278de28222885 0 []
####################### view 2140 missing - 20 sec
2141 2129 14:31:17 0x3e422c617eb2880f7ac7be512755e0e103bb916b 0 [] no new highQC
####################### view 2142 missing - 80 sec
2143 2129 14:32:37 0xcca93a2e1169caf02e515203d1539cc3f390890a 0 []

As we can see, in view 2142 the highQC's view is still 2138 i.e. the exponential_backoff_timeout equals 5 2 ^ (2142 - 2138) = 5 16 = 80 sec

DrZoltanFazekas commented 3 months ago

The timeouts observed above seem to be correct except for the 4x factor, which can be eliminated by modifying https://github.com/Zilliqa/zq2/blob/main/zilliqa/src/consensus.rs#L453 to let exponential_backoff_timeout = consensus_timeout_ms * 2u64.pow((view_difference as u32).saturating_sub(2));

DrZoltanFazekas commented 3 months ago

Fixed by #1125