Zilliqa / zq2

Zilliqa 2.0 code base
Apache License 2.0

A restarted node does not set its high_qc to that of other nodes' NewView messages #1798

Open 86667 opened 5 days ago

86667 commented 5 days ago

TLDR: We need to re-request missed proposals after startup (I believe)

The issue arises when a node goes down before receiving a proposal that the other nodes do receive. After the restart, that node is one view behind the others, and its high_qc is one view behind as well, so all nodes end up with the same exponential backoff period. As a result, the restarted node never syncs.

Example logs:

The node that went down, with high_qc view 2, has an exponential backoff timeout of 40_000 for view 6:

2024-11-12T12:11:31.636766Z TRACE zilliqa::consensus: 507: Not proceeding with view change. Current view: 6 - time since last: 4575, timeout requires: 40000

The other nodes, with high_qc view 3, have an exponential backoff timeout of 40_000 for view 7:

2024-11-12T12:11:32.122232Z TRACE zilliqa::consensus: 507: Not proceeding with view change. Current view: 7 - time since last: 21155, timeout requires: 40000

Other nodes will always move on to the next view before the node that went down can catch up.
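
To illustrate why the timeouts line up, here is a minimal sketch. It assumes the backoff depends only on the gap between the current view and the high_qc view; the base value of 5_000 ms and the exact exponent are my guesses chosen to match the logs above, not taken from the zq2 code, and `backoff_timeout_ms` is a hypothetical name.

```rust
/// Hypothetical base timeout in milliseconds (assumed, not taken from zq2).
const BASE_TIMEOUT_MS: u64 = 5_000;

/// Sketch of a backoff that depends only on how far the current view is
/// ahead of the view of the node's high_qc (assumed shape of the formula).
fn backoff_timeout_ms(current_view: u64, high_qc_view: u64) -> u64 {
    let gap = current_view.saturating_sub(high_qc_view);
    // Double the timeout for every view beyond the first one after high_qc.
    let exponent = gap.saturating_sub(1) as u32;
    BASE_TIMEOUT_MS.saturating_mul(2u64.saturating_pow(exponent))
}
```

Under these assumptions, `backoff_timeout_ms(6, 2)` (restarted node) and `backoff_timeout_ms(7, 3)` (other nodes) both give 40_000, because the gap is 4 in both cases. The restarted node therefore waits just as long as the others and stays one view behind.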

What should happen is that once our restarted node becomes leader, it receives the NewView messages for a view higher than its own and updates its view at that point. However, it fails to process the NewView messages at this point because of:

parent not found while determining leader for view