Description
This PR aims to fix the issue where the sequencer node fails to recover after a crash (specifically when the sequencer is a PBSS node).
related node PR
Rationale
When a sequencer node crashes, it may fail to persist the journal in time. As a result, when Geth is restarted, the journal data cannot be read, leading to the loss of recent state data. This causes the sequencer to fail during the buildPayload process, rendering it unable to continue operating. The diagram below illustrates the sequencer block production flow, with the red sections highlighting the logic that cannot proceed due to the crash.
In the diagram, sequencerAction alternates between stages (1) and (2): in stage (1) the payload is constructed, and in stage (2) the data is persisted. The two stages are not fully synchronous. In stage (1), work such as filling the payload with transactions (txs) runs asynchronously, while the payload object itself is returned immediately; the two stages are coordinated by a condition variable (cond). As a result, until the update in stage (1) completes, getPayload in stage (2) waits in cond.wait.
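For illustration, here is a minimal sketch of that handoff, assuming the payload is guarded by a sync.Cond; the type, field, and function names below are illustrative, not the exact code touched by this PR.

```go
package miner

import "sync"

type executionResult struct{ /* built block, receipts, fees, ... */ }

type payload struct {
	lock sync.Mutex
	cond *sync.Cond
	full *executionResult // set once stage (1) has filled the payload
}

func newPayload() *payload {
	p := &payload{}
	p.cond = sync.NewCond(&p.lock)
	return p
}

// update runs asynchronously in stage (1) once the txs have been filled.
func (p *payload) update(res *executionResult) {
	p.lock.Lock()
	defer p.lock.Unlock()
	p.full = res
	p.cond.Broadcast() // wake any getPayload waiter
}

// getPayload runs in stage (2) and blocks in cond.Wait until update
// publishes the filled payload.
func (p *payload) getPayload() *executionResult {
	p.lock.Lock()
	defer p.lock.Unlock()
	for p.full == nil {
		p.cond.Wait()
	}
	return p.full
}
```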
After the crash, the recent state data is lost. Suppose the sequencer crashes at block height 34123456 and the next block to build is 34123457: the prepareWork phase needs the state of block 34123456, but that state no longer exists, so prepareWork fails and the update logic is never reached. The process therefore remains blocked at getPayload indefinitely, unable to make progress.
The fix involves adding two routines to handle the recovery process and monitor the recovery progress. Upon a failure in the generate phase, a fix routine is started based on specific error conditions. To avoid blocking the main process, a separate routine is also initiated to monitor the specific block being repaired. Once the data recovery is complete, a retry of the update process is triggered, allowing the system to recover the state from before the crash and continue making progress.
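The sketch below shows one way the two routines could be wired together; fixManager, StartFix, repair, and retryUpdate are hypothetical names used only to illustrate the flow described above, not the actual implementation in this PR.

```go
package miner

import "sync"

// fixManager tracks whether a state repair is in flight (hypothetical name).
type fixManager struct {
	mu     sync.Mutex
	fixing bool
}

// StartFix is called when the generate phase fails because the parent block's
// state is missing. repair performs the actual recovery; retryUpdate re-enters
// the update path so the stalled getPayload can finally return.
func (f *fixManager) StartFix(target uint64, repair func(uint64) error, retryUpdate func()) {
	f.mu.Lock()
	if f.fixing {
		f.mu.Unlock()
		return // a repair is already running
	}
	f.fixing = true
	f.mu.Unlock()

	errc := make(chan error, 1)

	// Routine 1: recover the missing state without blocking the main loop.
	go func() {
		errc <- repair(target)
	}()

	// Routine 2: monitor the repair and retry the update once it succeeds.
	go func() {
		err := <-errc
		f.mu.Lock()
		f.fixing = false
		f.mu.Unlock()
		if err == nil {
			retryUpdate()
		}
	}()
}
```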
There are two scenarios for recovery: recovering from local data or from peers (for the sequencer, peers are its backup nodes). In most cases, the data can be recovered locally. However, there is a corner case where local recovery fails: if the sequencer has already gossiped the latest block to peer nodes but crashes during the local persistence process, the sequencer may fall behind the peer by one block. In this extreme situation, the sequencer must recover from the peer.
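A rough sketch of the two recovery paths follows, under the assumption that local recovery means replaying from the last persisted state and peer recovery means syncing the missing block from a backup node; the interface and method names are illustrative only.

```go
package miner

// recoveryBackend abstracts what the repair path needs from the node; the
// method names are illustrative, not the actual interface in this PR.
type recoveryBackend interface {
	HasLocalBlock(number uint64) bool     // is the block body still on disk?
	ReexecuteLocally(number uint64) error // rebuild the state from the last persisted root
	FetchFromPeer(number uint64) error    // sync the missing block from a backup node
}

// repairState prefers local recovery and falls back to a peer sync for the
// corner case where the crashed sequencer is one block behind its backups.
func repairState(b recoveryBackend, target uint64) error {
	if b.HasLocalBlock(target) {
		// Common case: only the recent state was lost with the journal,
		// so replaying locally is enough.
		return b.ReexecuteLocally(target)
	}
	// Corner case: the block was gossiped to peers but never persisted
	// locally; it has to be fetched from a backup node first.
	return b.FetchFromPeer(target)
}
```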
This is the complete sequencer recovery flow, as illustrated below.
Example
add an example CLI or API response...
Changes
Notable changes:
A fix_manager object has been added to the worker to manage the status and progress of the fix process.
Added a method to recover the state from local data.
Introduced a new downloader method in the Backend interface.
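For context, here is a sketch of what exposing the downloader through the miner Backend interface could look like; the concrete method set, types, and signature in the PR may differ.

```go
package miner

import (
	"github.com/ethereum/go-ethereum/core"
	"github.com/ethereum/go-ethereum/core/txpool"
	"github.com/ethereum/go-ethereum/eth/downloader"
)

// Backend wraps what the worker needs from the node; a Downloader accessor of
// this kind would let the fix routine trigger a sync from a backup peer when
// local recovery is impossible.
type Backend interface {
	BlockChain() *core.BlockChain
	TxPool() *txpool.TxPool
	Downloader() *downloader.Downloader // new accessor (illustrative)
}
```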