Recovery from sequence window expiration incident

ImTei commented 1 month ago

High-level description

If op-batcher stops submitting batches long enough for the sequence window to expire, op-node starts generating empty blocks. Even after op-batcher becomes operational again, it is hard to recover batch submission and chain derivation because new batches are likely to be dropped. This issue describes the situation in detail and proposes a solution.

Details of the incident

1. For some reason, batches are not submitted for a while and the sequence window expires.
2. op-node generates empty batches and the safe head advances with empty blocks. A chain reorg occurs.
3. The sequencer builds new blocks on top of the generated empty blocks.
4. op-batcher builds the next batch from its current safe head and submits it.
5. While the new batch is being built and submitted to L1, op-node generates the next empty blocks and the chain reorgs again.
6. The new batch is dropped for one of the following reasons:
   i. If the batch contains non-empty blocks, those blocks are no longer canonical once the new empty blocks exist.
   ii. If the batch contains only empty blocks and is a span batch, the first block of the span batch is already past the sequence window, so the entire batch is dropped.

Repeat 2~6
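
The span batch drop described above can be sketched in Go as follows. This is a minimal illustration, not op-node code: the function names, the window size of 3600, and the block numbers are all assumptions made for the example. It models only the timeliness check (a batch is too late when its L1 origin number plus the sequence window size is below the L1 block that included it) and ignores every other batch validity rule.

```go
package main

import "fmt"

// pastSequenceWindow reports whether a batch with L1 origin number epochNum
// was included on L1 too late. Hypothetical helper for illustration.
func pastSequenceWindow(epochNum, seqWindowSize, inclusionBlock uint64) bool {
	return epochNum+seqWindowSize < inclusionBlock
}

// spanBatchDropped applies the check to the first epoch of a span batch.
// If the first block is too old, the whole span is rejected.
func spanBatchDropped(epochs []uint64, seqWindowSize, inclusionBlock uint64) bool {
	return len(epochs) > 0 && pastSequenceWindow(epochs[0], seqWindowSize, inclusionBlock)
}

// timelyEpochs applies the check to each epoch independently, as would happen
// if every epoch were submitted as its own singular batch.
func timelyEpochs(epochs []uint64, seqWindowSize, inclusionBlock uint64) []uint64 {
	var timely []uint64
	for _, e := range epochs {
		if !pastSequenceWindow(e, seqWindowSize, inclusionBlock) {
			timely = append(timely, e)
		}
	}
	return timely
}

func main() {
	// Illustrative values only.
	const seqWindow, inclusion = 3600, 3703
	epochs := []uint64{100, 101, 102, 103, 104}

	fmt.Println("span batch dropped:", spanBatchDropped(epochs, seqWindow, inclusion))
	fmt.Println("epochs still timely as singular batches:", timelyEpochs(epochs, seqWindow, inclusion))
}
```

With these numbers the span batch is rejected as a whole because its first epoch is too old, while singular batches for the later epochs still pass the timeliness check. This is why the manual recovery below switches op-batcher to singular batches.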

Currently, we have to take the following steps manually to recover the chain from the incident:

1. Block new user TX submissions.
2. Empty the sequencer's TX pool.
3. Run op-batcher in singular batch mode until chain derivation has recovered.

We could add these steps to the runbook, but we could also improve the system to automate them.

Solution

1. Define a new op-node state that indicates "the sequence window has expired and empty batches are being generated". Let's call this incident mode for now.
2. Incident mode is enabled when op-node generates an empty block and disabled when op-node derives a new block from an L1 batch.
3. While op-node is in incident mode, the sequencer builds empty blocks by setting NoTxPool to true.
4. Incident mode is exposed as a boolean value in the optimism_syncStatus RPC response.
5. op-batcher can check whether op-node is in incident mode via the syncStatus RPC. If it is, op-batcher builds singular batches even when it is configured for span batches. (Alternatively, it could build a span batch that starts far enough from the current safe head to avoid sequence window expiration.)
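
To make the proposal concrete, here is a minimal Go sketch of the incident mode state and the batcher's decision. Every type and function name below is hypothetical and chosen for illustration; none of them exist in op-node or op-batcher, and the real change would need to hook into the derivation pipeline, the sequencer, and the syncStatus RPC.

```go
package main

import "fmt"

// BlockSource says how a safe block was produced (hypothetical type).
type BlockSource int

const (
	FromL1Batch        BlockSource = iota // derived from a batch posted to L1
	AutoGeneratedEmpty                    // generated because the sequence window expired
)

// IncidentTracker holds the proposed incident mode flag (hypothetical type).
type IncidentTracker struct {
	incident bool
}

// OnSafeBlock updates the flag: enabled on a generated empty block,
// disabled once a block is derived from an L1 batch again.
func (t *IncidentTracker) OnSafeBlock(src BlockSource) {
	switch src {
	case AutoGeneratedEmpty:
		t.incident = true
	case FromL1Batch:
		t.incident = false
	}
}

// InIncidentMode is the boolean that would be reported via syncStatus.
func (t *IncidentTracker) InIncidentMode() bool { return t.incident }

// NoTxPool is the value the sequencer would use when building a block.
func (t *IncidentTracker) NoTxPool() bool { return t.incident }

// BatchType is the batch format used by the batcher (hypothetical type).
type BatchType int

const (
	SingularBatch BatchType = iota
	SpanBatch
)

// chooseBatchType forces singular batches while the node reports incident mode.
func chooseBatchType(configured BatchType, incidentMode bool) BatchType {
	if incidentMode {
		return SingularBatch
	}
	return configured
}

func main() {
	tracker := &IncidentTracker{}

	tracker.OnSafeBlock(AutoGeneratedEmpty)
	fmt.Println("incident:", tracker.InIncidentMode(),
		"singular forced:", chooseBatchType(SpanBatch, tracker.InIncidentMode()) == SingularBatch)

	tracker.OnSafeBlock(FromL1Batch)
	fmt.Println("incident:", tracker.InIncidentMode(),
		"singular forced:", chooseBatchType(SpanBatch, tracker.InIncidentMode()) == SingularBatch)
}
```

Keeping the flag as a single boolean driven only by how the latest safe block was produced keeps the state easy to reason about, at the cost of flipping back as soon as one batch is derived from L1.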

Discussion

This change could automate incident recovery for OP Stack chains, but it may be a bit risky because it touches many important features, such as sequencing and batch submission. Because this incident is very unlikely, we could instead consider a more manual way to recover the system.

emilianobonassi commented 1 month ago

Thanks @ImTei for tracking the issue on this edge case we found!

I agree this might not be handled automatically in the first instance.

I think this could be a good opportunity to spin up a playbook/runbook section in the docs that describes this and other scenarios that might arise, along with best practices (at Conduit we have these internally).

Another example, blob congestion => switch to calldata, make a blob transaction type to cancel the pending one (see LZ airdrop).

sebastianst commented 3 weeks ago

> Another example, blob congestion => switch to calldata, make a blob transaction type to cancel the pending one (see LZ airdrop).

@emilianobonassi This is already done automatically by the batcher since https://github.com/ethereum-optimism/optimism/pull/10941
