op-node: issue with sequencer stop / start behavior

0x00101010 commented 1 month ago

Bug Description

Base Sepolia experienced a ~1 hour chain halt on Oct 14th, 2024. Upon investigation, the sequence of events is listed below.

HA1 is the leader and is sequencing
[06:22:44] Upon working on block 16571938
- it timed out [06:22:49] trying to insert block into op-geth after 5 seconds
- log shows failed to insert execution payload: failed to execute payload: context deadline exceeded
- but it did gossip out the unsafe block on the op-node side, meaning HA2 & HA3 have the latest block in op-node and will try to insert it into op-geth themselves (however, that will time out as well)
[06:22:49] HA1 became unhealthy, and leadership transferred to HA2
[06:22:49] upon becoming leader, HA2 tries to start sequencer locally, however it cannot because of chain head hash mismatch
- in HA2 op-geth, it has block 16571937, which does not match 16571938
- conductor tries to get the payload from the consensus layer and post the 16571938 payload to op-node again; however, it was rejected immediately because HA2 already has that unsafe payload
- cannot add duplicate payload 0xb7d2dea272494a387e4dc447b681dfd42f043e360a2a51814ecce260eed85a1c:16571938
- HA2 also timed out trying to insert 16571938 into op-geth. Because of the timeout, the later ForkChoiceUpdate call was not executed, so the chain head was not updated to 16571938
[06:22:49-06:22:56] HA2 continues to retry starting sequencer, but cannot (because latest block hash is not updated in op-geth)
- upon becoming unhealthy itself, HA2 initiates leadership transfer again
[06:22:56] HA1 became leader again
- Since HA1 was previously the block builder, it does not have 16571938 in its own payloadsQueue, so conductor was able to post the unsafe payload to op-node again and trigger ForkChoiceRequestEvent to update the chain head
- However after the chain head is updated, the sequencer does not schedule another sequencing step any more

Steps to Reproduce

With an HA setup, construct a large block that takes longer than 10s to insert (the NewPayload timeout in v1.9.3)

Environment Information:

op-node: v1.9.2

Issues / Expected behavior

There are 2 issues here:

sequencer.nextActionOK might still be false, and sequencer.nextAction might still hold a previous time, after sequencing resumes under some circumstances, such as this case where the next payload is restored from Raft consensus.
- There are likely more places where this needs to be set, but the concern right now is that this generally feels brittle, and any new change to the deriver's event pipeline could result in a similar bug in the future
If NewPayload or a similar interaction with op-geth times out, a leadership transfer will happen, and only the current sequencer (in this case HA1) will be able to get past it (once leadership is eventually transferred back to it), because it does not have the unsafe block in its payloadsQueue.
- All the other sequencers fail to become sequencer
- All the other sequencers get the unsafe payload from the CL and try to insert it => timeout, forkchoice update not called
- Once elected leader, conductor tries to start sequencing; however, latest block != local block (because NewPayload timed out, so the forkchoice update was not called). Conductor tries to get the unsafe payload from Raft again and post it to op-node, but it is rejected immediately because the payload is already in the payloadsQueue

threewebcode commented 1 month ago

The timeout stops state update correctly.

protolambda commented 1 month ago

Thank you for filing this bug report, this information is very helpful.

The nextActionOK handling in the sequencer should be improved, I will look into what can be done there.

The leadership transfer is more difficult: with async-gossip, when the payload has been published already, it becomes the canonical block required to continue sequencing. If other replicas don't pick up on this block via p2p before getting the leadership transfer, then a re-attempt of leadership-transfer may be useful, but ultimately it can flake. If the op-conductor is able to insert this committed payload content, then it can recover from the missing-block case. Even then, the replica does need to make it a canonical block. If it's timing out / not doing a final forkchoice update, then something in the block-processing itself is wrong or is hitting a performance issue, and that then needs further investigation before we can fix it.

0x00101010 commented 1 month ago

Regarding leadership transfer:

conductor did indeed try to insert this committed payload content into op-node if it does not have it already (only for the latest block)
Currently the issue here is not the timeout: the replica has already received the p2p payload and tried to insert it (which timed out). The issue is that when op-conductor tries to insert the payload again, the call returns immediately because the payload is already inside the payloadsQueue, without calling ForkChoiceUpdateRequest afterwards. => An immediate thought is to allow conductor to bypass the check (keeping other behaviors the same), but I am not sure if that is the most elegant solution.

0x00101010 commented 1 month ago

@protolambda Hey proto, curious if you got time to take a look at this?

ethereum-optimism / optimism

op-node: issue with sequencer stop / start behavior #12448