Base Sepolia experienced a ~1 hour chain halt on Oct 14th, 2024. Upon investigation, the sequence of events is listed below.
HA1 is the leader and is sequencing
[06:22:44] HA1 starts working on block 16571938
[06:22:49] it timed out trying to insert the block into op-geth after 5 seconds
the log shows "failed to insert execution payload: failed to execute payload: context deadline exceeded"
but it did gossip out the unsafe block on the op-node side, meaning HA2 & HA3 have the latest block in op-node and will try to insert it into op-geth themselves (however, that will time out as well)
[06:22:49] HA1 became unhealthy, and leadership transferred to HA2
[06:22:49] upon becoming leader, HA2 tries to start the sequencer locally, but it cannot because of a chain head hash mismatch
HA2's op-geth has block 16571937, which does not match 16571938
conductor tries to get the payload from the consensus layer and post the 16571938 payload to op-node again
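however, it was rejected immediately with "cannot add duplicate payload 0xb7d2dea272494a387e4dc447b681dfd42f043e360a2a51814ecce260eed85a1c:16571938", because HA2 already has that unsafe payload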
HA2 also timed out trying to insert 16571938 into op-geth; because of the timeout, the subsequent ForkChoiceUpdate call was not executed, so the chain head was not updated to 16571938
[06:22:49-06:22:56] HA2 keeps retrying to start the sequencer, but cannot (because the latest block hash is not updated in op-geth)
upon becoming unhealthy itself, HA2 initiates a leadership transfer again
[06:22:56] HA1 became leader again
Since HA1 was previously the block builder, it does not have 16571938 in its own payloadsQueue, so conductor was able to post the unsafe payload to op-node again and trigger a ForkChoiceRequestEvent to update the chain head
However, after the chain head is updated, the sequencer does not schedule another sequencing step
Steps to Reproduce
With an HA setup, construct a large block that takes longer than 10s (the NewPayload timeout in v1.9.3)
Environment Information:
op-node: v1.9.2
Issues / Expected behavior
There are 2 issues here:
sequencer.nextActionOK might still be false, and sequencer.nextAction might still hold a previous time, after sequencing resumes under some circumstances, such as this case where the next payload is restored from Raft consensus.
There are likely more places where this needs to be set, but the concern right now is that this generally feels brittle, and any new change to the deriver's event pipeline could result in a similar bug in the future.
If NewPayload or a similar interaction with op-geth times out, a leadership transfer will happen, and only the sequencer that was leading at the time (in this case HA1) will be able to get past it (once leadership is eventually transferred back to it), because it does not have the unsafe block in its payloadsQueue.
All the other sequencers fail to become sequencer:
all the other sequencers get the unsafe payload from the CL and try to insert it => timeout, forkchoice update not called
once elected leader, conductor tries to start sequencing, however, latest block != local block (because the new payload timed out, so the forkchoice update was not called)
conductor tries to get the unsafe payload from Raft again and post it to op-node, but it is rejected immediately because the payload is already in the payloadsQueue