Open 0x00101010 opened 1 month ago
The timeout stops state update correctly.
Thank you for filing this bug report; this information is very helpful.
The `nextActionOK` handling in the sequencer should be improved, I will look into what can be done there.
The leadership transfer is more difficult: with async-gossip, when the payload has been published already, it becomes the canonical block required to continue sequencing. If other replicas don't pick up on this block via p2p before getting the leadership transfer, then a re-attempt of leadership-transfer may be useful, but ultimately it can flake. If the op-conductor is able to insert this committed payload content, then it can recover from the missing-block case. Even then, the replica does need to make it a canonical block. If it's timing out / not doing a final forkchoice update, then something in the block-processing itself is wrong or is hitting a performance issue, and that then needs further investigation before we can fix it.
Regarding leadership transfer:
`ForkChoiceUpdateRequest` later. => An immediate thought is to allow the conductor to bypass the check (keeping other behavior the same), but I'm not sure that is the most elegant solution.
@protolambda Hey proto, curious if you've had time to take a look at this?
Bug Description
Base Sepolia experienced a ~1 hour chain halt on Oct 14th, 2024. Upon investigation, the sequence of events is listed below:
16571938
failed to insert execution payload: failed to execute payload: context deadline exceeded
16571937, which does not match 16571938
cannot add duplicate payload 0xb7d2dea272494a387e4dc447b681dfd42f043e360a2a51814ecce260eed85a1c:16571938
16571938
Steps to Reproduce
With an HA setup, construct a large block that takes longer than 10s to process (the timeout of NewPayload in v1.9.3).
Environment Information:
op-node: v1.9.2
Issues / Expected behavior
There are 2 issues here:
`sequencer.nextActionOK` might still be false and `sequencer.nextAction` might still hold a previous time after resuming sequencing under some circumstances, such as this case, where the next payload is restored from Raft consensus.