ethereum-optimism / optimism

Optimism is Ethereum, scaled.
https://optimism.io
MIT License

op-node: issue with sequencer stop / start behavior #12448

Open 0x00101010 opened 1 month ago

0x00101010 commented 1 month ago

Bug Description

Base Sepolia experienced a ~1 hour chain halt on Oct 14th, 2024. Upon investigation, the sequence of events is listed below.

Steps to Reproduce

With an HA setup, construct a large block that takes longer than 10s to process (the timeout of NewPayload in v1.9.3).

Environment Information:

op-node: v1.9.2

Issues / Expected behavior

There are 2 issues here:


threewebcode commented 1 month ago

The timeout stops state update correctly.

protolambda commented 1 month ago

Thank you for filing this bug report, this information is very helpful.

The nextActionOK handling in the sequencer should be improved, I will look into what can be done there.

The leadership transfer is more difficult: with async-gossip, when the payload has been published already, it becomes the canonical block required to continue sequencing. If other replicas don't pick up on this block via p2p before getting the leadership transfer, then a re-attempt of leadership-transfer may be useful, but ultimately it can flake. If the op-conductor is able to insert this committed payload content, then it can recover from the missing-block case. Even then, the replica does need to make it a canonical block. If it's timing out / not doing a final forkchoice update, then something in the block-processing itself is wrong or is hitting a performance issue, and that then needs further investigation before we can fix it.

0x00101010 commented 1 month ago

Regarding leadership transfer:

  1. The conductor did indeed try to insert this committed payload content into op-node if op-node does not already have it (only for the latest block).
  2. The current issue is not the timeout itself. op-node had already received the payload via p2p and tried to insert it, and that attempt timed out. The problem is that when op-conductor tries to insert the payload again, the payload is already inside the payloadsQueue, so the call returns immediately without calling ForkChoiceUpdateRequest later. An immediate thought is to allow the conductor to bypass that check (keeping other behaviors the same), but I'm not sure that is the most elegant solution.
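To illustrate point 2, here is a toy Go model of the behavior. It is a sketch under assumptions, not the op-node implementation: the `engine` type, `enqueueOnly`, `insertPayload`, and the `force` flag are all hypothetical names, and the `queued` map merely stands in for the payloadsQueue.

```go
package main

import "fmt"

// payloadID stands in for a block hash (hypothetical simplification).
type payloadID string

// engine is a toy model of the sequencer-side state, not op-node's real type.
type engine struct {
	queued    map[payloadID]bool // stands in for the payloadsQueue
	canonical payloadID          // current canonical head
}

func newEngine(head payloadID) *engine {
	return &engine{queued: map[payloadID]bool{}, canonical: head}
}

// enqueueOnly models the first p2p insert that timed out: the payload
// is remembered in the queue, but the forkchoice update never happened.
func (e *engine) enqueueOnly(id payloadID) {
	e.queued[id] = true
}

// insertPayload models the dedupe check described above. Without force,
// an already-queued payload returns early and the head is never updated.
// With force (the hypothetical conductor bypass), the update still runs.
func (e *engine) insertPayload(id payloadID, force bool) bool {
	if e.queued[id] && !force {
		return false // early return: no forkchoice update
	}
	e.queued[id] = true
	e.canonical = id // stands in for the forkchoice update
	return true
}

func main() {
	e := newEngine("parent")
	e.enqueueOnly("blockA") // p2p insert timed out before the forkchoice update

	fmt.Println("conductor insert applied:", e.insertPayload("blockA", false))
	fmt.Println("head after normal insert:", e.canonical)

	fmt.Println("forced insert applied:", e.insertPayload("blockA", true))
	fmt.Println("head after forced insert:", e.canonical)
}
```

In this model the second, non-forced insert leaves the head at the parent block, which matches the stuck state described above. The forced path shows the bypass idea only; whether a flag like this is the right design is the open question.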

0x00101010 commented 1 month ago

@protolambda Hey proto, curious if you got time to take a look at this?