ethereum-optimism / optimism

Optimism is Ethereum, scaled.
https://optimism.io
MIT License

Sequencer halting on temporary error #12240

Open sebastianst opened 1 month ago

sebastianst commented 1 month ago

From an internal report by @mdehoog:

I'm running an L3 sequencer locally (using Base Sepolia as the "L1") and noticed that block building halts pretty consistently, after the following logs:

t=2024-10-01T23:05:36+0000 lvl=info msg="Started sequencing new block" parent=0xe3411af59e2177377734ab199d2a8698f07db2a4cb11fdf80dbf228c21ef6024:209 l1Origin=0x74f255a508a8219038aa95e56e39de5aaf7b3b5781b9033b404bcd53a7d1c266:16027824
t=2024-10-01T23:05:36+0000 lvl=warn msg="Engine temporary error" err="temp: failed to fetch L1 block info and receipts: querying block: not found"
t=2024-10-01T23:05:36+0000 lvl=debug msg="Engine reported temporary error, but sequencer is not using engine" err="temp: failed to fetch L1 block info and receipts: querying block: not found"

After this temporary error, there are no more "Sequencer action" logs. I'm a little nervous about this happening in prod. Can anyone explain why we don't schedule another action in this conditional? https://github.com/ethereum-optimism/optimism/blob/73038c881b48a591c216c880d946f41efb185a32/op-node/rollup/sequencing/sequencer.go#L385-L386

My initial take: when the sequencer enters startBuildingBlock, the building state in d.latest has already been cleared by a previous d.onPayloadSuccess. It then hits a temp error and returns without setting any field of the d.latest BuildingState, since that only happens at the very end, right before emitting a BuildStartEvent. The temp error then lands in onEngineTemporaryError, which checks whether d.latest holds a non-zero BuildingState to decide whether the "sequencer is using the engine". Because the state is still clear, it returns early, so no future action is scheduled.
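To make the suspected sequence concrete, here is a minimal toy model in Go. It is not the op-node code: the `sequencer` and `buildingState` types, the `rescheduleWhenIdle` flag, the one-second retry delay, and the assumption that the pending action is consumed when building starts are all hypothetical simplifications. It only illustrates how an early return on a zero building state can leave nothing scheduled, and how rescheduling in that branch would avoid the stall.

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

// buildingState is a hypothetical stand-in for the state kept in d.latest.
// The zero value means "no block is being built".
type buildingState struct {
	parent string
}

// sequencer is a toy model, not the op-node Sequencer type.
type sequencer struct {
	latest     buildingState
	nextAction time.Time // zero value: no action is scheduled

	// rescheduleWhenIdle toggles a hypothetical fix: schedule a retry even
	// when the temporary error arrives while latest is still zero.
	rescheduleWhenIdle bool
}

var errTemp = errors.New("temp: failed to fetch L1 block info and receipts")

// startBuildingBlock models the failing path. Assumption: the pending action
// is consumed on entry, and latest is only populated at the very end.
func (s *sequencer) startBuildingBlock(parent string, fetchL1 func() error) error {
	s.nextAction = time.Time{}
	if err := fetchL1(); err != nil {
		return err // latest is still the zero value here
	}
	s.latest = buildingState{parent: parent}
	return nil
}

// onEngineTemporaryError models the conditional discussed above.
func (s *sequencer) onEngineTemporaryError(now time.Time) {
	if s.latest == (buildingState{}) && !s.rescheduleWhenIdle {
		// "sequencer is not using engine": return without scheduling anything.
		return
	}
	s.latest = buildingState{}
	s.nextAction = now.Add(time.Second) // retry delay is an arbitrary choice
}

// simulate runs one failed build attempt and reports whether a follow-up
// action is scheduled afterwards.
func simulate(rescheduleWhenIdle bool) bool {
	now := time.Unix(1727823936, 0)
	s := &sequencer{rescheduleWhenIdle: rescheduleWhenIdle}
	if err := s.startBuildingBlock("0xe341", func() error { return errTemp }); err != nil {
		s.onEngineTemporaryError(now)
	}
	return !s.nextAction.IsZero()
}

func main() {
	fmt.Println("early return, action scheduled:", simulate(false))
	fmt.Println("reschedule when idle, action scheduled:", simulate(true))
}
```

In this model the first call leaves the sequencer with nothing scheduled, which matches the observed absence of further "Sequencer action" logs, while the second always schedules a retry. Whether the real fix should reschedule unconditionally or distinguish the two cases is a design question this sketch does not answer.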

bearpebble commented 1 month ago

Hey @sebastianst, your analysis is correct. I also reported this about two weeks ago btw :grimacing: See the issue description for an easy way to reproduce it as well https://github.com/ethereum-optimism/optimism/issues/12041

sebastianst commented 1 month ago

Hey @bearpebble, sorry your report got missed! I've attempted a simple fix with https://github.com/ethereum-optimism/optimism/pull/12258 and created an op-node docker image with tag v1.9.4-dev.0. Feel free to also chime in on the discussion on Discord.

emilianobonassi commented 1 month ago

To add, we've also been experiencing sequencer halts when L1 becomes unavailable for longer than the sequencer drift time, with an L1TemporaryErrorEvent being emitted.

Not sure if it's being masked by an engine temporary error. Will try to reproduce using the latest build by @sebastianst.