op-node: Improve shutdown behavior

sebastianst commented 10 months ago

Add (more) info-logging on shutdown
Revisit order in which components are shut down
- e.g. the we may want to stop sequencing before stopping p2p distribution of blocks
Fix https://github.com/ethereum-optimism/optimism/issues/8086

protolambda commented 10 months ago

op-node (and indexer / batcher) don't use BlockOnInterrupts anymore intentionally.

Instead, there's ctx := opio.WithInterruptBlocker(context.Background()) in the main.go to register (attached as retrievable value to the ctx, not immediately affecting any context cancellation) a single interrupt receiver that can then be shared between all CLI functions that run.

The sub/main-commands can then utilize the context either with ctx := opio.CancelOnInterrupt(ctx)(for very basic single-threaded termination) or for larger lifecycles, like op-node, the cliapp.LifecycleCmd will handle it.

The idea here is that instead of attaching a system-signal based blocker to the context, you can also attach an artificial blocker, to fake an interrupt on a CLI app without having to run sub-processes; great for testing and potentially app composition (with mocktimism wrapping CLI apps).

Note that this is also better than we had previously, because the interrupt receiver is registered on the main thread, almost as early as possible, before the CLI app.RunContext is called: we don't risk receiving an unhandled system signal and panicing. And it's the same consistent interrupt-channel, so we don't miss any signal when running one command after the other, nor if we do things like a 2nd-interrupt to request faster less-graceful shutdown (like op-node and op-batcher support).

sebastianst commented 10 months ago

Ah nice, thank's for clearing that up, TIL! Will edit to remove the opio suggestion.

sebastianst commented 10 months ago

Attempt to fix https://github.com/ethereum-optimism/optimism/issues/8086 by @ajsutton at https://github.com/ethereum-optimism/optimism/pull/8128

mslipper commented 9 months ago

Will be closed in https://github.com/ethereum-optimism/optimism/pull/8169 once Adrian updates to the latest op-geth version.

sebastianst commented 9 months ago

We should still address the first two bullets independently, which are not fixed by https://github.com/ethereum-optimism/optimism/pull/8169. E.g. we're still sequencing after p2p is shut down and we're missing shutdown logs.

sebastianst commented 1 month ago

After going with @anacrolix over the code, we identified one minor improvement that would mostly solve this: in the shutdown path, if sequencing is active, stop sequencing before shutting down the p2p stack.

anacrolix commented 1 month ago

I'm currently working on that. I have a potential PR from poking around the interrupt handling too.

anacrolix commented 1 month ago

I found a bunch of bugs in the peer discovery while testing this. I'll have a PR for that.

I modified op-node to stop sequencing before closing p2p. However the driver still tries to fetch L2 blocks from p2p after sequencing is stopped, I'm not sure this will work. Is it possible just to stop the driver completely instead? Or even the sequencer, then the driver, then the p2p?

sebastianst commented 1 month ago

Wouldn't it still be an improvement to just stop sequencing before shutting down the p2p stack? A sequencer would this way just behave like a normal node during shutdown from that point on. Probably more reordering improvements can be done, as you suggested, e.g. to shut down p2p later.

anacrolix commented 1 month ago

I don't quite follow. My testing shows p2p is still used until the driver is shutdown. So if it's okay to stop sequencing, then the driver, then p2p that should work its current form, but then do you really gain anything from shutting down sequencing first when you could just shut down the driver and get both for one?

ethereum-optimism / optimism

op-node: Improve shutdown behavior #10936