ethereum / go-ethereum

Go implementation of the Ethereum protocol
https://geth.ethereum.org

Block-building Latency issues and improvements #28242

Open protolambda opened 1 year ago

protolambda commented 1 year ago

So I've been looking closer into the engine-API block building code path, as part of optimizing the sequencing work with a 2-second block time in OP Mainnet and other op-geth based forks, and found several issues that I believe are worth improving in upstream geth.

The issues generally affect latency:

## Issues

~### Unused empty blocks~

~Geth builds an "empty" block to fall back to when block-building is stopped prematurely. But the actual engine-API always calls ResolveFull, which contains a payload.cond.Wait() (here) that waits for a payload building iteration to complete (first call to payload.update).~

~Waiting for the full payload means that the "empty" payload essentially becomes useless: it's not the default ready-to-go block that reduces the engine API latency, but rather a thing that's only considered in tests (with the regular Resolve call that does not wait for a full payload), and not useful in the real code-path. There may be an argument that a beacon proposer wants to be greedy, and give up on a block if it cannot include EL fees, but that should never happen if the block-building time is accurate.~

~Op-geth uses the "empty" feature to reproduce blocks (i.e. no-tx-pool, only pure inputs), and partially solves the above (here) by marking the initial empty payload as a valid full payload. Yet, the sequencer still ends up building the empty payload, and then having to rebuild another thing. This might be somewhat fast, but is unused processing work.~

~Note that while the "empty" block was originally "cheap" to build, it will now include withdrawals, and soon also a beacon-block-root EVM call. It's not "instant" anymore, and it then has to calculate the state root of the withdrawal and beacon-block-root storage changes.~

### Unused placeholder is synchronous

So even though the above empty block is unused, it is still built synchronously (forkchoiceUpdated calls api.eth.Miner().BuildPayload(args), which builds the empty block before spawning any goroutine), blocking the forkchoice-update call until it completes, even though the payload ID for the response could already have been constructed.

This makes the method more likely to time out, and it adds latency to the actual block-building work, which gets delayed behind it (as the empty payload is ignored).
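To make this concrete, here is a minimal sketch of the kind of change I have in mind (all names and signatures are illustrative stand-ins, not geth's actual miner API): derive the payload ID from the request arguments alone, return it right away, and do all of the building, including the no-longer-free placeholder, on a background goroutine.

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"fmt"
	"time"
)

// payloadJob is a stand-in for the real payload-building state.
type payloadJob struct {
	id   string
	done chan string // receives a label for each block variant as it is built
}

// buildPayload sketches the proposed shape: the payload ID only depends on
// the request arguments, so forkchoiceUpdated can respond before any block
// (even the "empty" placeholder) has been assembled.
func buildPayload(parentHash string, timestamp uint64) *payloadJob {
	sum := sha256.Sum256([]byte(fmt.Sprintf("%s/%d", parentHash, timestamp)))
	job := &payloadJob{id: hex.EncodeToString(sum[:8]), done: make(chan string, 2)}

	go func() {
		// Placeholder block: withdrawals + beacon-root processing, no txs.
		// Non-zero cost, but no longer on the FCU response path.
		time.Sleep(5 * time.Millisecond)
		job.done <- "empty"
		// Iterative tx filling, improving the payload until resolved or timed out.
		time.Sleep(50 * time.Millisecond)
		job.done <- "full"
	}()
	return job
}

func main() {
	job := buildPayload("0xparent", 1700000000)
	fmt.Println("FCU response can carry payload ID", job.id, "immediately")
	fmt.Println("later built:", <-job.done, "then", <-job.done)
}
```

In this shape, the FCU handler only needs the ID, so its latency no longer depends on how expensive the placeholder build has become.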

### Fill or timeout, no control

The Engine API gives little control over block-building timing, especially when building large blocks at a fast interval: to complete a block, the builder has to either fill it up to the gas limit or wait for the block-building timeout.

An engine_getPayload call to stop the block building is not respected immediately: it signals the iterating goroutine to stop (no next iteration), but does not stop it right there and then (no interrupt), and instead waits for the block building to time out (or fill with all gas).

And only when the payload being served is not the first full payload does it get returned to the user immediately. Otherwise the "stopped" payload build is still awaited until it completes "fully". This means a block response can be up to ~2 seconds late, since a full block can take that long to build: if block building started late into the slot, it is still on its first iteration, and then it is very unlikely to complete in time for the actual block proposal.

Being able to build blocks repeatedly and consistently at a fast block-building interval requires the building to stop when the request comes in, not drag on.

Block-building delays are likely to cascade into repeated block latency: a block that finishes late causes the next block to start late, which then also finishes late, and so on.
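As a rough sketch of what "stop when the request comes in" could look like (made-up names, not geth's miner internals): the fill loop checks an interrupt flag between transactions, so an engine_getPayload request ends the in-progress iteration and the builder seals whatever has been packed so far.

```go
package main

import (
	"fmt"
	"sync/atomic"
)

// Tx is a stand-in for a pending transaction.
type Tx struct{ gas uint64 }

// fillTransactions packs transactions until the gas limit is reached or the
// interrupt flag is raised, then returns whatever was included so far, so the
// caller can seal a partial block instead of waiting for fill-or-timeout.
func fillTransactions(pending []Tx, gasLimit uint64, interrupt *atomic.Bool) []Tx {
	var (
		included []Tx
		gasUsed  uint64
	)
	for _, tx := range pending {
		if interrupt != nil && interrupt.Load() {
			break // getPayload arrived: stop right here instead of dragging on
		}
		if gasUsed+tx.gas > gasLimit {
			break
		}
		included = append(included, tx)
		gasUsed += tx.gas
	}
	return included
}

func main() {
	interrupt := new(atomic.Bool)
	pending := []Tx{{gas: 21000}, {gas: 50000}, {gas: 120000}}

	// Elsewhere, the engine_getPayload handler would call interrupt.Store(true).
	txs := fillTransactions(pending, 30_000_000, interrupt)
	fmt.Printf("included %d txs before stopping\n", len(txs))
}
```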

### Slow block re-insertion

After an engine_forkchoiceUpdatedV call completes, the payload is retrieved with engine_getPayload (fill/timeout latency), and then has to be re-inserted into geth with engine_newPayload (adding more latency), which does not appear to share a cache with the block building, causing latency on the import. We can try to publish the payload before re-inserting it, but ideally this last step is mostly cached and does not take nearly as long as it currently seems to.
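To illustrate the kind of caching I mean (hypothetical types and names, not a concrete implementation proposal): keep locally built blocks in a small map keyed by block hash, so engine_newPayload can recognize a block this node produced itself and take a cheap import path.

```go
package main

import (
	"fmt"
	"sync"
)

// builtBlockCache remembers blocks this node assembled itself, keyed by the
// block hash, so a later newPayload for the same hash can skip re-execution.
type builtBlockCache struct {
	mu     sync.Mutex
	blocks map[string][]byte // hash -> encoded block (stand-in for the real type)
}

func newBuiltBlockCache() *builtBlockCache {
	return &builtBlockCache{blocks: make(map[string][]byte)}
}

func (c *builtBlockCache) put(hash string, block []byte) {
	c.mu.Lock()
	defer c.mu.Unlock()
	c.blocks[hash] = block
}

func (c *builtBlockCache) get(hash string) ([]byte, bool) {
	c.mu.Lock()
	defer c.mu.Unlock()
	b, ok := c.blocks[hash]
	return b, ok
}

func main() {
	cache := newBuiltBlockCache()
	cache.put("0xabc", []byte("block produced by the local payload builder"))

	// newPayload handler: if we built this block ourselves, the state is
	// already computed, so import it without re-executing every transaction.
	if _, ok := cache.get("0xabc"); ok {
		fmt.Println("own block: cheap import path")
	} else {
		fmt.Println("foreign block: full validation")
	}
}
```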

## Proposed solution

I'm interested in implementing the above changes, and plan to open a draft PR for these. Feedback/corrections welcome!

karalabe commented 1 year ago

Very nice writeup, thanks!

Empty blocks being non-zero cost is an interesting new issue. I haven't thought about it until now, but you are right that withdrawal inclusions are non-zero cost now.

Wrt the empty block being unused: are you sure? We are calling ResolveFull only for tests AFAIK. Otherwise we pass in a full parameter, which is false in all three GetPayloadV1/2/3.
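Roughly, the shape of that logic, as a condensed sketch rather than the actual payload_building.go code: Resolve serves the best block available right now, falling back to the empty one, while ResolveFull blocks until the first full building iteration has finished.

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

// Condensed sketch of the Resolve/ResolveFull split; not the real
// miner/payload_building.go code, just the shape of it.
type payloadSketch struct {
	mu    sync.Mutex
	cond  *sync.Cond
	empty string // placeholder block, built once up front
	full  string // best full block so far, set by each building iteration
}

func newPayloadSketch() *payloadSketch {
	p := &payloadSketch{empty: "empty-block"}
	p.cond = sync.NewCond(&p.mu)
	return p
}

func (p *payloadSketch) update(full string) {
	p.mu.Lock()
	p.full = full
	p.mu.Unlock()
	p.cond.Broadcast()
}

// resolve(false) never waits: it serves the best full block if one exists,
// otherwise the empty placeholder. resolve(true) blocks on the condition
// variable until the first building iteration has delivered a full block.
func (p *payloadSketch) resolve(onlyFull bool) string {
	p.mu.Lock()
	defer p.mu.Unlock()
	for onlyFull && p.full == "" {
		p.cond.Wait()
	}
	if p.full != "" {
		return p.full
	}
	return p.empty
}

func main() {
	p := newPayloadSketch()
	go func() {
		time.Sleep(20 * time.Millisecond)
		p.update("full-block")
	}()
	fmt.Println(p.resolve(false)) // getPayload path with full=false: returns right away
	fmt.Println(p.resolve(true))  // ResolveFull path: waits for a full block
}
```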

Wrt constructing the empty block blocking FCU: I guess yes, it will block. The idea here was that creating an empty block is mostly free, so it's a good invariant that there's always something ready to feed to getPayload. With empty blocks costing non-zero, I could imagine making this in the background.

For the timeout stuff, I'm unsure your premise is correct. You are assuming we always request full, whereas I understand the code to not request a full block. In that case, it would return the empty block and should always work. Now, we could argue that aborting the current execution and returning a partial block is better than the empty one. Still, aborting a running block might entail a longer teardown time even if I abort EVM execution, since the withdrawals have to be applied at the end (per your original observation).

Wrt block reinsertion, that's interesting. IMO it would make sense to automatically insert any block created locally, since the head only updates on FCU anyway. Though here one interesting thing to explore is whether we want to insert just the block that was get-ed (which would mean we have a very small window between get and newPayload to actually insert it), or whether we should just branch out our tree with everything mined (i.e. if I make 1 empty and 3 full blocks, branch out all 4 and then pick one on fcu, pruning the rest eventually).

karalabe commented 1 year ago

@protolambda PTAL https://github.com/ethereum/go-ethereum/pull/28257, it's a stab at the last part of the issue, the slow reinsertion.

EDIT: By stab I mean experimental PoC. It may be the solution, but I have no way to prove it based on my workload.

karalabe commented 1 year ago

@protolambda If you still care about this issue, please respond. There are too many open questions to do anything.

protolambda commented 1 year ago

@karalabe thanks for the reminder, I'm still interested. I'll take a look at the changes :+1:

protolambda commented 1 year ago

@karalabe

> Wrt the empty block being unused: are you sure?

Thanks for sanity-checking; the empty block is indeed used when necessary, I had missed the Resolve/ResolveFull if/else. I updated the issue description to strike out the invalid part.

> With empty blocks costing non-zero, I could imagine making this in the background.

This would be nice. Shall I implement this and open a separate PR for it?

> Now, we could argue that aborting the current execution and returning a partial block is better than the empty one. Still, aborting a running block might entail a longer teardown time even if I abort EVM execution, since the withdrawals have to be applied at the end (per your original observation).

Shall I open another PR for a flag that allows partial block-building to complete, to avoid large-block timeouts while still including as many txs as possible? I think it can be beneficial to L1 when the node is overloaded and tx-processing is slower than expected.

> Though here one interesting thing to explore is whether we want to insert just the block that was get-ed (which would mean we have a very small window between get and newPayload to actually insert it), or whether we should just branch out our tree with everything mined (i.e. if I make 1 empty and 3 full blocks, branch out all 4 and then pick one on fcu, pruning the rest eventually).

I think it's fine to assume only the highest-fee payload is worth preparing. But since that's not known upfront, it may be too expensive to flush each one as a local block. Instead, maybe the uncommitted block-building results can be retained in some in-memory cache? (or is the local-block thing optimized like that already?)