Nodes can't recover from large L2 unsafe head gaps using p2p req resp sync

anacrolix commented 1 week ago

op-node requests the blocks between its current head and the L2 unsafe head using the req resp (request-response) "alt sync" protocol if those blocks don't arrive via gossip. When network or service issues stall gossip for more than roughly a minute, incoming gossip is rejected or doesn't contain the blocks needed to catch up. Nodes then enter a pathological cycle: they cannot obtain the blocks they need if most of the peers they are connected to don't have those blocks either. This is particularly bad when the sequencer becomes unavailable to the rest of the network, because it continues to produce blocks even though no other nodes are connected. When connectivity resumes, all other nodes are behind.

In the req resp arrangement, the "client" is the requester and the "server" is the peer receiving the request. The current req resp algorithm requests blocks from peers at random, and has several undesirable properties:

Servers are scored down if they rate limit the client. This is particularly problematic if the server is the only node with the blocks we need.
Clients penalize servers if the client runs out of quarantine space. (Quarantine space is used to buffer blocks that aren't yet ready to be sent to the driver.)
Clients do not retry blocks that aren't available from the first server they try. Instead, the driver eventually times out and might send out another batch of requests.
Requests are sent in the order in which blocks can be "promoted" to the driver, but there's no guarantee that responses arrive in that order. The received blocks can overflow the quarantine cache and the driver processing queue, causing servers to be downscored and blocks to be wasted.
Changes in the block range in the "gap" do not flush the trusted block hashes. So in the event of a reorg, the sync client (the process that manages the req resp exchange with peers) will continue to promote blocks that should no longer be trusted.
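
To make the direction concrete, here is a minimal Go sketch of two of the behaviours the list above argues for: trying the next peer when a server is rate limiting or doesn't have a block, without scoring it down, and flushing trusted hashes whenever the gap range changes. Every name here (Peer, FetchBlock, TrustedHashes, the error values) is hypothetical and for illustration only; none of it is op-node's actual API or the code in the PR.

```go
package main

import (
	"errors"
	"fmt"
)

// Hypothetical error classes a req resp server might return.
// These are assumptions for the sketch, not op-node's real error types.
var (
	ErrRateLimited = errors.New("rate limited")
	ErrNotFound    = errors.New("block not found")
)

// Peer is a hypothetical stand-in for a remote req resp server.
type Peer struct {
	ID      string
	Blocks  map[uint64]string // block number -> payload
	Limited bool              // true if the server is currently rate limiting us
	Score   int
}

// Request simulates asking the server for a single block by number.
func (p *Peer) Request(num uint64) (string, error) {
	if p.Limited {
		return "", ErrRateLimited
	}
	b, ok := p.Blocks[num]
	if !ok {
		return "", ErrNotFound
	}
	return b, nil
}

// FetchBlock tries each peer in turn. Rate limiting and missing blocks are
// treated as "try the next peer" and never lower the server's score; only
// other (unexpected) errors are penalized.
func FetchBlock(peers []*Peer, num uint64) (string, *Peer, error) {
	for _, p := range peers {
		b, err := p.Request(num)
		switch {
		case err == nil:
			p.Score++
			return b, p, nil
		case errors.Is(err, ErrRateLimited), errors.Is(err, ErrNotFound):
			continue // not the server's fault: no penalty, move on
		default:
			p.Score--
		}
	}
	return "", nil, fmt.Errorf("block %d unavailable from %d peers", num, len(peers))
}

// TrustedHashes tracks block hashes trusted for the current gap range.
type TrustedHashes struct {
	start, end uint64
	hashes     map[uint64]string
}

// SetRange flushes all trusted hashes if the gap range changed (for example
// after a reorg), so stale hashes cannot cause blocks to be promoted.
func (t *TrustedHashes) SetRange(start, end uint64) {
	if start != t.start || end != t.end {
		t.hashes = make(map[uint64]string)
		t.start, t.end = start, end
	}
}

// Trust records a hash as trusted for the current range.
func (t *TrustedHashes) Trust(num uint64, hash string) {
	if t.hashes == nil {
		t.hashes = make(map[uint64]string)
	}
	t.hashes[num] = hash
}

// Len reports how many hashes are currently trusted.
func (t *TrustedHashes) Len() int { return len(t.hashes) }

func main() {
	peers := []*Peer{
		{ID: "a", Limited: true},
		{ID: "b", Blocks: map[uint64]string{}},
		{ID: "c", Blocks: map[uint64]string{7: "payload-7"}},
	}
	b, p, err := FetchBlock(peers, 7)
	fmt.Println(b, p.ID, err)
}
```

In this sketch a server that is the only holder of the needed blocks keeps its score while it rate limits, so the client can come back to it later instead of driving it out of the peer set.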

zhiqiangxu commented 6 days ago

This issue can be fixed by using the p2p.sync.onlyreqtostatic flag introduced here.

anacrolix commented 6 days ago

I wondered where that static code came from. You'll be pleased to learn the PR mitigates the need for the flag.

zhiqiangxu commented 6 days ago

Yeah, the ultimate goal is the same: to find trusted nodes to sync from, either manually with p2p.sync.onlyreqtostatic, or automatically with your change :)

ethereum-optimism / optimism

Nodes can't recover from large L2 unsafe head gaps using p2p req resp sync #11779