Do not lock on execution when importing blocks

When currently syncing blocks, the CL notifies the EL about a new payload, after checking the validity it puts the block in forkchoice and after that it notifies the EL again of a new head. This workflow made a lot of sense at the time of the merge, where increased forking and possible attacks from miners could lead to a situation where an optimistically syncing node would deadlock.

In the current merged network the above workflow is unnecessarily inefficient, locking both the CL and the EL in what should be independent checks. This proposal is about discussing mechanisms that will allow the CL and EL to perform these checks independently. One possible path would be the following path for block validation

Validate the gossip rules for block propagation
Notify the engine of the new payload (do not lock on this notification)
Perform BLOCKHASH verification of the block within the CL.
Insert the block directly into forkchoice and evaluate new head (allow forkchoice to keep track of which blocks have been fully validated vs those that were just notified, this can simply repurpose the current "optimistic" status of nodes).
Notify the EL in case of a new head (again not locking).

The EL can immediately reply that a block is invalid if its parent is known to be invalid. The CL can retroactively change head to a valid one if either the call in 2 or 5 eventually return with invalid.

Obviously we need to change the way that duties are performed: proposals are on top of fully validated blocks, attestations the same. So perhaps the simplest form is to have get_head return the last fully validated block in the branch of the current head (or compute head filtering out all nodes not fully validated, the former seems simpler).

cc @mkalinin @gakonst @fradamt

At first glance, it seems this proposal does not change the behavior of how the CL optimistically tracks blocks. That is, even with this change a CL with a functioning EL attached would return the same head block as today.

Can you explain the difference?

Thanks for the question, I'll add more context on the kinds of changes this would entail. There are some design decisions still to be made, and I am not strongly advocating for some of those decisions that I'll include below.

Currently the CL would not even considered for forkchoice a block unless the EL has returned from notify_NewPayload with either VALID, ACCEPTED or SYNCING. If the EL returns any of these values, then the CL imports the block for forkchoice and that block is treated, for the purposes of computing the current head, as any other block. That is, at any given time, the head returned from forkchoice could be a block that was imported optimistically (ACCEPTED or SYNCING replies above). In the even that a call to get_head returns an optimistic block, the node is said to be in optimistic mode and it cannot serve duties, that is, such a beacon will not provide a block to propose nor an attestation to sign.

On the other hand, if the EL does not return anything (for example if the EL is taking a long time to execute the block) or even errors out on an internal bug, the beacon does not import the block, and the current forkchoice is not changed. This node does not become optimistic, a call to get_head will return the previous head without changes and the beacon will happily provide blocks to propose on top of that previous head, or attestations to it.

This issue (or the accompanying PR if I get some traction) proposes to change the behavior so that even in the event that the EL is taking longer and does not reply, the block can be imported. It can be imported "optimistically" or can be imported with some extra annotation to show that the block was not validated by the EL. But the block will still count for forkchoice and therefore it will affect the result of get_head even in the second case.

The reason for this change is multi-purpose:

Faster CL validation of the block: Since currently we need to wait for the full execution of the block and forkchoice cannot change until at least we have heard back from notify_NewPayload, clients stay locked one way or another for a long period in which they can't reliably touch forkchoice. This is on the one hand an implementation detail (functional style programming allows you to have different copies of forkchoice with a large shared subtree or, what prysm currently does, you need to lock forkchoice until the block processing has happened).
Simplicity of code for the CL: in part due to the interplay between the CL and the EL, block validation itself becomes complicated because of all the possible combinations of returns from the EL. With the above change, the CL will simply validate the CL status for the block and include it in forkchoice. Then later
- If the EL returns that the block is invalid, the CL will remove those paths from forkchoice (code already present to deal with invalid blocks¨)
- If the node is requested for duties, then the node can check if it's head has been fully validated or not (code already present to check optimistic status).
There is a liveness component to consider here: if we still do not allow a beacon node to propose blocks when its head is not fully verified, then we would be in a situation where some blocks could be proposed in the previous behavior and some wouldn't with the proposed behavior (that is the EL is taking a long time to validate the currently syncing block, and the node is requested to propose a block).

Therefore I would tentatively favour that validators that are syncing this way do perform duties on top of their last fully validated block, perhaps with some reasonable conditions like being that this block is a descendant of the current finalized or justified checkpoint.

the block can be imported. It can be imported "optimistically" or can be imported with some extra annotation to show that the block was not validated by the EL. But the block will still count for forkchoice and therefore it will affect the result of get_head even in the second case.

I don't think it's a problem to do imports in some limited/annotated fashion, but I do think it's potentially bad for it to be the return value of get_head -- what contexts would this surface in? Duties -- bad. Users? also seems bad -- the head is not actually a representation of what the has/sees so the information can not actually be utilized (safely or at all dpeending on what they are doing). This also seems like a bad thing to surface in relation to any sort of fast confirmation rule.

I certainly think that it's safe to parallelize maybe more than Prysm is doing so today, but with the caveat that it doesn't surface in the fork choice. But maybe that's the primary gain you are looking for.

Curious @paulhauner's take as he might have some of the optimistic/etc dangers more available in close recall.

what contexts would this surface in? Duties -- bad.

It's not clear to me what the actual danger here there is: for attestations, this shifts the semantics of the head vote from "this is the valid head of the chain" to "this is the head of the CL chain and it's a direct child of the valid head of the EL chain". For proposals, there's zero change, you can only propose on top of valid blocks.

There's also a direct incentive into proposing valid blocks, as your block is directly reorged if your execution turns out to be invalid.

Users? also seems bad

I don't think there's practical changes here as well: a call to the execution RPC will return data from the current EL head which is validated. Nothing has changed from the EL perspective at all. In the CL case, a call to the head block will return the current CL block at the tip, which will be guaranteed to have a validated parent and current valid consensus transition. It may also have an annotation to check if its payload is fully validated yet or not. I think nothing changes here except for this extra annotation.

The main differences here would be:

The node is no longer considered optimistic if it's head is not validated, but rather if the parent is optimistic.
There may be FFG problems if the self-reorging block justifies a checkpoint and turns out to be invalid.

I think we have solved all the problems with 2 with the new filtering rules (also note that the depth of attestations to the invalid block is bounded to 1 committee). And 1 I think it's not any danger because of the above considerations

ethereum / consensus-specs

Do not lock on execution when importing blocks #3577