ethereum / beacon-APIs

Collection of RESTful APIs provided by Ethereum Beacon nodes
https://ethereum.github.io/beacon-APIs/
Creative Commons Zero v1.0 Universal
328 stars 167 forks source link

Allow to bypass validation checks for `publishBlockV2` #451

Closed StefanBratanov closed 3 months ago

StefanBratanov commented 3 months ago

When creating blocks locally, most of the gossip checks would have already been covered by the process of creating the block itself (at least in Teku), so it makes sense to add an option to bypass validation checks altogether and publish the block immediately. This way there would be no additional delays (even if small) of broadcasting the block to the network when using v2 for publishing.

nflaig commented 3 months ago

When creating blocks locally, most of the gossip checks would have already been covered by the process of creating the block itself (at least in Teku)

This is only true if the block is published via the same beacon node that created it, in which case running gossip validation again does not make much sense. In a multi node setup, does this mean the validator client would have to keep track of which node produces the block and to which it is published, and set broadcast_validation accordingly?

In Lodestar, we already skip gossip validation if the block is published to the same beacon node but it's not decided by the validator client but rather the beacon node which has a cache of produced block roots, and if it was produced locally, we skip gossip validation.

StefanBratanov commented 3 months ago

When creating blocks locally, most of the gossip checks would have already been covered by the process of creating the block itself (at least in Teku)

This is only true if the block is published via the same beacon node that created it, in which case running gossip validation again does not make much sense. In a multi node setup, does this mean the validator client would have to keep track of which node produces the block and to which it is published, and set broadcast_validation accordingly?

In Lodestar, we already skip gossip validation if the block is published to the same beacon node but it's not decided by the validator client but rather the beacon node which has a cache of produced block roots, and if it was produced locally, we skip gossip validation.

I agree skipping validation in the beacon node which created it is the valid scenario in this case. But I can argue that it also makes sense in a multi-node setup. The beacon node which created the block would have already published the block, so if for some reason, other beacon nodes fail the gossip validation, the result would be that the block would just be propagated bit slower and they can get downscored a little bit in those rare cases. Ultimately, I think it's nice to have the option to not do any validation on the VC level. Since it's not a default option, clients can choose to use it however they want.

tersec commented 3 months ago

What is the magnitude of these validation delays?

michaelsproul commented 3 months ago

In Lighthouse I'm pretty sure the delay from gossip verification is trivial (around ~5ms). For us it was the code complexity of adding gossip verification and dealing with duplicate caches that was hard. But we can't remove this complexity while gossip verification is an option, so IMO we may as well keep it by default

Also open to adding the no_verification param if it is beneficial in clients. I think we just need to know if we're hunting performance or simplicity

arnetheduck commented 3 months ago

Ditto nimbus, it's (very) fast and nowadays just part of the "incoming external data" flow that applies to anything going out on gossip (and into the block database) whether that be re-broadcasting or rest-based data - even if a flag was added, it is likely we would do the checks anyway simply because it's more simple this way and provides a cheap bug safety net.

StefanBratanov commented 3 months ago

What is the magnitude of these validation delays?

In Teku, gossip validation seem to average around ~11ms (on a highish spec node). But occasionally, there are some outliers which take longer. For us specifically, for local blocks, we would like to skip it altogether and would be cleaner to do on the VC side. But if for other clients, there is no value in having an option to do no validation for this endpoint, I am happy to implement similar functionality as Lodestar that @nflaig mentioned.

nflaig commented 3 months ago

For us specifically, for local blocks, we would like to skip it altogether and would be cleaner to do on the VC side. But if for other clients, there is no value in having an option to do no validation for this endpoint, I am happy to implement similar functionality as Lodestar that @nflaig mentioned.

Imo doesn't hurt to have this option, even though we will not be using this in production, we had a use case for it during testing and Lodestar already supports a none value, it can only be configured internally right now and not via CLI. If we keep gossip as default, I don't see an issue other than clients not supporting it right now, if we wanna be strict about it would require to bump the api, but that seems a bit overkill to me.

StefanBratanov commented 3 months ago

For us specifically, for local blocks, we would like to skip it altogether and would be cleaner to do on the VC side. But if for other clients, there is no value in having an option to do no validation for this endpoint, I am happy to implement similar functionality as Lodestar that @nflaig mentioned.

Imo doesn't hurt to have this option, even though we will not be using this in production, we had a use case for it during testing and Lodestar already supports a none value, it can only be configured internally right now and not via CLI. If we keep gossip as default, I don't see an issue other than clients not supporting it right now, if we wanna be strict about it would require to bump the api, but that seems a bit overkill to me.

I like none better than no_validation. Modified the PR.

tersec commented 3 months ago

Imo doesn't hurt to have this option, even though we will not be using this in production, we had a use case for it during testing and Lodestar already supports a none value, it can only be configured internally right now and not via CLI. If we keep gossip as default, I don't see an issue other than clients not supporting it right now, if we wanna be strict about it would require to bump the api, but that seems a bit overkill to me.

What's the utility of specifying an option only useful for internal testing as part of the beacon API?

Also, to Jacek's point, a bit more concretely, Nimbus routes all these functions through (for those who might want to see details) routeSignedBeaconBlock, which does this validation as an integral part of block sending, separately from the specifics of any given beacon API function. While it's possible to adds a bypass of this, it would indeed complicate the funnel-everything-into-uniform-processing design for little apparent gain.

So far this PR seems surprisingly unmotivated by specific scenarios or use cases to warrant this, and handwaving about, well it's harmless or not that complicated ("I don't see an issue other than clients not supporting it right now") still creates a situation where there's an eventual expectation to implement this.

I'll be a bit more pointed: even if it's 20ms or 30ms, say, why are people waiting to send blocks late enough any of this matters?

(If it's because that's what MEV timing game players want, well.)

nflaig commented 3 months ago

So far this PR seems surprisingly unmotivated by specific scenarios or use cases to warrant this, and handwaving about, well it's harmless or not that complicated ("I don't see an issue other than clients not supporting it right now") still creates a situation where there's an eventual expectation to implement this.

I didn't wanna imply "not that complicated" to implement, from this PR, it seems for Teku it would be lower complexity to adopt it via query param instead of implementing something like Lodestar. For Nimbus, it seems higher complexity which is totally fair. That's why we have this discussion, to get a better picture / different perspectives and trade offs.

I'll be a bit more pointed: even if it's 20ms or 30ms, say, why are people waiting to send blocks late enough any of this matters?

I am not running Teku in my setup right now, but if I were I would like my locally built blocks to be gossiped faster if there is no clear downside to it. We shouldn't avoid optimizations to local block building just because most of the network is using MEV-boost, there are still a lot of people (especially solo stakers) which do not use it or have really high min-bid, and for those, reducing the internal latency is even more important as they might not have the best internet connection.

StefanBratanov commented 3 months ago

One of the main inspirations for us was that v1 would eventually be deprecated and we want to keep the same functionality as of today for locally created blocks whilst still using v2 (Teku VC still uses v1).

tersec commented 3 months ago

One of the main inspirations for us was that v1 would eventually be deprecated and we want to keep the same functionality as of today for locally created blocks whilst still using v2 (Teku VC still uses v1).

Can't speak to specifics of non-Nimbus implementations on how exactly the validation is handled here, but https://ethereum.github.io/beacon-APIs/?urls.primaryName=v2.5.0#/Beacon/publishBlock states that:

Instructs the beacon node to broadcast a newly signed beacon block to the beacon network, to be included in the beacon chain. A success response (20x) indicates that the block passed gossip validation and was successfully broadcast onto the network. The beacon node is also expected to integrate the block into state, but may broadcast it before doing so, so as to aid timely delivery of the block. Should the block fail full validation, a separate success response code (202) is used to indicate that the block was successfully broadcast but failed integration. After Deneb, this additionally instructs the beacon node to broadcast all given blobs.

Emphases added.

As specified, publishBlock (V1, retroactively) already involves validation. Some BNs may broadcast it before doing so, but that's framed as a non-default, optional behavior.

tbenr commented 3 months ago

Thanks all for the good feedback! We internally started the conversation on this topic several times, without reaching an actual decision. But reading throughout the thread I'm know lean to not having this option but instead leave the BN to responsible for eventual optimizations (like caching ala Lodestar implementation @nflaig).

Question: how you deal with [IGNORE] The block is not from a future slot rule? Are you considering this as a gossip validation failure? It won't be a problem if we add the "local built blocks cache", but in a multi-BN scenario, if VC deals only with lagging BNs it might lose the proposal.

My current feeling is to leave it as a validation failure (current teku behaviour). In general all IGNORE are considered failure with the exception of "already seen block".

nflaig commented 3 months ago

Question: how you deal with [IGNORE] The block is not from a future slot rule? Are you considering this as a gossip validation failure?

part of gossip validation in Lodestar and considered a failure

In general all IGNORE are considered failure with the exception of "already seen block".

that's our current logic as well

StefanBratanov commented 3 months ago

Thanks for the discussion all. :) We will implement similar functionality as Lodestar, so none validation option wouldn't be necessary. If at some point, clients think it could be useful option, we can always refer to this PR. Closing for now.