specify liveness failsafe for builder network

ralexstokes commented 2 years ago

this PR specifies one version of a "circuit breaker" to reduce the scope of the failure domain in the event of liveness faults on chain

note to implementers: you can just count the distinct number of block roots from the BeaconState to get a stream of the proposed inputs ("M slots over a window of N slots")
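A minimal sketch of that counting trick, assuming the window is passed in as a plain list of roots already sliced out of the state's `block_roots` (the helper names here are hypothetical, not part of any spec). It relies on the fact that a skipped slot repeats the previous block root in the `BeaconState`:

```python
from typing import Sequence


def count_proposed(window_roots: Sequence[bytes]) -> int:
    """Count slots in the window that have a block (hypothetical helper).

    A missed slot repeats the previous block root, so each distinct root
    corresponds to one proposed block. Caveat: if the first slot of the
    window was itself missed, its root belongs to a block from before the
    window, and this count is high by one.
    """
    return len(set(window_roots))


def count_missed(window_roots: Sequence[bytes]) -> int:
    """Count slots in the window without a block of their own."""
    return len(window_roots) - count_proposed(window_roots)


if __name__ == "__main__":
    # Six slots: blocks at slots 0, 2 and 3; slots 1, 4 and 5 were missed.
    roots = [b"a", b"a", b"b", b"c", b"c", b"c"]
    print(count_proposed(roots), count_missed(roots))
```

Since this only reads the canonical state, a block that was proposed and later orphaned shows up as a missed slot.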

terencechain commented 2 years ago

> note to implementers: you can just count the distinct number of block roots from the BeaconState to get a stream of the proposed inputs ("M slots over a window of N slots")

This will count orphaned blocks as missing, no?

ralexstokes commented 2 years ago

I feel like the outcome on the call was that we would just stick to the canonical chain as it would be simpler to implement -- just look at the canonical beacon state

and to mitigate any sort of reorg'ing attacks we now want to widen the window to trigger the breaker

and if the current suggested range for ALLOWED_FAULTS feels too small we can bump it up (and possibly also widen the rolling window)
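For illustration, the trigger could be sketched roughly like this. `ALLOWED_FAULTS` is the name used in this PR, but the value below is a placeholder; `WINDOW_SLOTS`, the function name, and the strict `>` comparison are all assumptions made for the sketch:

```python
from typing import Sequence

ALLOWED_FAULTS = 8   # placeholder value, not the one proposed in the PR
WINDOW_SLOTS = 32    # hypothetical name and value for the rolling window


def breaker_tripped(
    recent_block_roots: Sequence[bytes],
    allowed_faults: int = ALLOWED_FAULTS,
    window: int = WINDOW_SLOTS,
) -> bool:
    """Return True if the builder path should be bypassed (sketch only).

    Looks at the last `window` roots of the canonical chain and counts
    missed slots as repeated roots. Whether the comparison should be
    strict is an assumption here.
    """
    window_roots = list(recent_block_roots)[-window:]
    missed = len(window_roots) - len(set(window_roots))
    return missed > allowed_faults


if __name__ == "__main__":
    healthy = [bytes([i]) for i in range(32)]      # no missed slots
    faulty = [bytes([i // 2]) for i in range(32)]  # every other slot missed
    print(breaker_tripped(healthy), breaker_tripped(faulty))
```

Bumping `allowed_faults` or `window` makes the trigger less sensitive, which is the knob being discussed here for mitigating reorg-based attacks.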

terencechain commented 2 years ago

> I feel like the outcome on the call was that we would just stick to the canonical chain as it would be simpler to implement -- just look at the canonical beacon state

I don't have an issue with this. For Prysm it's equally simple to look from the canonical chain's perspective or to combine the canonical and forked chains' perspectives. We'll stick with counting missing slots as they are truly missed. I think client diversity is a nice-to-have here

ralexstokes commented 2 years ago

yes, I think the first goal is: "clients have something implemented" if only to dissuade potential actors who would abuse this attack vector

the second goal is: "do clients in aggregate implement something that is hard to attack?" -- this is where it matters a bit who is doing what bc if someone has a more sensitive trigger then it could selectively be used to take the builder network offline

tersec commented 2 years ago

In terms of "reducing the scope of the failure domain in the event of liveness faults on chain", given that https://github.com/remyroy/ethstaker/blob/main/MEV-relay-list.md documents multiple relay networks, it's not obvious that the builder network is more effectively centralized than a still-near-Geth-monoculture on the EL side.

The implied calculation is that the most likely inference from a bunch of missed slots is that (a) a large portion of the network is using the builder API; and (b) the builder API is less reliable in a correlated-across-CL-nodes way than the engine API. (a) hasn't yet proven as true as some earlier predictions, and I'm not sure (b) is useful to bet on -- even if one relay network has problems, it's reasonable to try another, and mev-boost can already be configured to try multiple relay networks.

While local EL infrastructure for the engine API is definitely less centralized per se, that doesn't protect against already-witnessed situations where all or most EL instances of a certain type across a network fail to propose good blocks in some situation.

Rather, the builder and engine APIs act as a kind of diversity in themselves, and given that the former isn't simply a FB interface anymore, responding to network trouble by decreasing decentralization seems risky.

ralexstokes commented 10 months ago

closing in favor of #95

ethereum / builder-specs

specify liveness failsafe for builder network #47