flashbots / mev-boost

MEV-Boost allows Ethereum validators to source high-MEV blocks from a competitive builder marketplace
https://boost.flashbots.net
MIT License

Safeguards and mitigations to preserve liveness #222

Open come-maiz opened 2 years ago

come-maiz commented 2 years ago

We want to document the conditions related to mev-boost and the Flashbots relay that would affect the liveness of the blockchain, to make sure that they are prevented or mitigated.

We want to test these conditions live in a testnet.

come-maiz commented 2 years ago

The main part of the sidecar design is to be able to use the local execution client to produce blocks. If the relay doesn't reply to mev-boost, the proposer will still be able to get a valid block, get attestations, and earn rewards.

TODO: check that all the consensus clients fall back to the local execution client when they get no reply to the getHeader or registerValidator calls.

question: what is the expected reply time from the relay? How much time will the proposer wait for an answer? Will there be plenty of time to execute the fallback code path?

question: should the local execution client build a block in parallel so it is ready in case the relay fails?

djrtwo commented 2 years ago

not just fall-back in event of no reply, but I would suggest running the build process in parallel and just not "getting the local block" if mev-boost is working properly. otherwise, abort mev-boost and get the locally built block

EDIT: just saw your last question. that's what I would suggest 👍
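A minimal sketch of this parallel-build-and-fallback flow, using hypothetical helpers (getRelayHeader, buildLocalBlock) rather than actual client or mev-boost code: local building starts immediately, and the relay block is only preferred if it arrives before the deadline.

```go
package main

import (
	"context"
	"errors"
	"log"
	"time"
)

type Block struct{ Source string }

// Hypothetical stand-ins for the two block sources.
func getRelayHeader(ctx context.Context) (*Block, error) {
	select {
	case <-ctx.Done():
		return nil, ctx.Err()
	case <-time.After(200 * time.Millisecond): // pretend relay latency
		return &Block{Source: "relay"}, nil
	}
}

func buildLocalBlock(ctx context.Context) (*Block, error) {
	return &Block{Source: "local execution client"}, nil
}

func proposeBlock() (*Block, error) {
	ctx, cancel := context.WithTimeout(context.Background(), time.Second)
	defer cancel()

	localCh := make(chan *Block, 1)
	relayCh := make(chan *Block, 1)

	// Start local building immediately so a block is always ready.
	go func() {
		if b, err := buildLocalBlock(ctx); err == nil {
			localCh <- b
		}
	}()
	// Ask the relay in parallel.
	go func() {
		if b, err := getRelayHeader(ctx); err == nil {
			relayCh <- b
		}
	}()

	// Prefer the relay block if it arrives before the deadline,
	// otherwise fall back to the locally built one.
	select {
	case b := <-relayCh:
		return b, nil
	case <-ctx.Done():
		select {
		case b := <-localCh:
			return b, nil
		default:
			return nil, errors.New("no block available")
		}
	}
}

func main() {
	b, err := proposeBlock()
	if err != nil {
		log.Fatal(err)
	}
	log.Printf("proposing block from %s", b.Source)
}
```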

come-maiz commented 2 years ago

Because of the previous point, and since the relay is trusted, the relay team can simply take it offline while the bug is fixed.

For the Flashbots relay we will have two devops engineers covering most timezones, with their respective backups, and alerts to notify them when things look suspicious. When something looks wrong, the process should start with taking the relay offline while we understand the problem.

question: what are the conditions we should monitor to identify problems?

question: what weird things could start to happen if there is no trusted relay available for mev extraction?

question: what happens if the relay was trusted but now goes rogue or just can't or won't shut down?

come-maiz commented 2 years ago

We have to inform everybody of the risks of using a relay that is not trustworthy, whether because its intentions are unclear, it is profit-maximizing above all else, it has unreliable uptime, or it does not carefully enforce that builders provide valid and sensible blocks.

We want all the people interested in running a relay to start by running a builder, so they understand all the challenges and get our support. See #145.

We can have a relay monitor, so when one proposer is affected, they can share the information and warn the others. See #142.

question: what makes a relay not trustworthy?

question: if a relay repeatedly misbehaves, should mev-boost or the consensus client discard it and force the operator to run a command to enable it again? How often will mev-boost interact with the relay? If this is not often, then the permanent disconnection could be too slow.

question: what metrics should be monitored? How do we translate this into a number that lets proposers evaluate the risk of using a specific relay?

question: what happens if the relay monitor fails or goes rogue?

come-maiz commented 2 years ago

The relays have to be very strongly and constantly scrutinized by the searchers, builders and proposers. For a relay to be trusted it has to publish the data necessary for verifying its operation. https://flashbots.notion.site/Relay-API-Spec-5fb0819366954962bc02e81cb33840f5#38a21c8a40e64970904500eb7b373ea5 and https://github.com/flashbots/flashbots-data-transparency

A relay should get the blocks from a trusted builder or from a network of competing builders that is stable and not centralized.

question: what happens if the most profitable builder is shady: anonymous, untrustworthy, solely profit-oriented? The short-term economic incentive would be to use it; it would then gain a majority and effectively own the blockchain.

:thinking: maybe the relay can rotate the builders, so none of them produces more than X% of the blocks. Not the best idea for profitability, but makes sense for long-term stability.
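A toy sketch of this rotation idea, assuming a sliding window over the last N delivered blocks and an illustrative 25% cap per builder; the types and thresholds are not from any relay implementation:

```go
package main

import "fmt"

type builderCap struct {
	window   []string // builder pubkeys of the last N delivered blocks
	maxShare float64  // e.g. 0.25 for a 25% cap
	size     int      // window length N
}

// allowed reports whether delivering one more block from this builder
// would keep it within the configured share of the window.
func (c *builderCap) allowed(builder string) bool {
	count := 0
	for _, b := range c.window {
		if b == builder {
			count++
		}
	}
	return float64(count+1)/float64(c.size) <= c.maxShare
}

// record appends a delivered block and trims the window to N entries.
func (c *builderCap) record(builder string) {
	c.window = append(c.window, builder)
	if len(c.window) > c.size {
		c.window = c.window[1:]
	}
}

func main() {
	bc := &builderCap{maxShare: 0.25, size: 100}
	fmt.Println(bc.allowed("builder-a")) // true while under the cap
	bc.record("builder-a")
}
```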

come-maiz commented 2 years ago

Important to note that not all validators will be using mev-boost. But there's a strong economic incentive for them to use it, so I expect the majority will. Will the percentage of clueless or not-profit-maximizing validators be relevant for preventing collapse? How many of them should there be to play a relevant role in stability?

thegostep commented 2 years ago

Last January, I prepared a document outlining a set of proposed mev-boost security features that aim to address potential relay faults which can lead to liveness issues.

mev-boost in its current state does not mitigate these faults. Given that they have the potential to stall the chain and prevent new blocks from being proposed, I have put together in this post my thoughts on high-priority mitigation paths ahead of the upcoming merge.

Please read the original document before continuing to read this post!

worst case scenario analysis

Let's look at a worst-case scenario. We assume that at the merge, >90% of validators are running mev-boost and are exclusively connected to the Flashbots relay, and mev-boost is deployed in its current state.

A bug in the Flashbots relay could possibly lead it to have the following faults. Each fault can be analyzed in terms of whether it is "cascading" and "attributable". A cascading fault means that the validator of the current slot is not aware whether the fault occurred to the validator in the previous slot. A non-attributable fault means that it is not possible to prove whether the fault originated from validator or from relay misbehavior. Cascading faults are the most dangerous, as they have the potential to impact chain liveness for extended periods of time. Attribution helps in mitigating cascading faults, as fraud proofs can be constructed and used programmatically in a reputation system or circuit breaker, but it does not prevent the fault from occurring.

  1. Reveal Withholding aka "missing data" (cascading, non-attributable)

A bug or degraded performance in the relay (due to DoS or another infrastructure outage) causes it to propose block headers to validators but fail to reveal the block bodies in time for inclusion in the chain. This is non-attributable because it is impossible for the network to differentiate whether it is the validator or the relay that is causing the reveal delay.

  2. Invalid Block aka "invalid payload" (cascading, attributable)

A faulty relay simulation may cause it to send blocks that break consensus rules. This means it would reveal blocks to the network on time, but the blocks are not accepted by the attestation committees. This is an attributable fault because relays sign all the blocks they submit to validators.

  3. Incorrect Block aka "inaccurate value" (cascading, attributable)

A faulty relay simulation may cause it to send blocks that are valid under consensus rules, but misrepresent the value of the blocks. An extreme case of this fault would lead to validators proposing empty blocks, or to the relay or block builder not paying the feeRecipient. This fault could cause a degraded user experience, but would not cause a consensus liveness issue for the network. This is an attributable fault because relays sign all the blocks they submit to validators.

Questions:

potential mitigations

Validators need a way to identify these faults and disconnect from the offending relay programmatically. This means turning worst-case scenarios into attributable, non-cascading faults.

Reveal Withholding appears to be the greatest threat and is therefore the priority to mitigate. The following mitigations focus on this fault, but can be used for the other faults too.

  1. Circuit breaker

A circuit breaker would be code implemented by the consensus client which says "disconnect from mev-boost if the network has not produced a block in X slots". This requires the consensus clients to be able to inspect network traffic to identify when missed slots occur. In theory this should mitigate block withholding and invalid block faults by making them non-cascading. It does not make block withholding attributable.
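A minimal sketch of such a circuit breaker, assuming the consensus client can observe which slot last produced a block; the gap threshold X is an illustrative assumption:

```go
package main

import "fmt"

// builderEnabled returns false once the chain has gone more than maxGap
// slots without a block, signalling that mev-boost should be bypassed.
func builderEnabled(currentSlot, lastSlotWithBlock, maxGap uint64) bool {
	return currentSlot-lastSlotWithBlock <= maxGap
}

func main() {
	// Example: last block seen at slot 100, we are now at slot 107, X = 5.
	if !builderEnabled(107, 100, 5) {
		fmt.Println("circuit breaker tripped: building blocks locally")
	}
}
```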

Questions:

  2. Relay monitoring

A relay monitor is a third-party system that a validator connects to and delegates the responsibility of monitoring relay performance. If the relay monitor identifies that a relay has induced any of the three faults, it can send a message to the mev-boost instances of all validators to disconnect from this relay. The clear advantage of this approach over the circuit breaker on the consensus client is that it solves for all three fault types without limitation on the data that can be accessed. The obvious drawback is that it adds an additional trusted party to the system, which can have faults and outages of its own. This additional trust can be mitigated by connecting to multiple independent relay monitors with a 1/n policy.
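A sketch of the 1/n policy, with a made-up monitor endpoint and response convention (any single monitor can veto a relay):

```go
package main

import (
	"context"
	"fmt"
	"net/http"
	"time"
)

// relayFlagged asks one hypothetical monitor whether the relay misbehaved.
func relayFlagged(ctx context.Context, monitorURL, relayPubkey string) bool {
	req, err := http.NewRequestWithContext(ctx, http.MethodGet,
		monitorURL+"/v1/relay/"+relayPubkey+"/faults", nil)
	if err != nil {
		return false
	}
	resp, err := http.DefaultClient.Do(req)
	if err != nil {
		return false // an unreachable monitor is not treated as a fault
	}
	defer resp.Body.Close()
	// Assumption: the monitor answers 200 only when it has recorded a fault.
	return resp.StatusCode == http.StatusOK
}

// keepRelay applies the 1/n policy: any single monitor can veto the relay.
func keepRelay(monitors []string, relayPubkey string) bool {
	ctx, cancel := context.WithTimeout(context.Background(), 2*time.Second)
	defer cancel()
	for _, m := range monitors {
		if relayFlagged(ctx, m, relayPubkey) {
			return false
		}
	}
	return true
}

func main() {
	monitors := []string{"https://monitor-a.example", "https://monitor-b.example"}
	fmt.Println(keepRelay(monitors, "0xabc..."))
}
```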

Questions:

  3. Relay multi-sig

A relay multi-sig means that mev-boost would implement logic requiring x of n relays to propose the same block header for the header to be considered valid and released to the consensus client. In theory, this should reduce the risk of faults occurring if relays are run by independent parties and have independent implementations. It does not, however, seem to help with cascading faults or attribution in the worst-case scenario where multiple relays have correlated faults.
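A sketch of the x-of-n agreement check, comparing headers by block hash; the types and the quorum value are illustrative assumptions:

```go
package main

import "fmt"

// headerAgreedByQuorum returns the block hash proposed by at least
// `quorum` relays, or false if no header reaches the threshold.
func headerAgreedByQuorum(headersByRelay map[string]string, quorum int) (string, bool) {
	counts := map[string]int{}
	for _, blockHash := range headersByRelay {
		counts[blockHash]++
		if counts[blockHash] >= quorum {
			return blockHash, true
		}
	}
	return "", false
}

func main() {
	headers := map[string]string{
		"relay-a": "0xheader1",
		"relay-b": "0xheader1",
		"relay-c": "0xheader2",
	}
	if h, ok := headerAgreedByQuorum(headers, 2); ok {
		fmt.Println("releasing header", h, "to the consensus client")
	}
}
```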

Questions:

  4. Fraud proofs

Fraud proofs or payment proofs involve taking an attributable fault and generating a proof that is submitted to all other validators in the network to notify them to disconnect from a relay. They can be used to turn cascading, attributable faults into non-cascading faults. This means they are not helpful for the withholding issue, but they can be used alongside other mitigation techniques.

Questions:

worst case fault recovery testing

Whichever mitigation is selected, it should be deployed and tested in a production environment by a supermajority of mainnet node operators on diverse consensus clients. The test should simulate the worst-case fault described above, with 100% of node operators connected to the faulty relay, and monitor that the chain is able to successfully recover and continue producing blocks.

Node operators should only whitelist a relay once it has successfully completed this test.

come-maiz commented 2 years ago

I'm interested in how we can design that worst case fault recovery testing.

Sepolia could have a representative sample of the node operators proportional to their stake in mainnet. And then we could coordinate all the known big node operators for this kind of testing. Would that make sense?

Or would this only be feasible in a lab-simulated testnet?

@parithosh @lightclient @yoavw, any thoughts? Anybody else from your teams that would want to collaborate on this?

come-maiz commented 2 years ago

Here's one case in the category of trusted relay going rogue that we can't shut down.

What happens if the flashbots DNS is attacked and we lose control over the domain?

@sukoneck can we define and implement a policy for DNS changes on the mainnet relay that prevents a single employee from changing it, prevents any customer support agent at the provider from changing it, and alerts on any changes? I've reported it in https://github.com/flashbots/infra/issues/105

mev-boost has to register the relay with its URL and public key, and verify that every block received is signed with the corresponding private key.

metachris commented 2 years ago

I'm interested in how we can design that worst case fault recovery testing.

  1. First we need the mitigation mechanism agreed and implemented in the CL clients.
  2. Triggering the error is easy -- we can simply configure our relay to withhold every block on the submitBlindedBlock call. This test should happen on a testnet (or shadowfork) where 80% or more of the network uses the relay, and all CL clients would be represented. cc/ @parithosh
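A hypothetical sketch of such a withholding mode on the relay side: the blinded-block endpoint simply refuses to reveal the payload when a test flag is set (the flag name and wiring are made up):

```go
package main

import (
	"log"
	"net/http"
	"os"
)

func main() {
	withhold := os.Getenv("WITHHOLD_PAYLOADS") == "1"

	http.HandleFunc("/eth/v1/builder/blinded_blocks", func(w http.ResponseWriter, r *http.Request) {
		if withhold {
			// Simulate the "reveal withholding" fault: never return the payload.
			http.Error(w, "payload withheld for liveness test", http.StatusInternalServerError)
			return
		}
		// ... normal path: unblind and return the execution payload ...
	})

	log.Fatal(http.ListenAndServe(":8080", nil))
}
```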

parithosh commented 2 years ago

Sepolia could have a representative sample of the node operators proportional to their stake in mainnet. And then we could coordinate all the known big node operators for this kind of testing. Would that make sense?

Sepolia has a permissioned validator set and while some of the staking pools are represented, I wouldn't say it's proportional to mainnet. We'd like to keep the validator set small, so onboarding a lot of validators would be out of the question.

I'd say an ephemeral testnet or a shadow fork is probably the easiest way to co-ordinate this sort of testing. I'd be happy to help with either. I'm assuming your team is already in touch with all potential participants and I'd mainly have to provide configs and validator keys?

ralexstokes commented 2 years ago

question: should the local execution client build a block in parallel so it is ready in case the relay fails?

there is currently a 1 second timeout for mev-boost to fail to produce a block before the proposer moves to a local pathway: https://github.com/ethereum/builder-specs/blob/main/specs/validator.md#relation-to-local-block-building

my only hesitation with local building in parallel is if the resource cost hinders those who would otherwise run nodes, e.g. at-home stakers

although we should assume any proposer is sufficiently resourced to produce a block w/o the builder network and this kind of suggests we should update the directive in the builder-specs

parithosh commented 2 years ago

my only hesitation with local building in parallel is if the resource cost hinders those who would otherwise run nodes, e.g. at-home stakers

Would that really be the case? My understanding is that mev-boost would just fire off a request to fetch the payload. Once the request is sent, there's no extra processing overhead. In the meantime, requesting the EL to generate the payload shouldn't be an extra overhead (considering that it's what would need to happen if mev-boost didn't exist).

StefanBratanov commented 2 years ago

question: should the local execution client build a block in parallel so it is ready in case the relay fails?

there is currently a 1 second timeout for mev-boost to fail to produce a block before the proposer moves to a local pathway: https://github.com/ethereum/builder-specs/blob/main/specs/validator.md#relation-to-local-block-building

my only hesitation with local building in parallel is if the resource cost hinders those who would otherwise run nodes, e.g. at-home stakers

although we should assume any proposer is sufficiently resourced to produce a block w/o the builder network and this kind of suggests we should update the directive in the builder-specs

Actually, currently in Teku we make an async request for an ExecutionPayload to the execution layer before requesting a header from builders. That way we can quickly fall back to a local block in case things go wrong with the builder flow (timeouts, exceptions, validator not registered).

I am wondering: shouldn't there be a timeout in the builder spec, similar to BUILDER_PROPOSAL_DELAY_TOLERANCE (1s), for getting the payload from the builders? That way, in the worst-case scenario, the block wouldn't get delayed too much, and if a local ExecutionPayload is already available, the proposal could still happen in time.

thegostep commented 2 years ago

I think there is an interesting case to be made for each client implementing mitigations as it sees fit, rather than the entire network adopting the same mitigation technique. Diverse mitigations might mean more network resilience against accidental outages, and a higher cost for deliberate attacks. It would be great to keep a reference of the mitigations used by each client in this issue or in the mev-boost documentation (cc @0xpanoramix).

Here are some links to lighthouse and prysm: https://github.com/sigp/lighthouse/issues/3355 https://github.com/prysmaticlabs/prysm/issues/11109

ralexstokes commented 2 years ago

I am wondering: shouldn't there be a timeout in the builder spec, similar to BUILDER_PROPOSAL_DELAY_TOLERANCE (1s), for getting the payload from the builders? That way, in the worst-case scenario, the block wouldn't get delayed too much, and if a local ExecutionPayload is already available, the proposal could still happen in time.

if I'm following you, you are referring to a timeout on the call to get the complete payload from the builder after having already signed the bid

in this scenario, a proposer does not want to publish a competing block as it would be a slashable offence

I think building in parallel makes sense but the proposer should only ever make one (1) signature for a given slot

StefanBratanov commented 2 years ago

I am wondering: shouldn't there be a timeout in the builder spec, similar to BUILDER_PROPOSAL_DELAY_TOLERANCE (1s), for getting the payload from the builders? That way, in the worst-case scenario, the block wouldn't get delayed too much, and if a local ExecutionPayload is already available, the proposal could still happen in time.

if I'm following you, you are referring to a timeout on the call to get the complete payload from the builder after having already signed the bid

in this scenario, a proposer does not want to publish a competing block as it would be a slashable offence

I think building in parallel makes sense but the proposer should only ever make one (1) signature for a given slot

Yeah, I was referring to the payload call.

As for the building in parallel: when the proposer asks for a block, it should get either the MEV block or a local one, depending on timeouts or any exceptions. There will be only one signature. The additional timeout for the payload call could potentially help with mitigating any malicious delays when requesting the payload.

metachris commented 2 years ago

mitigating any malicious delays when requesting the payload

I don't see how an additional timeout on the BN would help here. mev-boost tries to get the payload from all the relays, and as soon as it gets the payload from one it cancels the requests to other relays. This alone should mitigate any malicious delays from other relays. Otherwise mev-boost is using a 2 second relay timeout by default, configurable with -request-timeout. Perhaps that should be longer for getPayload calls 🤔
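A sketch of this fan-out-and-cancel pattern with placeholder helpers (not mev-boost's actual functions): all relays are queried concurrently, the first payload wins, and cancelling the shared context aborts the rest.

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"time"
)

type Payload struct{ Relay string }

// fetchPayload is a stand-in for the HTTP call to a single relay.
func fetchPayload(ctx context.Context, relay string) (*Payload, error) {
	select {
	case <-ctx.Done():
		return nil, ctx.Err()
	case <-time.After(100 * time.Millisecond): // pretend network latency
		return &Payload{Relay: relay}, nil
	}
}

// getPayloadFromAny queries all relays concurrently and returns the first
// payload; the shared context cancels the remaining requests.
func getPayloadFromAny(relays []string, timeout time.Duration) (*Payload, error) {
	ctx, cancel := context.WithTimeout(context.Background(), timeout)
	defer cancel() // cancels all outstanding requests once we return

	results := make(chan *Payload, len(relays))
	for _, relay := range relays {
		go func(r string) {
			if p, err := fetchPayload(ctx, r); err == nil {
				results <- p
			}
		}(relay)
	}

	select {
	case p := <-results:
		return p, nil
	case <-ctx.Done():
		return nil, errors.New("no relay revealed the payload in time")
	}
}

func main() {
	p, err := getPayloadFromAny([]string{"relay-a", "relay-b"}, 2*time.Second)
	if err != nil {
		fmt.Println(err)
		return
	}
	fmt.Println("got payload from", p.Relay)
}
```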

terencechain commented 2 years ago

mev-boost tries to get the payload from all the relays, and as soon as it gets the payload from one it cancels the requests to other relays.

Shouldn't this relationship be 1:1? My understanding is the payload should come from the relay where mev-boost called getHeader. Are we assuming builders broadcast the same payload to multiple relays, so more than one relay can serve it?

ralexstokes commented 2 years ago

mitigating any malicious delays when requesting the payload

all this would really do is make sure the one caller is not affected, which I guess is worth considering, but to my knowledge clients have timeouts across the entire proposal process so that would catch this already (?)

this feels like a thing that doesn't go into the spec

StefanBratanov commented 2 years ago

mitigating any malicious delays when requesting the payload

I don't see how an additional timeout on the BN would help here. mev-boost tries to get the payload from all the relays, and as soon as it gets the payload from one it cancels the requests to other relays. This alone should mitigate any malicious delays from other relays. Otherwise mev-boost is using a 2 second relay timeout by default, configurable with -request-timeout. Perhaps that should be longer for getPayload calls 🤔

I was referring to the Reveal Withholding aka "missing data" problem described above. If the beacon node is connected to a relay or has set a higher --request-timeout in the mev-boost component, and also hasn't set a specific timeout for the payload request, it could lead to a missed block whether the delay from mev-boost/relays was malicious or not.

come-maiz commented 2 years ago

https://writings.flashbots.net/writings/understanding-mev-boost-liveness-risks/