Open lukanus opened 1 year ago
My main concern with this approach is that it changes guarantees of when a registration would be available for a block builder, something the validator wants ASAP presumably. But the spec currently doesn't have a notion for that:
Validators should submit valid registrations well ahead of any potential beacon chain proposal duties to ensure their building preferences are widely available in the external builder network.
Following such a change, consensus clients would need to change their behavior so that they wait something like `RELAY_REGISTRATION_VERIFICATION_EPOCHS` before verifying successful registrations, which would benefit from a multi-registration check endpoint. The tradeoff is a little more complexity on the validators' side for a substantial reduction of requirements on relays.
I'm not a fan of these proposed changes.
First, I don't subscribe to the initial problem statement.
More significantly, any interaction between the beacon node and builder network should be governed by this repository (builder-specs) only, and be part of the `/eth/v1/builder` URL namespace. Moving the validator registration outside of it breaks the semantics.
Furthermore, the data API is an optional, read-only API that just provides information, which I think is how it should stay.
Hello @metachris :)
So this proposal addresses a general performance problem that comes from the uneven load on the relays, caused by all validator registrations being submitted in a very small time window. I totally understand that mitigating this would be problematic or even impossible. As we know, right now we have 480k validators and we can expect this number to grow over time as adoption of the Ethereum ecosystem increases.
So getting some numbers - using a benchmark similar to this (https://go.dev/play/p/qIkX32P72xU), running locally on my laptop (cpu: Intel(R) Core(TM) i9-9880H CPU @ 2.30GHz) with the CGO call tweaked to be made just two times per goroutine: for only 100k verifications running in "pseudo" parallel, I start getting >2s times by the end of the test. And that's less than 1/4 of the total validators in the network.
...
37.935837ms 37.728061ms 37.523928ms 2.9077741s 2.905973879s 39.033587ms 2.94450452s 253.90337ms 2.786395929s 2.807140238s 3.04535983s 309.566367ms 279.77344ms 312.924829ms 2.820837326s 2.750283127s 3.166437639s 3.339986995s 2.835711676s 2.723569409s 3.038397519s 2.576335097s 3.013019202s 38.933431ms 2.720233617s 2.362865309s 2.776145867s 3.681909227s 2.895910462s 2.915738048s 2.995064506s 3.113350071s 2.834942554s 2.728928558s 292.017123ms 2.780016954s 2.728749649s 2.727139215s 2.726702378s 42.569885ms 2.718879444s 2.696528209s 2.752178562s 2.699053796s 2.670845901s 2.765461856s 2.750473287s 2.602536422s 2.75083412s 3.633679448s 2.50511517s 2.415705863s 2.749493338s 2.660235112s 240.622069ms 3.043937598s 2.914697023s 2.751057663s 2.625962425s 2.751828603s 2.492199364s 2.504997482s 43.800332ms 2.330705368s 2.94870707s 2.601791192s 2.593560958s 2.749524767s 53.447823ms 4.264518309s 2.38180683s 2.377050874s 2.329720002s 41.080284ms 37.103973ms 39.119569ms 2.367715328s 2.345738463s 36.873379ms 2.624762982s 2.634473542s 38.450864ms 2.584475202s 3.050292374s 38.447486ms 2.641822882s 61.061044ms 3.508749847s 43.896506ms 2.602160536s 42.757587ms 3.025659504s 2.818663048s 2.412669092s 2.661324843s 51.716491ms 3.0502516s 2.632646561s 3.051142488s 255.869857ms 60.742465ms 2.754216055s 2.624122268s 2.613799906s 56.339688ms 2.339848668s 2.472792414s 2.700297608s 2.698910354s 54.795769ms 61.692674ms 53.813275ms 2.754934288s 2.624789041s 2.4788049s 2.442373473s 2.360090553s 2.748827285s 2.196213833s 2.595349887s 2.475422718s 2.608756717s 52.989874ms 2.428796966s 2.605602633s
This comes from the fact that there is a finite number of cores, which ultimately just get flooded by the number of calculations you need to do. So purely from a performance standpoint, that's a pretty heavy bottleneck.
Answering your last concern first - the description of the getValidatorRegistration endpoint in the only documentation (https://flashbots.github.io/relay-specs/#/Data/getValidatorRegistration) quite clearly states that this endpoint is meant to "Check that a validator is registered with the relay" and that it is "Useful to check whether your own registration was successful." So it's exactly for this particular purpose.
So no one is suggesting any changes to the API other than:
If you desire - this does not require literally any changes in your implementation - other than adding a static `status: "verified"` into the validator registration endpoint - as the flashbots relay would always verify it right away, and that's all.
However, as signature verification would no longer need to be done upon registration, this would open up many different approaches to solving the bottleneck.
Writing the document above, I did my best to create a change that would require almost zero engagement from relay authors - and that would really solve, not just mitigate, the CPU bottleneck. I understand that there are some ways of mitigating these performance problems, but it should be in the hands of relay authors to decide whether to use a cache or solve it another way.
As for the cache itself: the validator registration endpoint is currently publicly open, so, if I'm not mistaken, anyone can easily create an attack script that submits random payloads, bypassing the cache "guards". By doing this constantly, an attacker can make a relay unresponsive with a fairly small number of payloads using a cheap instance. If verification is not done during the submission process - as suggested above - it's impossible to make the relay unresponsive because of this bottleneck.
I think that there may be a case for asynchronous verification of validator registrations, but before this the synchronous path should be optimized to see if it is really necessary as it does add a level of complexity for validators beyond the current "if I receive a successful registration message then I know the registration was good".
There is a lot that can be done with validator registrations beyond what exists in either dreamboat or the flashbots MEV relay today. Some notes on this:
- `map[[48]byte]*Pubkey` for public keys and a similar map for signatures avoids unnecessary recomputation
- `map[[48]byte]*SignedValidatorRegistration`, where the `[48]byte` is the validator's public key, allows simple checks to be made against new validator registrations (e.g. if the fee recipient address and timestamp are the same then it can be flagged as valid without further checks; if the fee recipient address is the same but the timestamp later then it can be flagged as valid and the signature checked out of the critical path to decide whether or not to update the map)

And of course concurrency should be used as much as possible to parallelize operations where large numbers of registrations are received at the same time.
It is also worth exploring the issue where there is a large spike of registrations around the epoch. There is, as far as I am aware, no technical reason why the registrations all need to arrive at this particular point in time (and indeed Vouch sends them roughly half way through the previous epoch). Making a change to spread the submission of validator registrations out could provide a solution in itself, and I have kicked off a discussion around this.
Abstract
This proposal removes immediate signature verification of new validator registrations, making the verification asynchronous. The verification status would no longer be returned from the registration endpoint and would instead be queried from the data API.
This change removes a CPU bottleneck in the relay, along with possible DoS attack vectors, and allows the registration process to be resilient to high loads.
With the current process, relays don't have an even load and relay operators need to use expensive infrastructure to cover the load spikes. As the signature verification failure status is not used in the flow, smoothing out the spikes should greatly reduce relay maintenance costs, increasing the number of people who can afford to run relays.
Motivation
This change addresses a number of problems and threats to the relay ecosystem that result from verifying registration signatures at the moment a validator registration is submitted.
Mainnet currently has more than 400k validators and we expect this number to grow with Ethereum adoption. Existing mechanisms re-register validators on every epoch, resending hundreds of thousands of validator registrations in the first few seconds of an epoch.
Verifying signatures is not a computationally trivial operation - to carry the CPU load, a single relay instance needs to run on an oversized server whose computing power sits unused for the rest of the epoch.
The current implementation of [mev-boost](https://github.com/flashbots/mev-boost/blob/main/server/service.go#L276-L284) does not return error contents back to the validator on failed registration.
The result of verifying registration signatures is also not used immediately, as the registrations are only stored to be returned later by the API endpoint ([link](https://github.com/flashbots/mev-boost-relay/blob/174a4a66280aa0289551f61dbabbb17ec202c18d/services/api/service.go#L1420)).
Current network traffic characteristics are similar to a DDoS attack, and the current mechanism creates an attack vector where a bad actor sends its own registration slightly ahead of time and then floods the server with incorrect registrations. It's also possible to completely clog the relay with just the number of new registrations.
Prior Art
The recurrent registration problem was reported a few times before, but the core of the problem was never satisfactorily resolved.
https://github.com/ethereum/builder-specs/issues/24
This change is meant to eliminate the CPU-bound performance problems, without changing the entire network’s behavior.
Detailed Description
Current Validator registration process:
This proposal aims to make the signature verification (step 4) asynchronous.
The only time a good, lawful and honest validator may be concerned about its signature state is during its initial or subsequent deployments, i.e. when configuration can change. There is no benefit to re-confirming that it is still correct on every registration.
Therefore, the information about the state of verification may be offloaded into a separate endpoint and removed from (POST) `/relay/v1/builder/validators`. It can be achieved by extending `/relay/v1/data/validator_registration` with an additional enum field - `status`.
Go (possible implementation)
The endpoint would then return:
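A minimal sketch of such a `status` enum and the JSON it would produce is given here under stated assumptions. The values `unverified` and `verified` follow the wording of this proposal; `invalid` is an assumed name for a failed check. `RegistrationResponse` and `encode` are hypothetical, and the existing registration fields returned by the data endpoint are left out, with only a placeholder `pubkey` kept for brevity.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// RegistrationStatus is a hypothetical string enum for the proposed `status` field.
type RegistrationStatus string

const (
	StatusUnverified RegistrationStatus = "unverified" // default state after pre-checks
	StatusVerified   RegistrationStatus = "verified"   // signature checked and valid
	StatusInvalid    RegistrationStatus = "invalid"    // assumed name for a failed check
)

// RegistrationResponse is an assumed, trimmed-down response shape.
// Only the `status` field comes from the proposal itself.
type RegistrationResponse struct {
	Pubkey string             `json:"pubkey"`
	Status RegistrationStatus `json:"status"`
}

// encode renders the response as JSON.
func encode(r RegistrationResponse) (string, error) {
	b, err := json.Marshal(r)
	return string(b), err
}

func main() {
	out, err := encode(RegistrationResponse{Pubkey: "0xabc", Status: StatusUnverified})
	if err != nil {
		panic(err)
	}
	fmt.Println(out) // {"pubkey":"0xabc","status":"unverified"}
}
```

A relay that keeps verifying on submission could always store and return `verified`, which matches the "static status" option mentioned in the discussion.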
The new process would assume that after successful initial pre-checks (e.g. submission time, known validator) every registration would be persisted with the default unverified state.
It should be left for the relay development team to decide on implementation details; however, the goal for the verification itself is to become "eventually verified". This new flow would allow various improvements, including but not limited to verifying signatures in a background process, throttling the number of parallel calculations, or verifying signatures only upon request.
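One way the "eventually verified" flow with throttled background verification might look is sketched here. It is not taken from any relay implementation: `Store`, `VerifyAll`, and the `Invalid` status name are hypothetical, public keys are plain strings for brevity, and the `verify` callback stands in for the real BLS signature check.

```go
package main

import (
	"fmt"
	"sync"
)

type Status string

const (
	Unverified Status = "unverified"
	Verified   Status = "verified"
	Invalid    Status = "invalid" // assumed name for a failed check
)

// Store keeps the verification status per validator public key.
type Store struct {
	mu     sync.Mutex
	status map[string]Status
}

func NewStore() *Store { return &Store{status: make(map[string]Status)} }

func (s *Store) Set(key string, st Status) {
	s.mu.Lock()
	defer s.mu.Unlock()
	s.status[key] = st
}

func (s *Store) Get(key string) Status {
	s.mu.Lock()
	defer s.mu.Unlock()
	return s.status[key]
}

// VerifyAll verifies pending registrations with at most `limit` checks
// running at the same time, and records the outcome in the store.
func VerifyAll(s *Store, pending []string, limit int, verify func(key string) bool) {
	if limit < 1 {
		limit = 1 // avoid blocking forever on an unbuffered semaphore
	}
	sem := make(chan struct{}, limit)
	var wg sync.WaitGroup
	for _, key := range pending {
		wg.Add(1)
		sem <- struct{}{} // blocks while `limit` verifications are in flight
		go func(k string) {
			defer wg.Done()
			defer func() { <-sem }()
			if verify(k) {
				s.Set(k, Verified)
			} else {
				s.Set(k, Invalid)
			}
		}(key)
	}
	wg.Wait()
}

func main() {
	s := NewStore()
	pending := []string{"0x01", "0x02", "bad"}
	for _, k := range pending {
		s.Set(k, Unverified) // persisted right after the pre-checks
	}
	// Placeholder check: everything except "bad" counts as valid.
	VerifyAll(s, pending, 2, func(k string) bool { return k != "bad" })
	fmt.Println(s.Get("0x01"), s.Get("bad")) // verified invalid
}
```

The `limit` parameter is what lets an operator size the verification load to the hardware instead of to the registration spike.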
For existing relay implementations it would still be possible to preserve verification calculations on submission and save as already verified.
Backward Compatibility
This change introduces a weak inconsistency for people who were expecting to find an error on incorrect signature - that should no longer be returned.
Change to the `/relay/v1/data/validator_registration` endpoint is additive - meaning there are no protocol-breaking changes.
In existing codebases, the value can still be calculated in the same place as it was before and saved as already verified. So no big immediate changes should be needed.
Dependencies
This proposal doesn’t depend on any other work.
Risks and Security Considerations
There is no standard for the multi-validator registration process - it is unclear how relays should behave when one validator in a batch fails. From the user perspective, it's undesirable to fail all validators in the payload if one has a broken signature. This change allows all validators that passed the pre-check to be registered, with individual registrations discarded only when needed.
Rationale and Alternatives
As described above this change targets performance improvements for good, lawful and honest validators, as signatures may only change during the deployment of a new configuration, followed by a process restart. There is no benefit to verifying that one’s signature is still correct for every request. Furthermore, current implementations of processes like mev-boost would not return this information back to the validator either.
This simple change can allow relays to use smaller servers, utilizing more of the CPU idle time - as we no longer have a 3s window for verifying >400k signatures - allowing more people to afford running relays.
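To put the 3s window in perspective, here is a rough back-of-the-envelope calculation. It assumes exactly 400k registrations and that verification could be spread evenly over a whole mainnet epoch (32 slots of 12 seconds); `requiredRate` is a hypothetical helper, and real traffic would not be perfectly uniform.

```go
package main

import "fmt"

const (
	validators     = 400_000 // lower bound used in this proposal
	spikeWindowSec = 3       // burst window mentioned above
	slotsPerEpoch  = 32      // mainnet
	secondsPerSlot = 12      // mainnet
)

// requiredRate returns how many verifications per second are needed
// to finish n checks within windowSec seconds.
func requiredRate(n, windowSec int) float64 {
	return float64(n) / float64(windowSec)
}

func main() {
	epochSec := slotsPerEpoch * secondsPerSlot // 384 seconds
	fmt.Printf("burst: %.0f sig/s\n", requiredRate(validators, spikeWindowSec))
	fmt.Printf("spread over epoch: %.0f sig/s\n", requiredRate(validators, epochSec))
}
```

Under these assumptions the required throughput drops from roughly 133,000 to roughly 1,040 verifications per second, a 128-fold reduction.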