Open nflaig opened 1 year ago
Yes, this was implemented at a time when we didn't have VC metrics. Now VC metrics don't show any APIs that may cause the lag, so I think we can remove that feature and run attestation duties for every slot.
cc @dapplion
Not opposed to removing it, but just be aware that our beacon node does not implement any overload protection at the API level
> our beacon node does not implement any overload protection at the API level
Other VCs don't implement such protections, and we should not make assumptions about which VC (or how many) the BN is connected to. This makes it even more important that we remove this from the Lodestar VC, so we get better insights into BN performance and, if required, implement some sort of protection there.
Edit: it seems like Lighthouse also implements this in their duty service.
We should log a warning on the VC when it's skipping, or do some more investigation here if we decide to remove the skipping.
We should look at removing this, since the assumption it was based on, from when the node suffered from really bad I/O lag, has since been invalidated.
The check isn't hurting anything, and in the best/worst case, it helps the node recover. I'd be in favor of leaving this as-is.
> We should log a warning on the VC that it's skipping

Is probably still a good idea, just so we don't skip a slot silently.
Lodestar VC consciously skips attestations for slots if the previous slot's tasks take > `SECONDS_PER_SLOT`, due to the `runAtMostEvery` function: https://github.com/ChainSafe/lodestar/blob/f136f952ddb861976b3cc9484914c49a06ee8d90/packages/validator/src/util/clock.ts#L90
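The skipping behavior can be illustrated with a deterministic sketch (a simplified logical-clock model, not the actual `runAtMostEvery` code; `simulateRunAtMostEvery` is a hypothetical name): each slot's task blocks the loop, and any slot whose boundary has already passed by the time the loop reaches it is dropped.

```typescript
// Hypothetical model of the skip-on-overrun pattern. taskDurations[i]
// is how many slots the task for slot i would take; the function
// returns which slots actually ran.
function simulateRunAtMostEvery(taskDurations: number[]): number[] {
  const executed: number[] = [];
  let time = 0; // logical time, measured in slots
  for (let slot = 0; slot < taskDurations.length; slot++) {
    if (time > slot) continue; // slot boundary already passed: skipped silently
    executed.push(slot);
    time = slot + taskDurations[slot]; // the task blocks the loop until it finishes
  }
  return executed;
}

// Slot 2's task takes 2 slots (e.g. an HTTP call hitting the
// SECONDS_PER_SLOT timeout), so slot 3 is skipped entirely.
simulateRunAtMostEvery([1, 1, 2, 1, 1]); // → [0, 1, 2, 4]
```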
### Rationale behind this
The rationale is understandable; however, based on my observations, the overload will most likely not be on the VC side but rather on the BN side, in which case the BN should make sure not to accept more tasks. If `runAtMostEvery` is just implemented to "protect" the BN, we should consider removing it. We should also avoid making assumptions about which BN implementation we are dealing with; it might be a DVT middleware or any other CL (Lighthouse, Prysm, etc.).

Other VCs seem to not care whether the tasks of the previous slot are still pending and just start tasks for the next slot. I think we should follow the same approach in the Lodestar VC to make it more robust against (expected) errors such as timeouts.
### Steps to reproduce
This issue was observed while testing Lodestar VC with the DVT middleware Charon (https://github.com/ObolNetwork/charon), where `getAggregatedAttestation` would not return a response due to a missing implementation of distributed aggregation selections. With the VC's default HTTP timeout of `SECONDS_PER_SLOT`, this causes the attestation routine for slot N to take more than `SECONDS_PER_SLOT`, so the VC skips slot N + 1.

### Current behavior
Attestation routines are not properly decoupled. Issues (such as timeouts) in an attestation routine for slot N can lead to missing ALL attestations in slot N + 1.
### Expected behavior
Attestation routines for each slot should be independent/decoupled from each other. Issues in a previous routine should not affect the next routine.
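A decoupled scheduler could look roughly like the sketch below (an illustrative assumption, not a concrete implementation proposal; `onEachSlot` is a hypothetical name): each slot's routine is started independently and its errors are contained and logged, so a timeout or failure in the routine for slot N cannot prevent the routine for slot N + 1 from running.

```typescript
type SlotTask = (slot: number) => Promise<void>;

// Fire each slot's routine independently; an error (e.g. an HTTP
// timeout) in one slot is caught and logged instead of blocking or
// skipping subsequent slots.
function onEachSlot(slots: number[], task: SlotTask): Promise<void[]> {
  return Promise.all(
    slots.map((slot) =>
      task(slot).catch((e) => {
        console.warn(`attestation routine for slot ${slot} failed: ${e}`);
      })
    )
  );
}
```

In a real VC loop, each routine would additionally be bounded by its own per-request timeout (e.g. an `AbortSignal`), so that protecting against overload is done per call rather than by silently dropping the next slot.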
Related https://github.com/ChainSafe/lodestar/issues/5315