ChainSafe / lodestar

🌟 TypeScript Implementation of Ethereum Consensus
https://lodestar.chainsafe.io
Apache License 2.0

Attestations are skipped for slots where previous slot tasks take > `SECONDS_PER_SLOT` #5314

Open nflaig opened 1 year ago

nflaig commented 1 year ago

The Lodestar VC consciously skips attestations for a slot if the previous slot's tasks take > `SECONDS_PER_SLOT`, due to the `runAtMostEvery` function.

https://github.com/ChainSafe/lodestar/blob/f136f952ddb861976b3cc9484914c49a06ee8d90/packages/validator/src/util/clock.ts#L90
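As an illustration (not the actual Lodestar code), the skip behavior of a `runAtMostEvery`-style loop can be reduced to a small helper that computes which slot's duties run next after a task finishes:

```typescript
// Illustrative sketch only: models how a runAtMostEvery-style loop
// behaves. After awaiting a slot's tasks, the loop sleeps until the
// next slot boundary, so any slot whose boundary passed while the
// task was still running is silently skipped.
const SECONDS_PER_SLOT = 12;

// Given when a slot's tasks finished (seconds since genesis),
// return the next slot whose duties would actually be started.
function nextScheduledSlot(taskEndSec: number): number {
  // Wait for the next slot boundary strictly after taskEndSec
  return Math.floor(taskEndSec / SECONDS_PER_SLOT) + 1;
}

// Slot 0 tasks finish on time (11s in): slot 1 duties run as expected.
console.log(nextScheduledSlot(11)); // 1
// Slot 0 tasks overrun into slot 1 (13s in): slot 1 is skipped
// and the next duties only run at slot 2.
console.log(nextScheduledSlot(13)); // 2
```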

Rationale behind this

If a task happens to take more than one slot to run, we might skip a slot. This is unfortunate; however, the alternative is to always process every slot, which risks creating a theoretically unlimited backlog of tasks. It was a conscious decision to drop tasks on an overloaded/latent system rather than overload it even more.

The rationale is understandable. However, based on my observations, the overload will most likely not be on the VC side but rather on the BN side, in which case the BN should make sure not to accept more tasks. If `runAtMostEvery` is implemented only to "protect" the BN, we should consider removing it. We should also avoid making assumptions about which BN implementation we are dealing with; it might be a DVT middleware or any other CL client (Lighthouse, Prysm, etc.).

Other VCs do not seem to care whether the tasks of the previous slot are still pending and simply start tasks for the next slot. I think we should follow the same approach in the Lodestar VC to make it more robust against (expected) errors such as timeouts.

Steps to reproduce

This issue was observed while testing the Lodestar VC with the DVT middleware Charon (https://github.com/ObolNetwork/charon), where `getAggregatedAttestation` would not return a response due to a missing implementation of distributed aggregation selections. Combined with the VC's default HTTP timeout of `SECONDS_PER_SLOT`, this caused the attestation routine for slot N to take more than `SECONDS_PER_SLOT`, which in turn caused the VC to skip slot N + 1.

Current behavior

Attestation routines are not properly decoupled. Issues (such as timeouts) in an attestation routine for slot N can lead to missing ALL attestations in slot N + 1.

Expected behavior

Attestation routines for each slot should be independent/decoupled from each other. Issues in a previous routine should not affect the next routine.
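A minimal sketch of the decoupled behavior proposed here (names are hypothetical, not Lodestar APIs): each slot's routine is started without awaiting the previous one, so a slow or timed-out routine for slot N cannot delay slot N + 1.

```typescript
// Hypothetical sketch of decoupled per-slot scheduling. Each slot's
// duties are fired without awaiting the previous slot's duties;
// errors and timeouts are handled per slot instead of stalling the
// scheduling loop.
async function runEverySlot(
  runAttestationDuties: (slot: number) => Promise<void>,
  waitForSlot: (slot: number) => Promise<void>,
  lastSlot: number
): Promise<void> {
  for (let slot = 0; slot <= lastSlot; slot++) {
    // Only wait for the slot boundary, never for previous duties.
    await waitForSlot(slot);
    // Fire-and-forget: a failure in slot N's routine is logged and
    // does not prevent slot N + 1 from being scheduled.
    runAttestationDuties(slot).catch((e) => {
      console.warn(`attestation routine for slot ${slot} failed`, e);
    });
  }
}
```

The key design choice is that the loop awaits only the clock, not the work; a hung HTTP call in one slot's routine is contained to that slot.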

Related https://github.com/ChainSafe/lodestar/issues/5315

twoeths commented 1 year ago

yes, this was implemented at a time when we didn't have VC metrics. Now the VC metrics don't show any APIs that may cause the lag, so I think we can remove that feature and perform attestation duties for every slot. cc @dapplion

dapplion commented 1 year ago

Not opposed to removing it, but just be aware that our beacon node does not implement any overload protection at the API level

nflaig commented 1 year ago

our beacon node does not implement any overload protection at the API level

Other VCs don't implement such protections, and we should not make assumptions about which VC (or how many) the BN is connected to. This makes it even more important that we remove this from the Lodestar VC so we get better insight into BN performance and, if required, implement some sort of protection there.

Edit: it seems like Lighthouse also implements this in their duty service.

philknows commented 11 months ago

We should log a warning on the VC that it's skipping, or do some more investigation here if we decide to remove the skipping.

philknows commented 11 months ago

We should look at removing this, as the previous assumption dates from when the node suffered from really bad I/O lag and has since been invalidated.

wemeetagain commented 6 days ago

The check isn't hurting anything, and in the best/worst case, it helps the node recover. I'd be in favor of leaving this as-is.

nflaig commented 6 days ago

We should log a warning on the VC that it's skipping

This is probably still a good idea, just so we never skip a slot silently.
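A hypothetical sketch of such a warning (`logger.warn` stands in for the Lodestar logger; the skip detection itself is illustrative, not the actual clock code):

```typescript
// Illustrative only: when the scheduler detects it fell behind and is
// about to skip slots, emit a warning instead of skipping silently.
function warnIfSlotsSkipped(
  expectedSlot: number,
  actualSlot: number,
  logger: {warn: (msg: string) => void}
): void {
  if (actualSlot > expectedSlot) {
    logger.warn(
      `Previous slot tasks overran, skipping attestation duties ` +
        `for slots ${expectedSlot}..${actualSlot - 1}`
    );
  }
}
```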