Closed · dankrad closed this issue 11 months ago
This is interesting, but it has significant complexity and is not comprehensive. For example:
If teams are looking at preventing concurrent signing, I would first look at preventing it at the keys. For example: lock-files next to keystores, remote signers with built-in slashing prevention (concurrent access is fine here), or hardware wallets that natively prevent concurrent access (these don't exist yet).
Every time you introduce a complex scheme "above the keys" you also add the risk of a false positive that might cause downtime. I'm not saying this scheme isn't worth considering, but I think we need to approach higher-level (than the keys) schemes with caution.
EDIT: I just learned this was discussed on the call last night (I wasn't present) so my response is lacking some context, apologies.
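For a rough sense of the "lock-files next to keystores" idea above, here is a minimal sketch (not any client's actual implementation; the paths and function names are made up for illustration). Note it only guards against two processes on the same machine, not against two machines sharing the same keys.

```python
import fcntl
import sys
from pathlib import Path

def acquire_keystore_lock(keystore_path: Path):
    """Take an exclusive, non-blocking lock on a file beside the keystore.

    If another validator-client process on this machine already holds the
    lock, exit rather than signing with the same keys concurrently.
    """
    lock_path = keystore_path.with_suffix(".lock")
    lock_file = open(lock_path, "w")
    try:
        fcntl.flock(lock_file, fcntl.LOCK_EX | fcntl.LOCK_NB)
    except BlockingIOError:
        sys.exit(f"keystore appears to be in use (lock held on {lock_path})")
    return lock_file  # keep the handle open for the lifetime of the process

# hypothetical usage:
# lock = acquire_keystore_lock(Path("validators/0xabc.../keystore.json"))
```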
Looking at this some more, another approach would be to wait `n` epochs and only start signing messages if no other message has been seen. This approach has the following differences to @dankrad's approach:

1. The VC just waits `n` epochs on reboot; no extra history from the BN is needed.
2. The validator cannot sign anything for the first `n` epochs after a restart.

If `n` is increased a little it also protects (sometimes) against starting two VCs at the same time. (2) is the clear trade-off for the simplicity. I'm not saying this is a perfect solution but it's exploring the space.
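A minimal sketch of what I mean, assuming a hypothetical beacon-node helper `seen_message_from(validator_index, epoch)` that reports whether any block or attestation from the given validator was observed (none of these names are real client APIs):

```python
N_WAIT_EPOCHS = 2  # how many epochs to stay silent after a reboot

def safe_to_start_signing(bn, validator_index: int, start_epoch: int) -> bool:
    """Return True once the quiet window has passed with no sign of another
    instance publishing messages for this validator."""
    current_epoch = bn.current_epoch()  # hypothetical helper
    if current_epoch < start_epoch + N_WAIT_EPOCHS:
        return False  # still inside the post-reboot quiet window
    for epoch in range(start_epoch, current_epoch):
        if bn.seen_message_from(validator_index, epoch):  # hypothetical helper
            return False  # someone else appears to be running this key
    return True
```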
I like the idea because it also protects against double starts.
Note that I already tuned my scheme so that it does not require any extra history capabilities from the BN, since I am only referencing the last attestation (which needs to be kept for LMD Ghost already).
> (which needs to be kept for LMD Ghost already)

Whilst we maintain information about that attestation for LMD Ghost, we don't actually maintain the `AttestationData` object.
For Lighthouse, all we can recall (without additional history capabilities) is the latest `attestation.data.beacon_block_root` and `attestation.data.target.epoch` for each validator.

EDIT: If those two values are identical then (if there are no client bugs) everything else in the `AttestationData` should be identical as well...
> For Lighthouse, all we can recall (without additional history capabilities) is the latest `attestation.data.beacon_block_root` and `attestation.data.target.epoch` for each validator.
Same for Prysm
Prysm also has functionality in place to check objects against the slasher server to assess whether or not a given object is going to produce a slashable condition before signing.
This, combined with a validator's local history of signed objects, seems to provide reasonable protections against slashings. However, it assumes that the slasher is synced and working properly and that the user has enabled these features. Currently the external validation (querying the slasher) is not enabled by default. The local storage of signed objects is enabled by default.
Querying the slasher also protects against potential clock skews. If someone were to set your system clock back by one hour, your validator might repeat the last hour's worth of events, which could very likely cause a double-sign slashing.
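To make the pre-signing check concrete, here is a rough sketch of the local-history part only (this is not Prysm's actual code or API; the external slasher query would sit alongside it). It keeps the last signed source/target epochs per validator and refuses anything that would be a double vote or a surround vote:

```python
from dataclasses import dataclass

@dataclass
class LastSignedAttestation:
    source_epoch: int
    target_epoch: int

def is_safe_to_sign(history: dict, validator_index: int,
                    source_epoch: int, target_epoch: int) -> bool:
    """Conservative check against the local store of signed attestations."""
    prev = history.get(validator_index)
    if prev is None:
        return True  # nothing signed before (or no local history available)
    if target_epoch <= prev.target_epoch:
        return False  # same or older target: would be a double vote (or stale)
    if source_epoch < prev.source_epoch:
        return False  # would surround the previously signed vote
    return True
```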
In any case, there is a possible race condition in the validators.
Assume the validators' clocks are perfectly in sync (+/- 1 ms).
For step 3, if both validators are talking to the same beacon node, you're probably OK but still at high risk. If not, then this doesn't solve the problem.
> For Lighthouse, all we can recall (without additional history capabilities) is the latest `attestation.data.beacon_block_root` and `attestation.data.target.epoch` for each validator.
Yeah, so I think checking these two against the local validator store of the latest signed message would be pretty valuable already.
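As a sketch of that check (all names here are hypothetical, not Lighthouse's or anyone's real API): the VC asks the BN for the latest attestation summary it can recall for the validator, i.e. the two fields above, and compares it against what the VC last signed locally; a mismatch suggests another instance is signing.

```python
def another_instance_detected(bn, local_store, validator_index: int) -> bool:
    """Compare the BN's latest recalled attestation against our local record.

    Both sides are (beacon_block_root, target_epoch) tuples, or None if
    nothing is known for this validator.
    """
    seen = bn.latest_attestation_summary(validator_index)    # hypothetical
    ours = local_store.last_signed_summary(validator_index)  # hypothetical
    if seen is None:
        return False  # the BN has seen nothing for this validator
    if ours is None or seen != ours:
        return True   # the latest network-visible attestation is not ours
    return False

# A VC could run this on start-up and refuse to operate (absent an override
# flag) whenever it returns True.
```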
> In any case, there is a possible race condition in the validators.
I am aware of this race condition, and it's unavoidable. I just want to catch those cases which we can easily catch in the current framework, which might already prevent a lot of user-error slashings.
This has been implemented in all clients as doppelganger detection (see the Nimbus guide as an example: https://nimbus.guide/doppelganger-detection.html).
One route is standardised for it:
Starting a validator client for the same validator twice is of course a user error. However, it's probably the most common user error that leads to slashing. If the two clients are started at the same time, there is no way to catch this error; however, in all other cases, the second client to be started can detect that the last attestation made by the validator does not match its local storage and refuse operation (unless an override flag is given).
Suggested solution: