ethereum / consensus-specs

Ethereum Proof-of-Stake Consensus Specifications
Creative Commons Zero v1.0 Universal

Better recovery from inactivity leaks #2098

Open dankrad opened 4 years ago

dankrad commented 4 years ago

Rationale:

The quadratic inactivity leak is a penalty for being offline that grows quadratically with time during periods in which the beacon chain doesn't finalize. Its effect is that, should more than 1/3 of validators be offline, within a few days or weeks these validators see growing losses to their balances, which reduces their share back to 1/3 of the total validator balance and enables finalization. (One popular misunderstanding is that it requires validator ejection to work. It does not depend on ejecting the offline validators, only on reducing their balances to diminish their weight in the FFG votes.)

However, at this point, the quadratic leak will stop for all validators, including those that remain offline. This means that if it was caused by a catastrophic event permanently disabling >1/3 of validators, or by a "User Activated Soft Fork" removing some validators deemed malicious from the network, the network will be in a state where only just over 2/3 of the validator balance is online when it recovers due to the leak. The offline validators will continue to suffer a normal inactivity penalty, but this is just the negative of the usual rewards and will thus take on the order of months to years to have any meaningful effect. This means that the network is going to be in a precarious state for a long time, having only just enough online stake to finalize when everything works perfectly. In practice, finalization will very often be delayed, or will not happen for extended periods of time, just because of minor outages.
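As a rough illustration of the threshold described above (a sketch, not spec code): with online balance A and offline balance B, finalization needs the online share A / (A + B) to reach 2/3, so the leak only has to shrink B down to A / 2 before it stops. The function name and the 60/40 example figures below are hypothetical.

```python
from fractions import Fraction


def offline_balance_at_recovery(online: Fraction, offline: Fraction) -> Fraction:
    """Largest offline balance for which the online share is at least 2/3.

    online / (online + b) >= 2/3  is equivalent to  b <= online / 2.
    """
    return min(offline, online / 2)


# Hypothetical example: 40% of the stake goes offline permanently.
online, offline = Fraction(60), Fraction(40)
remaining = offline_balance_at_recovery(online, offline)

# Fraction of their balance the offline validators lose before the leak stops.
lost_fraction = 1 - remaining / offline
# Share of the total balance that is online at the moment finality resumes.
online_share = online / (online + remaining)

print(remaining, lost_fraction, online_share)
```

In this example the offline validators lose a quarter of their balance and the online share lands exactly on 2/3, so any further small outage is enough to stall finalization again.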

Idea:

To fix this poor outcome, we can change the inactivity leak as follows: instead of stopping the leak for all validators when the chain finalizes, we keep the leak going for each validator until it actually signs another attestation. This ensures that permanently offline validators keep experiencing a much quicker drain and are eventually kicked out of the validator set, so the chain can hopefully recover to almost 100% participation.

mcdee commented 4 years ago

If you're keeping track of the last time that a validator attested then wouldn't it be simpler to just apply the quadratic leak to all validators at all times as a "non-participation" penalty (removing the existing penalty system)?

dankrad commented 4 years ago

> If you're keeping track of the last time that a validator attested then wouldn't it be simpler to just apply the quadratic leak to all validators at all times as a "non-participation" penalty (removing the existing penalty system)?

That is probably a simpler approach, but may be seen as too punitive? We typically want to punish validators if their individual faults have actually led to real safety/liveness problems on chain.

One intermediate option: the quadratic leak applies for the whole period during which a validator is offline, provided that period included a stretch of at least 8 epochs of non-finalization.

One problem is that this makes it much more likely that you will lose a full 16 ETH for e.g. losing your keys (or dying -- your heirs might not be very quick at recovering your keys and continuing to validate, or even realizing that they have to do this). This is much more than is expected for a typical slashing, so it may be quite terrifying.

mcdee commented 4 years ago

> We typically want to punish validators if their individual faults have actually led to real safety/liveness problems on chain.

This doesn't seem to fit the current slashing punishment. If, for example, 10% of validators created a slashing event it would have ~0 impact on the safety/liveness of the chain (assuming the other validators are honest), but they are punished regardless: each validator would be penalised ~10 Ether and forcibly ejected. Anyway, not totally on-topic so I'll leave that bit.

If we think it's fair to penalise validators in a certain way after an inactivity leak, it seems to make sense to punish them that way even if there hasn't been an inactivity leak. Medalla is a good example of this: it had an inactivity leak, but right now doesn't. If I had a validator I started just before the inactivity leak, and another that started just after we recovered finalization, would they both be punished the same way now for not attesting, or would the older one have the inactivity leak applied and the newer one not? It seems that this post-hoc alternative punishment adds complexity and confusion, rather than having a single system across the board.

I agree that this is more punitive than the existing penalty system long-term, but in the short term it's actually beneficial. Some (very) quick maths suggests that the quadratic leak with the parameters for phase 0 launch would be far better for validators that don't validate for a few hours, which would cover those who have a computer crash overnight, short-term upgrade issues, etc. (see https://imgur.com/3oGHDXN.png for the graph, although as mentioned maths was quick so could be incorrect).

With current mainnet parameters we're looking at ~43 days of being offline before losing half your stake. Perhaps this could be relaxed further, or perhaps we have a multiplier for the quadratic leak depending on if the chain is reaching finality or not, but a single system for punishing non-attesting validators seems to be a cleaner solution than having conditionals. I do understand, however, if it's considered too punitive in general to have a quadratic punishment as opposed to a linear one when the chain is behaving.
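The ~43 day figure can be roughly reproduced with a simplified model. This is a sketch under stated assumptions, not the spec's accounting: the per-epoch penalty is taken as `balance * epochs_offline / INACTIVITY_PENALTY_QUOTIENT`, the quotient is assumed to be `2**26`, the penalty is applied to the actual balance rather than the effective balance, and base penalties and the short delay before the leak starts are ignored. The helper names are hypothetical.

```python
SECONDS_PER_SLOT = 12
SLOTS_PER_EPOCH = 32
# Assumed quotient; it reproduces the ~43 day figure quoted in this thread.
INACTIVITY_PENALTY_QUOTIENT = 2**26


def epochs_until_fraction_left(target: float, quotient: int) -> int:
    """Count offline epochs until the balance first drops to `target` of its start."""
    balance = 1.0
    epochs_offline = 0
    while balance > target:
        epochs_offline += 1
        # The per-epoch penalty grows linearly with time offline,
        # so the cumulative loss grows quadratically.
        balance -= balance * epochs_offline / quotient
    return epochs_offline


def epochs_to_days(epochs: int) -> float:
    return epochs * SLOTS_PER_EPOCH * SECONDS_PER_SLOT / 86_400


half_life_epochs = epochs_until_fraction_left(0.5, INACTIVITY_PENALTY_QUOTIENT)
half_life_days = epochs_to_days(half_life_epochs)
print(f"{half_life_epochs} epochs, about {half_life_days:.1f} days")
```

Under this model the time to lose half the stake scales with the square root of the quotient, so relaxing the penalty by a factor of four only doubles that time.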

dankrad commented 4 years ago

> If, for example, 10% of validators created a slashing event it would have ~0 impact on the safety/liveness of the chain (assuming the other validators are honest), but they are punished regardless: each validator would be penalised ~10 Ether and forcibly ejected.

This is true, but the argument would be that 10% is already getting in the direction of a coordinated attack. I am speaking more of <1% failures, which we currently don't punish harshly, in the interest of getting people to stake.

But this highlights one possible construction error of the inactivity leak: whereas the other anti-correlation penalties already impose serious penalties when failures are only part of the way to an attack, the inactivity leak only responds once a liveness failure has already happened. I guess the more consistent way to do this would be to always have a quadratic inactivity leak, but make it proportional to the percentage of offline stake. So, even at 10% offline you would get a serious inactivity leak, which is more consistent with other penalties.
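One way such a proportional leak could look, purely as a hypothetical sketch: the per-epoch penalty is the usual time-growing term scaled by the percentage of stake that is offline. The formula, the quotient, and every name below are assumptions made for illustration and appear in no spec; the balance is held fixed (no compounding) to keep the arithmetic simple.

```python
GWEI_PER_ETH = 10**9
# Hypothetical quotient, chosen only to give readable numbers.
HYPOTHETICAL_QUOTIENT = 2**26


def cumulative_loss_gwei(balance_gwei: int, epochs_offline: int, offline_percent: int) -> int:
    """Total loss after `epochs_offline` epochs when `offline_percent`% of stake is offline."""
    total = 0
    for epoch in range(1, epochs_offline + 1):
        # Grows with time offline (quadratic in total) and with offline stake.
        total += balance_gwei * epoch * offline_percent // (100 * HYPOTHETICAL_QUOTIENT)
    return total


balance = 32 * GWEI_PER_ETH
one_day = 225  # epochs per day at 6.4 minutes per epoch

loss_at_1 = cumulative_loss_gwei(balance, one_day, 1)
loss_at_10 = cumulative_loss_gwei(balance, one_day, 10)
loss_at_34 = cumulative_loss_gwei(balance, one_day, 34)
print(loss_at_1, loss_at_10, loss_at_34)
```

With this shape a lone offline validator pays very little, while the same downtime during a correlated outage costs progressively more, and there is no hard switch at the 1/3 threshold.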

mcdee commented 4 years ago

> I guess the more consistent way to do this would be to always have a quadratic inactivity leak, but make it proportional to the percentage of offline stake.

That sounds like a very interesting idea. Single system, unifies the current punishment and inactivity leak mechanisms, and proportional to the impact it is having on the network. Seems to tick all the boxes.

djrtwo commented 4 years ago

I think the following captures what we'd like to see while being a bit less punitive than @dankrad's original proposal

This avoids the case where we don't finalize for a while, e.g. 100 epochs, but then finalize one epoch and lose the inactivity leak quadratic build up.

mcdee commented 4 years ago

@djrtwo that seems to be attempting to address a different problem than the one @dankrad outlined. From what I can see, yours will increase the inactivity leak in the "10-on-100-off" situation where the majority of epochs are not finalized, but where, due to the existing mechanics of the inactivity leak, the punishment in the "100-off" part is very low, even if this pattern repeats indefinitely.

It does feel that these various methods could be unified into a single penalty system as Dankrad suggested.

seascape195 commented 3 years ago

Move a dead node to a "time out space" holding block. Require all 2.0 validators to post a next-of-kin contact. If a node is unresponsive for x time, move it to the holding block and contact the next of kin with instructions and a time limit to either get the node back online or terminate it for good. Meanwhile, pay out a diminishing return on the staked amount until the stake reaches zero or the node is reinstated and actively validating again.

dapplion commented 10 months ago

Implemented in the Altair hard fork by introducing a per-validator counter: the inactivity score.

https://github.com/ethereum/consensus-specs/blob/bf09b9a7c4a7b311e86823235815daf31b117574/specs/altair/beacon-chain.md#modified-get_inactivity_penalty_deltas
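For readers who don't want to open the spec, here is a simplified sketch of how a per-validator inactivity score can behave. It is loosely modelled on the Altair mechanism but is not the spec code: the `Validator` class and function names are hypothetical, and the constants are illustrative values that may differ from those at the linked commit.

```python
from dataclasses import dataclass

# Illustrative constants (assumptions; check the linked spec for the real values).
INACTIVITY_SCORE_BIAS = 4
INACTIVITY_SCORE_RECOVERY_RATE = 16
INACTIVITY_PENALTY_QUOTIENT = 3 * 2**24


@dataclass
class Validator:
    effective_balance: int  # Gwei
    inactivity_score: int = 0


def update_inactivity_score(v: Validator, attested: bool, in_leak: bool) -> None:
    """Per-epoch update: the score rises while offline and decays while attesting."""
    if attested:
        v.inactivity_score -= min(1, v.inactivity_score)
    else:
        v.inactivity_score += INACTIVITY_SCORE_BIAS
    # Outside a leak every score also recovers, but only gradually, so the
    # build-up is not wiped out by a single finalized epoch.
    if not in_leak:
        v.inactivity_score -= min(INACTIVITY_SCORE_RECOVERY_RATE, v.inactivity_score)


def inactivity_penalty(v: Validator, attested: bool) -> int:
    """Penalty in Gwei for one epoch; attesting validators pay nothing."""
    if attested:
        return 0
    return v.effective_balance * v.inactivity_score // (
        INACTIVITY_SCORE_BIAS * INACTIVITY_PENALTY_QUOTIENT
    )


offline = Validator(effective_balance=32 * 10**9)
for _ in range(100):  # 100 epochs offline while the chain is leaking
    update_inactivity_score(offline, attested=False, in_leak=True)
score_after_leak = offline.inactivity_score

# Finality resumes, but the validator is still offline: it keeps paying.
update_inactivity_score(offline, attested=False, in_leak=False)
print(score_after_leak, offline.inactivity_score, inactivity_penalty(offline, attested=False))
```

Because the score is tracked per validator, a validator that stays offline keeps paying a score-weighted penalty after finality resumes, while a validator that attests throughout keeps a score of zero.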