Cap validator count at 1m with sortition

One of the weaknesses of the current spec is the very high variance in client load: clients need to be designed to potentially handle an extremely high load supporting 4 million validators, but real-life load is likely to be much lower (expected to be 100x lower close to launch). This means that validator operators must either get a powerful computer "just in case" or run the risk of being unable to keep up if far more validators join than expected.

This post proposes one possible solution to this conundrum: capping the number of active validators at roughly 1 million (actually 2**20) in a randomized and fair (and hence unexploitable) way.

Outline

Add a "dormant" state (in addition to "awaiting activation" / "activated" / "exited" / "withdrawn"). This would be done either by adding a sleep_epoch and wake_epoch, or by adding a dormant:bool and dormancy_transition_epoch. Either way the transition needs to be stored as an epoch, to preserve the invariant that all state changes are predictable 4 epochs ahead. Dormant validators can skip the queue to exit.
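As a rough illustration of the first option, here is a minimal sketch of a validator record with sleep_epoch and wake_epoch fields. The field layout, the helper names and the TRANSITION_DELAY constant are assumptions made for this example and are not taken from the spec; only the 4-epoch lookahead and the use of a far-future sentinel follow the text above.

```python
from dataclasses import dataclass

FAR_FUTURE_EPOCH = 2**64 - 1  # sentinel meaning "not scheduled"
TRANSITION_DELAY = 4          # hypothetical name for the 4-epoch lookahead


@dataclass
class Validator:
    activation_epoch: int = FAR_FUTURE_EPOCH
    exit_epoch: int = FAR_FUTURE_EPOCH
    sleep_epoch: int = FAR_FUTURE_EPOCH  # epoch at which dormancy starts
    wake_epoch: int = FAR_FUTURE_EPOCH   # epoch at which dormancy ends


def is_dormant(v: Validator, epoch: int) -> bool:
    # Dormant from sleep_epoch (inclusive) until wake_epoch (exclusive).
    return v.sleep_epoch <= epoch < v.wake_epoch


def is_active(v: Validator, epoch: int) -> bool:
    # Active means activated, not yet exited, and not currently dormant.
    return v.activation_epoch <= epoch < v.exit_epoch and not is_dormant(v, epoch)


def schedule_sleep(v: Validator, current_epoch: int) -> None:
    # The change takes effect TRANSITION_DELAY epochs later, so it is
    # visible in the state before it happens.
    v.sleep_epoch = current_epoch + TRANSITION_DELAY
    v.wake_epoch = FAR_FUTURE_EPOCH


def schedule_wake(v: Validator, current_epoch: int) -> None:
    v.wake_epoch = current_epoch + TRANSITION_DELAY
```

Because both transitions are written as future epochs, anyone reading the state at epoch e already knows the validator's status up to epoch e + 4.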

If the total number of active validators at the current time exceeds 2**20, then add the following rules:

Validators that are activated via the activation queue are instead moved into the dormant state.
Let N be the number of validators that normally would be activated via the activation queue mechanism assuming 2**20 active validators (currently that's 16). At the end of each epoch, N randomly selected dormant validators are activated, and N randomly selected active validators are made dormant.
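The two rules above can be sketched roughly as follows. This is an illustration under several assumptions: validators are plain ids held in Python lists, the function and parameter names are made up for the example, the cap is treated as reached when the active count is at or above it, and Python's random module stands in for whatever in-protocol randomness a real implementation would use. The churn constants are the ones that yield N = 16 at 2**20 active validators.

```python
import random

MAX_ACTIVE = 2**20
MIN_PER_EPOCH_CHURN_LIMIT = 4
CHURN_LIMIT_QUOTIENT = 2**16


def churn_limit(active_count):
    # Number of validators the activation queue admits per epoch.
    return max(MIN_PER_EPOCH_CHURN_LIMIT, active_count // CHURN_LIMIT_QUOTIENT)


def pop_random(items, k, rng):
    # Remove and return k distinct random elements from a list.
    picked = []
    for _ in range(k):
        i = rng.randrange(len(items))
        items[i], items[-1] = items[-1], items[i]
        picked.append(items.pop())
    return picked


def process_epoch(active, dormant, queue, cap=MAX_ACTIVE, rng=random):
    # Mutates the three lists in place.
    if len(active) < cap:
        # Under the cap: ordinary activation. Stopping at the cap is an
        # assumption of this sketch.
        take = min(churn_limit(len(active)), cap - len(active), len(queue))
        for _ in range(take):
            active.append(queue.pop(0))
        return
    n = churn_limit(cap)
    # Rule 1: validators leaving the activation queue become dormant.
    for _ in range(min(n, len(queue))):
        dormant.append(queue.pop(0))
    # Rule 2: swap n random dormant validators with n random active ones.
    # Both groups are drawn before either is reinserted, so a validator
    # put to sleep this epoch cannot be woken in the same epoch.
    k = min(n, len(dormant), len(active))
    woken = pop_random(dormant, k, rng)
    slept = pop_random(active, k, rng)
    active.extend(woken)
    dormant.extend(slept)
```

The active set stays at exactly the cap once it is reached, while membership turns over by up to N validators per epoch.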

The random selection is important, because it ensures that (i) an attacker cannot join with new validators and replace existing participants without being equally diluted themselves, and (ii) there is no benefit from exiting-then-reentering.

Economic effects

In the case where there are more than 2**20 validators, this proposal would have two sets of consequences for validators. First, per-validator rewards would drop by 1% per 1% gain in staking participation (instead of the status quo: 0.5% drop per 1% gain in participation). Second, validators' costs would go down, because validators would be offline some of the time, and validators would have more freedom to remove some of their funds reliably, reducing implied capital costs.
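The 0.5% and 1% figures can be checked with a small calculation. The sketch below assumes that a validator's reward while active scales with the inverse square root of the number of active validators (which is what gives the 0.5% status quo figure), and that above the cap each validator is active for a fraction cap / total of the time. The function names and the arbitrary reward units are choices made for this example.

```python
import math

CAP = 2**20


def relative_reward(total_validators, cap=CAP):
    # Average per-validator reward rate, in arbitrary units.
    if total_validators <= cap:
        # Under the cap: everyone is active, reward ~ 1 / sqrt(total).
        return 1 / math.sqrt(total_validators)
    # At the cap: reward while active is fixed at 1 / sqrt(cap), but each
    # validator is only active cap / total of the time on average.
    fraction_active = cap / total_validators
    return fraction_active / math.sqrt(cap)


def pct_change(total, cap=CAP, bump=0.01):
    # Percentage change in per-validator reward when participation grows
    # by `bump` (1% by default).
    before = relative_reward(total, cap)
    after = relative_reward(total * (1 + bump), cap)
    return 100 * (after / before - 1)


print(pct_change(CAP // 2))  # under the cap: roughly -0.5
print(pct_change(2 * CAP))   # above the cap: roughly -1.0
```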

Note that the transition between the under-the-cap regime and the at-the-cap regime is gradual, because if the number of total participating validators is only slightly above 2**20 then any validator that is forced into dormancy can expect to be woken back up very quickly.
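To put rough numbers on "very quickly", here is a back-of-the-envelope sketch. It assumes a steady state in which N = 16 dormant validators are woken uniformly at random each epoch, so a dormant validator's wait is approximately geometric with mean (pool size) / N. The function names are made up for the example, and new arrivals during the wait are ignored.

```python
CAP = 2**20
N = 16  # validators rotated per epoch at the cap


def expected_dormant_epochs(total, cap=CAP, n=N):
    # Approximate mean number of epochs a validator stays dormant.
    dormant_pool = total - cap
    if dormant_pool <= 0:
        return 0.0
    # If the pool is smaller than n, everyone is woken the next epoch.
    return max(1.0, dormant_pool / n)


def expected_active_epochs(cap=CAP, n=N):
    # Approximate mean stretch of activity before being made dormant.
    return cap / n


def fraction_dormant(total, cap=CAP):
    # Long-run share of time a validator spends dormant.
    return max(0, total - cap) / total


for extra in (160, 16_000, CAP):
    total = CAP + extra
    print(extra, expected_dormant_epochs(total), fraction_dormant(total))
```

With only a few hundred validators above the cap, the expected dormancy is a handful of epochs, compared with an expected active stretch of 65536 epochs, so the effect on rewards starts out negligible and grows smoothly with the excess.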

Possible extensions

Use 2**19 (524288 validators, ~16.8m ETH) as the cap instead of 2**20.
We can make the rotation happen faster by rotating a fixed percentage of validators (eg. 1/64) every time the chain finalizes. This allows us to rotate validators quickly without violating BFT set intersection invariants that would cause a reduction in safety.
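For a sense of scale, the arithmetic behind both extensions is sketched below. It assumes 32 ETH per validator; the function names are made up for this example.

```python
ETH_PER_VALIDATOR = 32


def stake_at_cap(cap):
    # Total ETH actively staked when the active set is exactly at the cap.
    return cap * ETH_PER_VALIDATOR


def rotation_size(active_count, denominator=64):
    # Validators rotated per finalization when rotating a fixed fraction.
    return active_count // denominator


print(stake_at_cap(2**19))   # ETH staked at the smaller cap
print(stake_at_cap(2**20))   # ETH staked at the original cap
print(rotation_size(2**20))  # validators swapped per finalization at 1/64
```

Rotating 1/64 of a 2**20 active set moves 16384 validators per finalization, compared with 16 per epoch under the base proposal.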

Simulation code

Here is some quick simulation code that shows what happens if there are 100 active validators and 50 new ones join, assuming a cap of 100. The distribution quickly stabilizes around the proportional split (roughly 67 old, 33 new), since each of the 150 validators ends up active about two thirds of the time.

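import random

# Start at the cap: 100 original validators (ids 0..99) are active, none dormant.
active = list(range(100))
dormant = []
for i in range(200):
    # During the first 50 rounds, one new validator (id >= 100) joins per round.
    # Since the cap is already reached, it enters the dormant pool.
    if i < 50:
        dormant.append(100 + i)
    # Rotate one validator per round: a random active validator goes dormant,
    # then a random dormant validator (possibly the same one) is activated.
    dormant.append(active.pop(random.randrange(len(active))))
    active.append(dormant.pop(random.randrange(len(dormant))))
    print("Active: {} original {} new".format(
        len([x for x in active if x < 100]),
        len([x for x in active if x >= 100])
    ))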
