cockroachdb / cockroach

CockroachDB — the cloud native, distributed SQL database designed for high availability, effortless scale, and control over data placement.
https://www.cockroachlabs.com

kvserver: consider prioritizing ranges for consistency checks #87255

Open erikgrinaker opened 2 years ago

erikgrinaker commented 2 years ago

Currently, the consistency checker schedules ranges on an ad hoc basis -- the replica scanner will run each local range through the queue every 10 minutes, and the queue will schedule any ranges that haven't been checked for 24 hours. On large clusters, it can take significantly longer than 24 hours to get through all ranges, and some ranges may, by pure chance, rarely or never get scheduled.

We should consider some sort of prioritization scheme here, e.g. prioritizing ranges that have actually seen writes since the last consistency check (even though e.g. Pebble compactions or hardware issues may cause consistency failures on cold ranges too). We should also consider giving more weight to ranges depending on how long it's been since they were last checked, so that e.g. every range ends up being checked at least weekly.

Jira issue: CRDB-19239

Epic CRDB-39898

blathers-crl[bot] commented 2 years ago

cc @cockroachdb/replication

pav-kv commented 2 years ago

Exploring the possibility of using the "last update" timestamp in MVCC stats to help prioritise ranges for consistency checks.

Here are some stats from one node of a fairly large cluster (~20 nodes, ~1.5 TB data per node):

$ jq '.state.state | (1665524192-.stats.last_update_nanos/1000000000)/3600' *.json | sort -n | ./histogram.awk | sort -n -k 1
0.00 -> 9179  -- 72.5%, c 72.5%
1.00 -> 173   -- 1.4%,  c 73.9%
2.00 -> 134   -- 1.1%,  c 74.9%
3.00 -> 33    -- 0.3%,  c 75.2%
4.00 -> 35    -- 0.3%,  c 75.4%
5.00 -> 23    -- 0.2%,  c 75.6%
6.00 -> 20    -- 0.1%,  c 75.8%
7.00 -> 15    -- 0.1%,  c 75.9%
8.00 -> 77    -- 0.6%,  c 76.5%
16.00 -> 319  -- 2.5%,  c 79.0%
24.00 -> 706  -- 5.6%,  c 84.6%
32.00 -> 23   -- 0.2%,  c 84.8%
40.00 -> 47   -- 0.4%,  c 85.2%
48.00 -> 1039 -- 8.2%,  c 93.4%
56.00 -> 399  -- 3.2%,  c 96.5%
64.00 -> 256  -- 2.0%,  c 98.5%
72.00 -> 160  -- 1.3%,  c 99.8%
80.00 -> 25   -- 0.2%,  c 100%

This splits time into 1h buckets up to 8h, and then 8h buckets onwards. It shows how many ranges were updated within each bucket, and cumulatively. 73% were updated within the last hour, and 79% within the last day.
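
For reference, the bucketing above boils down to something like the following rough Go sketch (the actual histogram.awk script isn't included here, and the output formatting is approximate). It reads one age-in-hours per line from stdin and prints per-bucket counts with cumulative percentages:

// Rough sketch of the bucketing used above: ages below 8h go into
// 1h-wide buckets, ages of 8h and above go into 8h-wide buckets.
package main

import (
	"bufio"
	"fmt"
	"math"
	"os"
	"sort"
	"strconv"
)

func bucket(ageHours float64) float64 {
	if ageHours < 8 {
		return math.Floor(ageHours) // 1h buckets: 0, 1, ..., 7
	}
	return math.Floor(ageHours/8) * 8 // 8h buckets: 8, 16, 24, ...
}

func main() {
	counts := map[float64]int{}
	total := 0
	sc := bufio.NewScanner(os.Stdin)
	for sc.Scan() {
		age, err := strconv.ParseFloat(sc.Text(), 64)
		if err != nil {
			continue
		}
		counts[bucket(age)]++
		total++
	}
	keys := make([]float64, 0, len(counts))
	for k := range counts {
		keys = append(keys, k)
	}
	sort.Float64s(keys)
	cum := 0
	for _, k := range keys {
		cum += counts[k]
		fmt.Printf("%.2f -> %d -- %.1f%%, c %.1f%%\n",
			k, counts[k], 100*float64(counts[k])/float64(total), 100*float64(cum)/float64(total))
	}
}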

Interestingly, there are bumps around multiples of 24h.

What are the conditions under which the MVCC last-update timestamp is touched? Does it include compactions or any periodic jobs?

erikgrinaker commented 2 years ago

What are the conditions under which the MVCC last-update timestamp is touched? Does it include compactions or any periodic jobs?

Likely every write request. You could spin up a local cluster with some additional logging whenever it changes on a range, and look at the commands in the batch that updated it.

erikgrinaker commented 2 years ago

To be clear, it can only be affected by Raft commands, typically from RPC write requests. Compactions run below Raft, and thus don't affect it. But MVCC GC runs above Raft, and will affect it if it GCs anything.

pav-kv commented 1 year ago

We should also consider giving more weight to ranges depending on how long it's been since they were last checked

This is already taken into account: https://github.com/cockroachdb/cockroach/blob/f188d21d1853ca8b81d9c4f0ab46d7c0fa34ee44/pkg/kv/kvserver/consistency_queue.go#L130-L154

pav-kv commented 1 year ago

However: https://github.com/cockroachdb/cockroach/blob/c4d2def27bcdf7a80a668abeb9695c9153457b22/pkg/kv/kvserver/queue.go#L374-L376

We may still skip some ranges if they are consistently scanned last: no matter what priority they have, they will be dropped once the queue is full.

I wonder:

  1. How other queues avoid this problem.
  2. Whether it's possible to "force" push into the queue when it's full, if our priority is higher than that of an existing entry (we could replace it).

pav-kv commented 1 year ago

(2) might already be happening. Checking. Update: right, it does: https://github.com/cockroachdb/cockroach/blob/c4d2def27bcdf7a80a668abeb9695c9153457b22/pkg/kv/kvserver/queue.go#L762-L766

So we don't actually have the problem described in the previous comment. All replicas get processed eventually. If there is overload, only the top K replicas with the oldest last attempt are processed in a given scan; the next scan processes another top K, and so on.
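
To illustrate the eviction behaviour linked above (this is just a simplified model, not the actual pkg/kv/kvserver/queue.go code): a bounded priority queue that, when full, evicts its lowest-priority entry to make room for a higher-priority one, so a replica with high enough priority is never silently dropped just because the queue is full.

// Simplified model: a bounded priority queue with evict-lowest-on-full
// semantics. Identifiers here (item, boundedQueue, ...) are made up for
// the sketch and don't correspond to the real code.
package main

import "fmt"

type item struct {
	rangeID  int
	priority float64
}

type boundedQueue struct {
	maxSize int
	items   []item
}

// Add inserts it if there is room, or replaces the current
// lowest-priority entry if the new item outranks it.
func (q *boundedQueue) Add(it item) bool {
	if len(q.items) < q.maxSize {
		q.items = append(q.items, it)
		return true
	}
	lowest := 0
	for i := range q.items {
		if q.items[i].priority < q.items[lowest].priority {
			lowest = i
		}
	}
	if it.priority <= q.items[lowest].priority {
		return false // dropped: everything queued outranks it
	}
	q.items[lowest] = it // evict the lowest-priority entry
	return true
}

// PopHighest removes and returns the highest-priority entry.
func (q *boundedQueue) PopHighest() (item, bool) {
	if len(q.items) == 0 {
		return item{}, false
	}
	highest := 0
	for i := range q.items {
		if q.items[i].priority > q.items[highest].priority {
			highest = i
		}
	}
	it := q.items[highest]
	q.items = append(q.items[:highest], q.items[highest+1:]...)
	return it, true
}

func main() {
	q := &boundedQueue{maxSize: 2}
	q.Add(item{rangeID: 1, priority: 1.0})
	q.Add(item{rangeID: 2, priority: 2.0})
	q.Add(item{rangeID: 3, priority: 3.0}) // evicts range 1
	for {
		it, ok := q.PopHighest()
		if !ok {
			break
		}
		fmt.Println(it.rangeID, it.priority) // prints 3 then 2
	}
}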

pav-kv commented 1 year ago

I think the important property that we need to maintain is: no matter what, a replica is eventually added to the queue.

Currently this is trivially guaranteed because the priority is proportional to the idle time, and processing a replica drops it behind any unprocessed replicas. If we mix the "last modification" timestamp into this priority, it must not introduce eternal starvation for infrequently or never-modified replicas.

One way to achieve this is:

Currently, priority = (now - last_queue) / 24h, so after 24h it grows linearly starting from 1.

To mix in the "modification time", we can say: priority = (now - last_queue) / 24h + max(0, last_modified + 24h - now) / 24h. I.e., when a replica is modified (now == last_modified), it gets a +1 bump in priority (equivalent to 24h of idle time), and the bump decays back to 0 over the following 24h. The major factor is still now - last_queue.

Ex 1. The scan period is 24h, and the load is below the limit, so the node can actually scan all replicas within 24h. Replicas that were recently touched will tend to be processed first, but each replica is still processed once per 24h.

Ex 2. The node can't make it in 24h; it takes 3d to consistency-check all replicas. Each day, a mix of replicas is considered: those that weren't scanned for 2-3d, and those that weren't scanned for 1-2d but were modified within the last day. As a result of this reshuffling, some rarely touched replicas will be processed once in 4d (instead of 3d previously), but that's unavoidable if we change priorities (we just need to make sure it's not 1000 days).
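
A minimal sketch of the proposed formula (field and function names are placeholders, not the actual code). It also shows the reshuffling from Ex 2 in rough numbers: a cold replica idle for 48h and a replica that was queued 24h ago but written just now end up with the same priority.

// Sketch of the proposed priority (not the current implementation):
//   priority = (now - lastQueued)/24h + max(0, lastModified + 24h - now)/24h
// A fresh modification adds at most +1 (the equivalent of 24h of idle
// time) and decays back to 0 over the next 24h, so the idle-time term
// still dominates and no replica can starve forever.
package main

import (
	"fmt"
	"time"
)

const window = 24 * time.Hour

func priority(now, lastQueued, lastModified time.Time) float64 {
	idle := now.Sub(lastQueued)
	bump := lastModified.Add(window).Sub(now)
	if bump < 0 {
		bump = 0
	}
	return idle.Seconds()/window.Seconds() + bump.Seconds()/window.Seconds()
}

func main() {
	now := time.Now()

	// Cold replica: last queued 48h ago, last modified long ago -> 2.0.
	fmt.Println(priority(now, now.Add(-48*time.Hour), now.Add(-10*24*time.Hour)))

	// Recently written replica: last queued 24h ago, modified just now
	// -> 1.0 + 1.0 = 2.0, i.e. it catches up with the cold one.
	fmt.Println(priority(now, now.Add(-24*time.Hour), now))
}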

@erikgrinaker @tbg The approach is raw and needs more thinking, but WDYT about it so far?