Open kvoli opened 1 year ago
> Given the state of a range's voters, the equivalence classes for nodes:
> ```
> equivalent   other_voter
> 11|12|13     (9|10)
> 9|10|13      (11|12)
> 9|10|11|12   (13)
> ```
Are all of 9, 10, 11, 12, 13 voters? Or is this a description of the nodes in the cluster and not necessarily who currently has the replicas? I didn't understand this equivalence class computation. I would have expected the constraint and lease preference to be trivially satisfied since all the voters are in the expected region, and the only thing we would be trying to improve is the diversity score (if we happened to have more than 1 replica in a zone). What thrashing happened in this case?
> Are all of 9, 10, 11, 12, 13 voters?
No. The setup is 5 voters, with a lease preference on `us-east-1` (nodes 9 10 11 12 13) and a constraint of 2 voters in `us-east-1`.
> [...] and the only thing we would be trying to improve is the diversity score (if we happened to have more than 1 replica in a zone)
This is for the diversity score only: each equivalence class groups stores with an equivalent diversity score. It helps explain the movements in the replica rebalancing snippets.
If there are only 2 voters in `us-east-1`, then for `11|12|13 (9|10)`, `(9|10)` is the voter not being touched, and `11|12|13` contains the voter being replaced along with its equally diverse alternatives.
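To make the grouping concrete, here is a minimal Go sketch (not the allocator's actual code; the locality strings and the prefix-based diversity score are simplified assumptions) that buckets the remaining stores by the diversity score they would produce next to the untouched voter:

```go
package main

import (
	"fmt"
	"sort"
	"strings"
)

// diversity is a toy version of locality diversity: the fraction of locality
// tiers (region, zone, ...) that differ between two stores. Real scoring is
// more involved; this only illustrates the grouping.
func diversity(a, b string) float64 {
	at, bt := strings.Split(a, ","), strings.Split(b, ",")
	n := len(at)
	if len(bt) < n {
		n = len(bt)
	}
	shared := 0
	for i := 0; i < n; i++ {
		if at[i] != bt[i] {
			break
		}
		shared++
	}
	return 1 - float64(shared)/float64(n)
}

func main() {
	// Assumed localities, matching the cluster described in the issue.
	locality := map[int]string{
		9: "region=us-east-1,zone=us-east-1a", 10: "region=us-east-1,zone=us-east-1a",
		11: "region=us-east-1,zone=us-east-1b", 12: "region=us-east-1,zone=us-east-1b",
		13: "region=us-east-1,zone=us-east-1c",
	}
	// Fix the voter that is not being touched, then bucket the remaining
	// stores by the diversity score they would produce next to it.
	for _, other := range []int{9, 11, 13} {
		classes := map[float64][]int{}
		for s, loc := range locality {
			if s == other {
				continue
			}
			d := diversity(loc, locality[other])
			classes[d] = append(classes[d], s)
		}
		for d, stores := range classes {
			sort.Ints(stores)
			fmt.Printf("other_voter=%d diversity=%.2f equivalent=%v\n", other, d, stores)
		}
	}
}
```

For each fixed voter, the highest-scoring bucket reproduces the `equivalent` column in the table above.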
The thrashing we saw looked like this (should have been included in the issue):
Note that near the end, rebalancing was disabled, which is why the series converged.
The primary thrashing cause, as far as I can tell, was capacity-change-triggered gossip updates overwriting local change estimates.
https://github.com/cockroachdb/cockroach/issues/93532 is similar to the sequencing in allocator2 and would be helpful here.
This is a tracking issue for undesirable behavior found in testing. Many of these issues have already been filed and improved. The purpose is to consolidate the discussion for future reference.
We are developing significant changes to the allocation system in https://github.com/cockroachdb/cockroach/issues/103320. This issue references existing concepts which don't all translate to the changed system; the principles still apply, but not all of the solutions do.
Thrashing
Allocation thrashing occurs when the cluster continuously reconfigures range replicas and leases without materially improving the long-term distribution.
Thrashing is always possible and is usually the result of trading off stability for reactiveness. Here's a list of known thrashing causes in CRDB; some are actionable, and some have had recent development to address them. The issues came from a specific instance of thrashing in a multi-region CRDB cluster running a uniform workload on `v22.2.9`.

Details on cluster configuration
**Localities**

There is a lease preference set on `us-east-1` and a constraint that 2 voters are required in the region. There are 3 zones in `us-east-1`:

```
us-east-1a: 9 10
us-east-1b: 11 12
us-east-1c: 13
```

Given the state of a range's voters, the equivalence classes for nodes:

```
equivalent   other_voter
11|12|13     (9|10)
9|10|13      (11|12)
9|10|11|12   (13)
```

The allocator storepool and the store rebalancer see different states of the cluster.
Details
A bad transfer from the perspective of the store rebalancer: s9 has higher load than s12.

```
transferring lease for r372 (qps=35.46) to store s9 (qps=2095.87) from local store s12 (qps=2036.49)
```

The storepool is used for most of the candidate selection, so it must have had a different view. This can be shown by reconstructing the logging on a node (`n12`). `qps` is the storepool's queries-per-second view; `sr_qps` is the store rebalancer's view.

```bash
avg ranges=224 leases=48 qps=909.61
----
 1: ranges=211 leases=8   qps=10.67   b4_qps=10.73
 2: ranges=211 leases=8   qps=0.07    b4_qps=0.07
 3: ranges=212 leases=9   qps=3.98    b4_qps=3.99
 4: ranges=228 leases=3   qps=6.00    b4_qps=6.00
 5: ranges=228 leases=4   qps=16.66   b4_qps=16.66
 6: ranges=228 leases=6   qps=7.72    b4_qps=7.73
 7: ranges=227 leases=5   qps=6.02    b4_qps=6.02
 8: ranges=229 leases=7   qps=6.11    b4_qps=6.11
 9: ranges=229 leases=112 qps=1730.95 b4_qps=1731.13 sr_qps=2511.04
10: ranges=218 leases=128 qps=3381.77 b4_qps=3465.22
11: ranges=231 leases=122 qps=2588.82 b4_qps=2588.82
12: ranges=232 leases=81  qps=2813.95 b4_qps=2813.65 sr_qps=1138.21*
13: ranges=236 leases=138 qps=1252.24 b4_qps=1250.63 sr_qps=2079.43
```
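One way to think about guarding against this (a hedged sketch only, not CRDB code; `loadView`, `disagrees`, and the tolerance are made-up names and numbers) is to cross-check the two load views before acting and skip the transfer when they disagree materially:

```go
package main

import (
	"fmt"
	"math"
)

// loadView maps store ID -> QPS as seen by one subsystem. In this sketch the
// two views stand in for the storepool (qps) and the store rebalancer (sr_qps).
type loadView map[int]float64

// disagrees reports whether the two views differ on a store's QPS by more than
// tolerance (a relative fraction). What counts as "materially" different is a
// policy choice; 0.5 here is arbitrary and purely illustrative.
func disagrees(a, b loadView, store int, tolerance float64) bool {
	av, bv := a[store], b[store]
	denom := math.Max(av, bv)
	if denom == 0 {
		return false
	}
	return math.Abs(av-bv)/denom > tolerance
}

func main() {
	storepool := loadView{9: 1730.95, 12: 2813.95}  // qps column above
	rebalancer := loadView{9: 2511.04, 12: 1138.21} // sr_qps column above
	target, source := 9, 12

	// Before transferring a lease from source to target, refuse to act if the
	// two subsystems can't even agree on how loaded the candidates are.
	if disagrees(storepool, rebalancer, target, 0.5) || disagrees(storepool, rebalancer, source, 0.5) {
		fmt.Printf("skipping transfer r372 %d -> %d: load views disagree\n", source, target)
		return
	}
	fmt.Printf("transferring lease r372 %d -> %d\n", source, target)
}
```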
There is too much obviously useless rebalancing, which has greater cost than benefit.

Details
Here's a summary of replica rebalances performed by the store rebalancer during a rebalance pass. Note the duration between changes. The goal is QPS convergence; however, without also transferring the lease, the action isn't that useful, especially when the replica only has <10 QPS.

```bash
17:39:00 10 -> 9  qps=36.66 r=1101
17:39:14 10 -> 12 qps=36.53 r=527
17:39:27 10 -> 12 qps=36.44 r=1098
17:39:47 10 -> 12 qps=34.79 r=336
17:40:06 10 -> 9  qps=34.71 r=204
17:40:19 10 -> 9  qps=34.49 r=1089
17:40:37 10 -> 9  qps=7.64  r=440
# lease also transfers here from 13 -> 12
17:41:16 13 -> 10 qps=7.35  r=355
17:41:44 11 -> 10 qps=7.34  r=398
```

Lease transfers use the cluster domain (mean) for creating thresholds; as a result, stores were over-eager to transfer leases.
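For illustration, here is a minimal sketch of mean-based thresholding (the helper name and the band are assumptions, not CRDB's implementation): any store whose QPS is above a small band around the cluster mean is treated as a shed candidate, and because the mean is dragged down by the idle stores, every lease-preferred store looks overfull.

```go
package main

import "fmt"

// meanThresholds returns an acceptable QPS band around the cluster mean.
// band is a fractional tolerance, e.g. 0.1 for +/-10% of the mean.
func meanThresholds(qps map[int]float64, band float64) (mean, lo, hi float64) {
	for _, v := range qps {
		mean += v
	}
	mean /= float64(len(qps))
	return mean, mean * (1 - band), mean * (1 + band)
}

func main() {
	// Storepool QPS from the snippet above (n12's view).
	qps := map[int]float64{
		1: 10.67, 2: 0.07, 3: 3.98, 4: 6.00, 5: 16.66, 6: 7.72, 7: 6.02, 8: 6.11,
		9: 1730.95, 10: 3381.77, 11: 2588.82, 12: 2813.95, 13: 1252.24,
	}
	mean, lo, hi := meanThresholds(qps, 0.1)
	fmt.Printf("mean=%.2f band=[%.2f, %.2f]\n", mean, lo, hi)

	// Every store outside the band is flagged. Because the eight nearly idle
	// stores pull the mean down to ~909 QPS, all five lease-preferred stores
	// clear the threshold and will try to shed leases, even relative to each other.
	for s, v := range qps {
		if v > hi {
			fmt.Printf("s%d qps=%.2f above threshold: wants to shed leases\n", s, v)
		}
	}
}
```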
Too many actions occur in a short period relative to how confident we are that those actions would still be judged as good in the near future.
There's nothing stopping a store from shedding 128 leases during the rebalance loop, in a matter of a few seconds or less. This exacerbates the gossip capacity change issue mentioned below. Also, it doesn't seem like a good idea to be this aggressive unless our information is similarly accurate on the same timescale. There's a 5s period before a store will account for a replica's load, so other stores wouldn't be aware of the capacity change until after then. The source store would know, but its estimate was likely overwritten (see the capacity-change-triggered gossip mentions).
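As an illustration of one possible mitigation (a sketch only; the pacer, its budget, and the window are assumptions, not an existing CRDB mechanism), lease shedding could be paced so a store cannot shed more leases per window than its load information can plausibly keep up with:

```go
package main

import (
	"fmt"
	"time"
)

// transferPacer caps how many lease transfers a store may initiate per window.
// The window should be at least as long as it takes for load estimates to
// reflect earlier transfers (e.g. the ~5s replica accounting delay plus gossip).
type transferPacer struct {
	window      time.Duration
	budget      int
	windowStart time.Time
	used        int
}

func (p *transferPacer) allow(now time.Time) bool {
	if now.Sub(p.windowStart) >= p.window {
		p.windowStart, p.used = now, 0
	}
	if p.used >= p.budget {
		return false
	}
	p.used++
	return true
}

func main() {
	// Hypothetical numbers: at most 8 transfers per 10s window.
	p := &transferPacer{window: 10 * time.Second, budget: 8}
	now := time.Now()

	allowed, deferred := 0, 0
	for i := 0; i < 128; i++ { // the 128 leases the store wanted to shed at once
		if p.allow(now) {
			allowed++
		} else {
			deferred++
		}
	}
	fmt.Printf("allowed=%d deferred to later windows=%d\n", allowed, deferred)
}
```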
Related to the above point, lease- and replica-change-triggered gossip will overwrite QPS estimates, leading to a more consistent but incorrect view of the world across storepools in the cluster.
Details
The state of the world post rebalance loop on `n12` @ `17:39:11`:

```bash
avg ranges=224 leases=48 qps=909.61
----
 1: ranges=211 leases=8   qps=10.67   b4_qps=10.73
 2: ranges=211 leases=8   qps=0.07    b4_qps=0.07
 3: ranges=212 leases=9   qps=3.98    b4_qps=3.99
 4: ranges=228 leases=3   qps=6.00    b4_qps=6.00
 5: ranges=228 leases=4   qps=16.66   b4_qps=16.66
 6: ranges=228 leases=6   qps=7.72    b4_qps=7.73
 7: ranges=227 leases=5   qps=6.02    b4_qps=6.02
 8: ranges=229 leases=7   qps=6.11    b4_qps=6.11
 9: ranges=229 leases=112 qps=1730.95 b4_qps=1731.13 sr_qps=2511.04
10: ranges=218 leases=128 qps=3381.77 b4_qps=3465.22
11: ranges=231 leases=122 qps=2588.82 b4_qps=2588.82
12: ranges=232 leases=81  qps=2813.95 b4_qps=2813.65 sr_qps=1138.21*
13: ranges=236 leases=138 qps=1252.24 b4_qps=1250.63 sr_qps=2079.43
```

It seems likely that `n12` got store descriptor updates between `17:39:{06-11}`, which were prior to the impact of rebalancing being realized (>5s + gossip min). The update is a blind write and would overwrite any previous estimates. The stores gossip on lease/qps changes.

- The change source store knows it now has less load. This should be reflected locally.
- The change target store doesn't know it now has more load, just that someone has less.

What is odd is that the lease count changes are reflected, but not the QPS changes. When we eagerly gossip on range/lease count changes, we use a cached descriptor with the updated lease/range count applied. No other change would be applied. **This could lead to the inconsistent state between leases/qps** seen below. The effect is also symmetric: both the source and target stores would have any estimate overwritten with a QPS identical to before the transfer, but an updated lease count.

The state of the world on `n9` @ `17:39:11` for reference:

```bash
avg ranges=224 leases=48 qps=906.93
----
 1: ranges=211 leases=8   qps=10.67
 2: ranges=211 leases=8   qps=0.07
 3: ranges=212 leases=9   qps=3.98
 4: ranges=228 leases=3   qps=6.00
 5: ranges=228 leases=4   qps=16.66
 6: ranges=228 leases=6   qps=7.72
 7: ranges=227 leases=5   qps=6.02
 8: ranges=229 leases=7   qps=6.11
 9: ranges=229 leases=111 qps=1724.67
10: ranges=217 leases=128 qps=3381.77
11: ranges=231 leases=122 qps=2560.35
12: ranges=232 leases=81  qps=2813.95
13: ranges=237 leases=138 qps=1252.24
```

Note how it is fairly consistent with `n12`'s storepool state but not with the store rebalancer on `n12`.
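To sketch the failure mode and one possible direction (illustrative Go, not CRDB's gossip or storepool code; the types, numbers, and the delta-reapplication idea are assumptions), layering locally tracked deltas on top of an incoming gossiped capacity keeps the local estimate from being lost to a blind write:

```go
package main

import "fmt"

// capacity is a trimmed-down stand-in for a store's gossiped capacity.
type capacity struct {
	leases int
	qps    float64
}

// storepoolEntry tracks the last gossiped capacity plus a local estimate of
// QPS changes this node expects from its own recent lease/replica movements.
type storepoolEntry struct {
	gossiped capacity
	qpsDelta float64 // local, not-yet-gossiped estimate
}

// applyGossip merges an incoming gossiped capacity. The point of the sketch:
// keep the local delta layered on top instead of letting the blind write erase
// it. A real implementation would also need to decay or clear the delta once
// gossip catches up; that bookkeeping is elided here.
func (e *storepoolEntry) applyGossip(c capacity) {
	e.gossiped = c
}

func (e *storepoolEntry) view() capacity {
	return capacity{leases: e.gossiped.leases, qps: e.gossiped.qps + e.qpsDelta}
}

func main() {
	// Hypothetical: n12's entry for its own store after shedding a hot lease.
	// It expects its QPS to drop by ~1675, though gossip hasn't observed that yet.
	e := &storepoolEntry{
		gossiped: capacity{leases: 82, qps: 2813.95},
		qpsDelta: -1675.74,
	}
	fmt.Printf("before gossip: %+v\n", e.view())

	// A lease-count-triggered gossip arrives, built from a cached descriptor:
	// the lease count moved, but the QPS is still the stale pre-transfer value.
	e.applyGossip(capacity{leases: 81, qps: 2813.95})

	// The expected QPS drop survives the update rather than being overwritten.
	fmt.Printf("after gossip:  %+v\n", e.view())
}
```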
There's no concept of a cluster-level, per-store ideal number based on constraints and availability.
That is, synthesize all of the constraints, localities (including diversity), and load into a target number for each store. The number represents what the load on that store should be if the placement were optimal.
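A minimal sketch of what such a target could look like (the constraint model, the even split, and the non-`us-east-1` localities are assumptions for illustration only): distribute the total lease QPS over the stores that satisfy the lease preference, and give every other store a lease-load target of zero.

```go
package main

import "fmt"

type store struct {
	id     int
	region string
}

// leaseTargets computes a toy per-store "perfect" lease QPS: total lease load
// split evenly across stores satisfying the lease preference. A real version
// would fold in replica constraints, diversity, and per-store capacity.
func leaseTargets(stores []store, totalQPS float64, preferredRegion string) map[int]float64 {
	var eligible []int
	targets := make(map[int]float64, len(stores))
	for _, s := range stores {
		targets[s.id] = 0
		if s.region == preferredRegion {
			eligible = append(eligible, s.id)
		}
	}
	for _, id := range eligible {
		targets[id] = totalQPS / float64(len(eligible))
	}
	return targets
}

func main() {
	// Localities of stores 1-8 aren't given in the issue; grouped as "other" here.
	stores := []store{
		{9, "us-east-1"}, {10, "us-east-1"}, {11, "us-east-1"}, {12, "us-east-1"}, {13, "us-east-1"},
		{1, "other"}, {2, "other"}, {3, "other"}, {4, "other"},
		{5, "other"}, {6, "other"}, {7, "other"}, {8, "other"},
	}
	// ~11,825 total QPS in the snippets above; with the lease preference on
	// us-east-1, each preferred store should ideally carry ~2,365 QPS.
	for id, t := range leaseTargets(stores, 11825, "us-east-1") {
		if t > 0 {
			fmt.Printf("s%d target lease qps=%.0f\n", id, t)
		}
	}
}
```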
Observability
It is tedious to reconstruct a node's view of the world and the recent "impact" it expects from its own changes.
There's no easy way to observe thrashing or stability.
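One possible starting point (a sketch only; the metric and window are invented, not an existing CRDB metric): count lease transfers that are undone within a short window, since a high reversal rate is a direct signal of thrashing.

```go
package main

import (
	"fmt"
	"time"
)

type transfer struct {
	rangeID  int
	from, to int
	at       time.Time
}

// reversals counts transfers that undo an earlier transfer of the same range
// (B -> A shortly after A -> B) within the given window, assuming history is
// ordered by time. A rising count with no change in the steady-state
// distribution is a reasonable proxy for thrashing.
func reversals(history []transfer, window time.Duration) int {
	count := 0
	for i, t := range history {
		for j := i + 1; j < len(history); j++ {
			u := history[j]
			if u.at.Sub(t.at) > window {
				break
			}
			if u.rangeID == t.rangeID && u.from == t.to && u.to == t.from {
				count++
			}
		}
	}
	return count
}

func main() {
	base := time.Now()
	history := []transfer{
		{rangeID: 372, from: 12, to: 9, at: base},
		{rangeID: 372, from: 9, to: 12, at: base.Add(30 * time.Second)}, // undone shortly after
		{rangeID: 440, from: 13, to: 12, at: base.Add(40 * time.Second)},
	}
	fmt.Printf("lease transfer reversals in 5m window: %d\n", reversals(history, 5*time.Minute))
}
```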
Jira issue: CRDB-28445