cockroachdb / cockroach

CockroachDB — the cloud native, distributed SQL database designed for high availability, effortless scale, and control over data placement.
https://www.cockroachlabs.com

kvserver: replicate queue should be more responsive #106101

Open erikgrinaker opened 1 year ago

erikgrinaker commented 1 year ago

The replicate queue often takes a very long time to correct problems. For example, as seen in #106100, if a lease is picked up outside of the lease preferences, it can take many minutes before the problem is corrected. This tends to be the case for most policies enforced by the replicate queue.

We should make the queue more responsive. A few random ideas:

Jira issue: CRDB-29400

andrewbaptist commented 1 year ago

A few things to note here. The replicate queue could run a lot faster except for a few things:

1) The replicate queue currently runs over both leases and replicas. This is not really necessary, as the checks and handling are quite different. Replicas are expensive to move, so taking minutes to scan over them is usually fine, since the majority of the time is spent on the transfer, not on finding what to move.
2) Leases are generally moved for three reasons: (a) load, (b) imbalance, (c) preference violation. The load case is already quick and the imbalance case does not need to be fast, but the preference violation case is slow (since it uses the imbalance mechanism). As mentioned in the first point, if we had a separate mechanism to handle imbalance vs constraint violation, the constraint violation check could be triggered against every range as soon as there is any constraint change.
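To make the "separate mechanism" idea concrete, here is a minimal Python sketch of a fast path that handles only preference violations and is triggered by a constraint change rather than by the periodic scan. This is not CockroachDB code: `Range`, `violates_lease_preference`, `PreferenceViolationQueue`, and the region names are all hypothetical, and the lease "transfer" is just a field update.

```python
from collections import deque
from dataclasses import dataclass


@dataclass
class Range:
    # Hypothetical, simplified stand-in for a range and its lease state.
    range_id: int
    leaseholder_region: str
    preferred_regions: list  # ordered lease preferences; empty means none


def violates_lease_preference(r: Range) -> bool:
    # A range with no preferences can never be in violation.
    return bool(r.preferred_regions) and r.leaseholder_region not in r.preferred_regions


class PreferenceViolationQueue:
    """Hypothetical fast path, separate from the slow imbalance scan."""

    def __init__(self):
        self._pending = deque()
        self._queued = set()  # range IDs already pending, to avoid duplicates

    def on_constraint_change(self, ranges):
        # Triggered as soon as a constraint changes: check every range
        # immediately instead of waiting for the periodic scanner.
        for r in ranges:
            if violates_lease_preference(r) and r.range_id not in self._queued:
                self._queued.add(r.range_id)
                self._pending.append(r)

    def process(self):
        # Drain the queue and return the IDs of ranges whose lease moved.
        transferred = []
        while self._pending:
            r = self._pending.popleft()
            self._queued.discard(r.range_id)
            # Re-check, since the lease may have moved since enqueueing.
            if violates_lease_preference(r):
                r.leaseholder_region = r.preferred_regions[0]
                transferred.append(r.range_id)
        return transferred


ranges = [
    Range(1, "us-east", ["us-west"]),  # violates its preference
    Range(2, "us-west", ["us-west"]),  # already satisfied
    Range(3, "eu-west", []),           # no preference configured
]
queue = PreferenceViolationQueue()
queue.on_constraint_change(ranges)
print(queue.process())  # [1]
```

The point of the sketch is the split: the violation check is cheap and event-driven, so it does not need to share a scan cadence with the expensive replica rebalancing work.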