cockroachdb / cockroach

CockroachDB — the cloud native, distributed SQL database designed for high availability, effortless scale, and control over data placement.
https://www.cockroachlabs.com

kv: permit merging of ranges without checking size or load conditions #106201

Open udnay opened 1 year ago

udnay commented 1 year ago

During a recent incident in CC, a multi-region host cluster had three crash-looping nodes due to a bad range split.

The fix was to patch the bad nodes with a custom binary that prevented them from crashing on startup, so that our tooling could merge the range back together. This issue asks the KV team to explore how some of our LoQ tools could be used offline, for cases where a node cannot be restarted because of the issue being investigated.

cc @arulajmani

Jira issue: CRDB-29445

nvanbenschoten commented 1 year ago

I'm going to co-opt this issue and take it in a slightly different direction that is more generally applicable to the support issues we've seen in the past year. We occasionally want to merge away a range manually, without regard for the load or size of the merged range. mergeQueue.process has a number of checks (see "skipping merge") which are performed even if a manual enqueue passes "skipShouldQueue". It would be helpful to have a way to bypass these process-time checks.

kvoli commented 1 year ago

This would also have been valuable in merging away bad split points in this CC incident.

The ranges in this case were too large to merge without creating splits on either side of the bad split point. It could be desirable to force a merge regardless, and have size-based splitting immediately find a new split point (which is safe).
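
A rough Go sketch of that sequence follows: merge unconditionally, then let a size-based pass pick a fresh split key. The kv type, forceMerge, and findSplitKey are hypothetical simplifications, and choosing the first key at which the left side holds at least half the bytes is an assumed stand-in for how a real split key would be selected.

```go
package main

import "fmt"

// kv is a hypothetical stand-in for a key and the bytes stored under it.
type kv struct {
	key  string
	size int64
}

// forceMerge concatenates two adjacent ranges with no size or load
// checks, which removes the split point between them.
func forceMerge(lhs, rhs []kv) []kv {
	merged := make([]kv, 0, len(lhs)+len(rhs))
	merged = append(merged, lhs...)
	merged = append(merged, rhs...)
	return merged
}

// findSplitKey returns the first key at which the left side already
// holds at least half of the total bytes. It reports false when the
// range cannot be split (fewer than two keys).
func findSplitKey(data []kv) (string, bool) {
	var total int64
	for _, e := range data {
		total += e.size
	}
	var left int64
	for i, e := range data {
		if i > 0 && left >= total/2 {
			return e.key, true
		}
		left += e.size
	}
	return "", false
}

func main() {
	// The bad split point sits at key "c".
	lhs := []kv{{"a", 100}, {"b", 100}}
	rhs := []kv{{"c", 300}, {"d", 300}}

	merged := forceMerge(lhs, rhs)
	if key, ok := findSplitKey(merged); ok {
		fmt.Println("new split key:", key)
	}
}
```

In this toy data the size-based pass picks "d" rather than the original "c", so the bad split point is gone even though the merged range is split again right away.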