Open udnay opened 1 year ago
I'm going to co-opt this issue and take it in a slightly different direction that is more generally applicable to the support issues we've seen in the past year. We occasionally want to merge away a range manually, without concern for load or size on the joint range. `mergeQueue.process` has a number of checks (see "skipping merge") which are performed even if a manual enqueue passes "skipShouldQueue". It would be helpful to have a way to bypass these process-time checks.
This would also have been valuable in merging away bad split points in this CC incident.
The ranges in this case were too large to merge without first creating splits on either side of the bad split point. It could be desirable to force the merge regardless, and have size-based splitting immediately find a new split point (which is safe).
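
To make the request concrete, here is a minimal sketch of the bypass in Go. It is not the actual `mergeQueue` code: `rangeStats`, `mergeOpts`, `force`, `shouldSkipMerge`, `needsSizeSplit`, and both thresholds are hypothetical names and values chosen for illustration. It only shows the shape of the idea: a forced merge skips the process-time size and load checks, and an oversized merged range is then left to size-based splitting.

```go
package main

import "fmt"

// rangeStats is a hypothetical stand-in for the per-range facts that a
// merge decision looks at. It is not a CockroachDB type.
type rangeStats struct {
	sizeBytes int64
	qps       float64
}

// mergeOpts carries a hypothetical "force" flag for manual enqueues.
type mergeOpts struct {
	force bool
}

// Illustrative thresholds only; these are not CockroachDB defaults.
const (
	maxRangeBytes = 512 << 20
	maxMergedQPS  = 2500
)

// shouldSkipMerge models the process-time checks. With force set, the
// size and load checks are bypassed entirely.
func shouldSkipMerge(lhs, rhs rangeStats, opts mergeOpts) (bool, string) {
	if opts.force {
		return false, ""
	}
	if merged := lhs.sizeBytes + rhs.sizeBytes; merged > maxRangeBytes {
		return true, fmt.Sprintf("skipping merge: merged size %d exceeds %d", merged, maxRangeBytes)
	}
	if lhs.qps+rhs.qps > maxMergedQPS {
		return true, "skipping merge: combined load too high"
	}
	return false, ""
}

// needsSizeSplit reports whether size-based splitting would pick up the
// merged range afterwards and choose a fresh split point.
func needsSizeSplit(merged rangeStats) bool {
	return merged.sizeBytes > maxRangeBytes
}
```

With a 400 MiB and a 300 MiB range, `shouldSkipMerge` skips the merge by default but allows it when `force` is set, and `needsSizeSplit` then reports that the merged range is a candidate for a new size-based split.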
During a recent incident in CC, a multi-region host cluster had a set of 3 crash-looping nodes due to a bad range split.
The fix was to patch the bad nodes with a custom binary that prevented them from crashing on startup, so that our tooling could merge the range back together. This issue is intended to have the KV team explore how to use some of our LoQ tools in an offline capacity, for cases where we are unable to restart a node because of the issue being investigated.
cc @arulajmani
Jira issue: CRDB-29445