cockroachdb / cockroach

CockroachDB — the cloud native, distributed SQL database designed for high availability, effortless scale, and control over data placement.
https://www.cockroachlabs.com

kv: ability to gracefully recover from large intent buildup #135934

Open andrewbaptist opened 14 hours ago

andrewbaptist commented 14 hours ago

Is your feature request related to a problem? Please describe.

In a customer case, we saw almost 10 billion intents created across multiple ranges over a 5-hour window by a single INSERT INTO ... SELECT FROM statement that was ultimately killed. The MVCC GC queue ended up causing severe LSM inversion, which in turn caused a 30+ minute outage on the cluster. We need the ability to recover gracefully when we detect this type of problem.

Describe the solution you'd like

There are two high level approaches to gracefully recovering:

1) Long term - Admission control automatically pacing the rate of intent resolution to prevent IO overload
2) Short term - Some knob to manually rate limit intent resolution

Currently, the mvccGCQueue calls CleanupTxnIntentsOnGCAsync, which ends up calling cleanupFinishedTxnIntents.

See #97108 for more details on the current status of AC and intent resolution.

Jira issue: CRDB-44794