kvserver: pacing/admission control for mvcc gc

irfansharif commented 2 years ago

Is your feature request related to a problem? Please describe.

We've seen large backlogs of GC-able keys build up when installing long-lived PTS records. When released the MVCC GC queues go full throttle GC-ing as aggressively as possible, which we've seen to be disruptive to foreground traffic.

Describe the solution you'd like

[x] Issue MVCC GC requests with the low priority bit set for admission control to deprioritize when nodes are pushed to saturation (https://github.com/cockroachdb/cockroach/pull/85823);
[x] Introduce a cluster setting to pace the amount of MVCC GC queue processing (https://github.com/cockroachdb/cockroach/pull/85823/commits/f00b130f51f7adc21f52c347b083ed66a40db1d8)
[ ] Reproduce + roachtest-ify lack of performance isolation (latency) as a result of a large volume MVCC GC work. Hint: try creating a database with a large dataset (TPC-C with N warehouses), set a low default GC TTL, drop the DB, and let things rip. Failed attempts to create a small roachtest.
- [ ] ~~Evaluate CPU overhead and latency impact of subsequent merge queue activity (including generating/applying snapshots in preparation of the range merge, stats computation)~~
- [ ] ~~Evaluate overhead of secondary compactions when replicas are removed due to merge-queue related rebalancing~~
- [ ] ~~Note down other effects of post-MVCC activity~~
[ ] Integrate MVCC GC work with elastic CPU tokens introduced in https://github.com/cockroachdb/cockroach/pull/86638.
[ ] ~~(optional) Knob to disable MVCC GC queue cluster wide (poor man's pacing).~~
[ ] ~~(optional, very low priority, probably just a bad idea) Pacing knob (bytes/sec/store) if we have a large backlog of work pent up from super old protection records;~~

Aside: #84598 is tangentially related, motivated by the same incidents that motivated this issue. It considers how long garbage can be accumulated in the face of transient job failures. This issue covers the pacing needed independent of the amount of garbage that needs cleaning. That said, a short default retention window is good for many reasons, including the secondary effects of MVCC GC as mentioned above. Pacing all secondary effects is a larger endeavour (snapshots alone are covered in https://github.com/cockroachdb/cockroach/issues/80607) and out of scope for this issue.

Jira issue: CRDB-16754

irfansharif commented 2 years ago

If there's latency impact during bursts of GC activity, in addition to admission controlling these requests (and/or manual control knob driven pacing), https://github.com/cockroachdb/cockroach/issues/55293 will help minimize the latency impact on foreground traffic due to the latches we currently hold during GC.

mikeczabator commented 2 years ago

(had to remove previous post to use correct user)

mikeczabator commented 2 years ago

This image and above are from 50+ days of stuck PTS being removed, thus triggering a significant amount of GC activity.

sumeerbhola commented 2 years ago

@mikeczabator is there a ticket associated with the previous graphs? I am curious whether this incident showed any admission control queueing (assuming v21.2+), and whether the store was overloaded (read amp), and whether the provisioned disk bandwidth was saturated.

irfansharif commented 2 years ago

I've lost the original thread where this issue was discussed, but during this large volume of MVCC GC since a lot of ranges got smaller in size, there was a large build up of merge queue work and subsequent activity. It's possible that the effects observed above were as a result of the merges themselves (what exactly, I'm not sure -- the snapshots, non-MVCC latch acquisition, frozen RHS). We should try to repro that independently; for this issue we should first make sure that large volumes of MVCC GC work is in fact disruptive and benefiting of admission control.

irfansharif commented 2 years ago

This is an internal incident where we observed the same effects.

sumeerbhola commented 2 years ago

Pacing knob (bytes/sec/store) if we have a large backlog of work pent up from super old protection records

I suggest this should be a last resort, if we find that setting a lower admission control (AC) priority and other AC improvements do not suffice.

irfansharif commented 2 years ago

Yup, just wanted to list there for posterity. A first step here for anything is developing a reproduction for scenarios where we suddenly have a large build up of MVCC garbage, by perhaps installing a protected timestamp record and running a form of outbox workload along side, and then dropping the record.

irfansharif commented 1 year ago

@aadityasondhi is going to try and take a stab during fridays. Assigning for now.

sumeerbhola commented 1 year ago

@bananabrick what is the state of this issue? Do we have a simple enough reproduction?

bananabrick commented 1 year ago

We didn't really approach this issue based on what's written in the issue description. Instead, we observed two modes of problems due to GC.

The first one was that GC could increase the size compensated scores of levels in the lsm, which would lead to L0 compactions being starved out. That problem is fixed in https://github.com/cockroachdb/cockroach/issues/104862.

The second problem is due to GC writing too much to the LSM and causing overload. We don't have a fast reproduction for that. This should help: https://github.com/cockroachdb/cockroach/commit/08cc9b17ce0a5a61745f1880fe7d627704344875. Doesn't entirely fix the problem, but the customer we saw run into this issue also had the capacity to increase compaction concurrency. For now, I vote that we table the second problem and use the compaction concurrency as a knob to alleviate the problem if it occurs in the future.

We don't have evidence of GC issues due to CPU utilization.

sumeerbhola commented 1 year ago

Declaring this done wrt the write impact of GC.
For the scan part, we are keeping the list item in the original description to integrate with elastic CPU AC , if we observe this to be a problem. This will require a change to the iteration in processReplicatedKeyRange, since the scan does not use a BatchRequest and directly iterates using a storage.Snapshot.

cockroachdb / cockroach

kvserver: pacing/admission control for mvcc gc #82955