irfansharif opened this issue 2 years ago
If there's a latency impact during bursts of GC activity, then in addition to admission-controlling these requests (and/or manual, knob-driven pacing), https://github.com/cockroachdb/cockroach/issues/55293 will help minimize the latency impact on foreground traffic from the latches we currently hold during GC.
(had to remove previous post to use correct user)
This image and the ones above are from a case where a PTS that had been stuck for 50+ days was removed, triggering a significant amount of GC activity.
@mikeczabator is there a ticket associated with the previous graphs? I am curious whether this incident showed any admission control queueing (assuming v21.2+), whether the store was overloaded (read amp), and whether the provisioned disk bandwidth was saturated.
I've lost the original thread where this issue was discussed, but during this large volume of MVCC GC, since a lot of ranges got smaller, there was a large build-up of merge queue work and subsequent activity. It's possible that the effects observed above were a result of the merges themselves (what exactly, I'm not sure -- the snapshots, non-MVCC latch acquisition, the frozen RHS). We should try to repro that independently; for this issue we should first make sure that large volumes of MVCC GC work are in fact disruptive and would benefit from admission control.
This is an internal incident where we observed the same effects.
Pacing knob (bytes/sec/store) if we have a large backlog of work pent up from super old protection records
I suggest this should be a last resort, if we find that setting a lower admission control (AC) priority and other AC improvements do not suffice.
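For reference, if we did end up needing such a knob, the mechanism itself is small. Here's a minimal sketch of byte-rate pacing using a token bucket; the setting and the call sites are hypothetical, this only illustrates the shape of the knob, not anything in the actual GC queue code:

```go
package gcpacer

import (
	"context"

	"golang.org/x/time/rate"
)

// gcPacer throttles MVCC GC work to a configurable number of bytes per
// second per store. Hypothetical sketch; not CockroachDB's actual GC code.
type gcPacer struct {
	limiter *rate.Limiter
}

// newGCPacer allows bursts of up to one second's worth of work; batches
// larger than that would need to be split before calling pace.
func newGCPacer(bytesPerSec int) *gcPacer {
	return &gcPacer{limiter: rate.NewLimiter(rate.Limit(bytesPerSec), bytesPerSec)}
}

// pace blocks until roughly batchBytes worth of GC work may proceed.
func (p *gcPacer) pace(ctx context.Context, batchBytes int) error {
	return p.limiter.WaitN(ctx, batchBytes)
}
```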
Yup, just wanted to list it there for posterity. A first step here for anything is developing a reproduction for scenarios where we suddenly have a large build-up of MVCC garbage, perhaps by installing a protected timestamp record, running a form of outbox workload alongside it, and then dropping the record.
@aadityasondhi is going to try and take a stab at this during Fridays. Assigning for now.
@bananabrick what is the state of this issue? Do we have a simple enough reproduction?
We didn't really approach this issue based on what's written in the issue description. Instead, we observed two modes of problems due to GC.
The first problem was that GC could increase the size-compensated scores of levels in the LSM, which would lead to L0 compactions being starved out. That problem is fixed in https://github.com/cockroachdb/cockroach/issues/104862.
The second problem is GC writing too much to the LSM and causing overload. We don't have a fast reproduction for that. This should help: https://github.com/cockroachdb/cockroach/commit/08cc9b17ce0a5a61745f1880fe7d627704344875. It doesn't entirely fix the problem, but the customer we saw run into this also had the capacity to increase compaction concurrency. For now, I vote that we table the second problem and use compaction concurrency as a knob to alleviate it if it occurs again in the future.
We don't have evidence of GC issues due to CPU utilization.
CPU admission control would also be harder to apply in `processReplicatedKeyRange`, since the scan does not use a `BatchRequest` and directly iterates using a `storage.Snapshot`.
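To illustrate that point, here's a hypothetical sketch (none of these types are the real CockroachDB ones) of what hooking admission directly into the snapshot scan could look like, pausing every few keys to account for the CPU consumed:

```go
package gcadmission

import "context"

// admissionPacer stands in for whatever CPU admission interface MVCC GC
// would integrate with; it is hypothetical, as is keyIterator below.
type admissionPacer interface {
	// Pace blocks if the work done so far exceeds the granted CPU slice.
	Pace(ctx context.Context) error
}

// keyIterator is a stand-in for iterating a storage snapshot directly.
type keyIterator interface {
	Valid() bool
	Next()
}

// scanForGarbage shows the shape of the integration: since the GC scan
// iterates a snapshot directly rather than issuing a BatchRequest, the
// admission check has to live inside the loop itself.
func scanForGarbage(ctx context.Context, it keyIterator, pacer admissionPacer) error {
	const keysPerAdmissionCheck = 128
	n := 0
	for ; it.Valid(); it.Next() {
		// ... inspect the key/value and queue up GC-able versions ...
		n++
		if n%keysPerAdmissionCheck == 0 {
			if err := pacer.Pace(ctx); err != nil {
				return err
			}
		}
	}
	return nil
}
```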
Is your feature request related to a problem? Please describe.
We've seen large backlogs of GC-able keys build up when installing long-lived PTS records. When such a record is released, the MVCC GC queues go full throttle, GC-ing as aggressively as possible, which we've seen be disruptive to foreground traffic.
Describe the solution you'd like
- Reproduce + roachtest-ify the lack of performance isolation (latency) as a result of a large volume of MVCC GC work. Hint: try creating a database with a large dataset (TPC-C with N warehouses), set a low default GC TTL, drop the DB, and let things rip (a minimal sketch of this variant follows below the list).
  - Failed attempts to create a small roachtest.
- Evaluate the CPU overhead and latency impact of subsequent merge queue activity (including generating/applying snapshots in preparation for the range merge, and stats computation).
- Evaluate the overhead of secondary compactions when replicas are removed due to merge-queue-related rebalancing.
- Note down other effects of post-MVCC GC activity.
- (optional) Knob to disable the MVCC GC queue cluster-wide (poor man's pacing).
- (optional, very low priority, probably just a bad idea) Pacing knob (bytes/sec/store) if we have a large backlog of work pent up from super old protection records.

Aside: #84598 is tangentially related, motivated by the same incidents that motivated this issue. It considers how long garbage can accumulate in the face of transient job failures; this issue covers the pacing needed independent of the amount of garbage that needs cleaning. That said, a short default retention window is good for many reasons, including the secondary effects of MVCC GC mentioned above. Pacing all secondary effects is a larger endeavour (snapshots alone are covered in https://github.com/cockroachdb/cockroach/issues/80607) and out of scope for this issue.
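As a rough sketch of the first item's hint (the connection string, database name, and TTL value are placeholders, and the dataset would be loaded beforehand with `cockroach workload init tpcc --warehouses=N`):

```go
package main

import (
	"database/sql"
	"log"

	_ "github.com/lib/pq" // Postgres driver; CockroachDB speaks pgwire.
)

func main() {
	// Placeholder connection string; point this at the test cluster.
	db, err := sql.Open("postgres", "postgresql://root@localhost:26257/defaultdb?sslmode=disable")
	if err != nil {
		log.Fatal(err)
	}
	defer db.Close()

	stmts := []string{
		// Low GC TTL so the dropped data becomes GC-able quickly.
		"ALTER DATABASE tpcc CONFIGURE ZONE USING gc.ttlseconds = 600",
		// Dropping the database creates the large backlog of MVCC garbage.
		"DROP DATABASE tpcc CASCADE",
	}
	for _, stmt := range stmts {
		if _, err := db.Exec(stmt); err != nil {
			log.Fatalf("%s: %v", stmt, err)
		}
	}
	// From here, watch foreground latency while the MVCC GC queue churns.
}
```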
Jira issue: CRDB-16754