kvserver: pacing/admission control for mvcc gc #82955

Open · irfansharif opened this issue 2 years ago

irfansharif commented 2 years ago

Is your feature request related to a problem? Please describe.

We've seen large backlogs of GC-able keys build up when long-lived PTS (protected timestamp) records are installed. When such a record is released, the MVCC GC queue goes full throttle, GC-ing as aggressively as possible, which we've seen be disruptive to foreground traffic.

Describe the solution you'd like

Aside: #84598 is tangentially related, motivated by the same incidents that motivated this issue. It considers how long garbage can accumulate in the face of transient job failures. This issue covers the pacing needed independent of the amount of garbage that needs cleaning. That said, a short default retention window is good for many reasons, including limiting the secondary effects of MVCC GC mentioned above. Pacing all secondary effects is a larger endeavour (snapshots alone are covered in https://github.com/cockroachdb/cockroach/issues/80607) and out of scope for this issue.
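For illustration, here is a rough Go sketch of what routing MVCC GC work through an admission-control-style gate at a low priority could look like. The `AdmissionQueue` interface, the priority constant, and the batch type below are hypothetical stand-ins, not the actual `pkg/util/admission` API:

```go
// Hypothetical sketch: gate MVCC GC work behind an admission-control-style
// queue at a low (background) priority, so a released backlog doesn't run at
// full throttle against foreground traffic.
package main

import (
	"context"
	"fmt"
	"time"
)

// AdmissionQueue is an illustrative stand-in for an admission control work
// queue; it is not the real pkg/util/admission interface.
type AdmissionQueue interface {
	// Admit blocks until the store has spare capacity for work at the given
	// priority, or the context is cancelled.
	Admit(ctx context.Context, priority int) error
}

const priorityBackground = -10 // lower than normal foreground priorities

// gcBatch is a placeholder for one batch of GC-able keys on a replica.
type gcBatch struct{ keys int }

// runGC drains batches of garbage, asking for admission before each batch.
func runGC(ctx context.Context, q AdmissionQueue, batches []gcBatch) error {
	for _, b := range batches {
		if err := q.Admit(ctx, priorityBackground); err != nil {
			return err
		}
		// Issue the actual GC request for this batch here.
		fmt.Printf("GC'd batch of %d keys\n", b.keys)
	}
	return nil
}

// noopQueue admits everything after a fixed delay; only for this example.
type noopQueue struct{ delay time.Duration }

func (n noopQueue) Admit(ctx context.Context, _ int) error {
	select {
	case <-time.After(n.delay):
		return nil
	case <-ctx.Done():
		return ctx.Err()
	}
}

func main() {
	_ = runGC(context.Background(), noopQueue{delay: 10 * time.Millisecond},
		[]gcBatch{{keys: 1000}, {keys: 2000}})
}
```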

Jira issue: CRDB-16754

irfansharif commented 2 years ago

If there's a latency impact during bursts of GC activity, then in addition to admission-controlling these requests (and/or pacing driven by a manual control knob), https://github.com/cockroachdb/cockroach/issues/55293 will help minimize the latency impact on foreground traffic from the latches we currently hold during GC.

mikeczabator commented 2 years ago

(had to remove previous post to use correct user)

[graph]

[graph]

mikeczabator commented 2 years ago

This graph and the ones above are from 50+ days' worth of stuck PTS records being removed, which triggered a significant amount of GC activity.

[graph]

sumeerbhola commented 2 years ago

@mikeczabator is there a ticket associated with the previous graphs? I am curious whether this incident showed any admission control queueing (assuming v21.2+), whether the store was overloaded (high read amplification), and whether the provisioned disk bandwidth was saturated.

irfansharif commented 2 years ago

I've lost the original thread where this issue was discussed, but during this large volume of MVCC GC, since a lot of ranges got smaller, there was a large build-up of merge queue work and subsequent activity. It's possible that the effects observed above were a result of the merges themselves (what exactly, I'm not sure -- the snapshots, non-MVCC latch acquisition, the frozen RHS). We should try to repro that independently; for this issue we should first make sure that large volumes of MVCC GC work are in fact disruptive and would benefit from admission control.

irfansharif commented 2 years ago

This is an internal incident where we observed the same effects.

sumeerbhola commented 2 years ago

> Pacing knob (bytes/sec/store) if we have a large backlog of work pent up from super old protection records

I suggest this should be a last resort, if we find that setting a lower admission control (AC) priority and other AC improvements do not suffice.
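For posterity, a minimal sketch (assuming we do end up needing it) of what such a last-resort per-store pacing knob could look like: a token bucket over GC'd bytes built on golang.org/x/time/rate. The `gcBytesPerSec` name and the 16 MiB/s value are made up for illustration; no such cluster setting exists today:

```go
// Minimal sketch of a last-resort pacing knob: a per-store token bucket over
// GC'd bytes. The gcBytesPerSec name and 16 MiB/s default are hypothetical.
package main

import (
	"context"
	"fmt"

	"golang.org/x/time/rate"
)

const gcBytesPerSec = 16 << 20 // hypothetical default: 16 MiB/sec per store

// gcPacer throttles GC work on one store to roughly gcBytesPerSec.
type gcPacer struct {
	limiter *rate.Limiter
}

func newGCPacer() *gcPacer {
	// Burst equals one second's worth of tokens; individual batches must not
	// exceed it.
	return &gcPacer{limiter: rate.NewLimiter(rate.Limit(gcBytesPerSec), gcBytesPerSec)}
}

// pace blocks until `bytes` of GC work is allowed to proceed.
func (p *gcPacer) pace(ctx context.Context, bytes int) error {
	return p.limiter.WaitN(ctx, bytes)
}

func main() {
	p := newGCPacer()
	// Ten 4 MiB batches: the first few consume the initial burst, the rest
	// are paced at 16 MiB/s, so this finishes in roughly 1.5 seconds.
	for i := 0; i < 10; i++ {
		if err := p.pace(context.Background(), 4<<20); err != nil {
			panic(err)
		}
	}
	fmt.Println("paced 40 MiB of GC work")
}
```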

irfansharif commented 2 years ago

Yup, just wanted to list it there for posterity. A first step here, for anything, is developing a reproduction of scenarios where we suddenly have a large build-up of MVCC garbage: perhaps by installing a protected timestamp record, running a form of outbox workload alongside it, and then dropping the record. A rough sketch of such a reproduction follows below.
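Since installing a real PTS record requires a backup/changefeed (or internal APIs), the sketch below approximates it with a long `gc.ttlseconds` that is later lowered, which similarly defers and then suddenly releases a GC backlog. The DSN, table name, row counts, and TTLs are placeholders:

```go
// Sketch of a reproduction against a local cluster: accumulate MVCC garbage
// under a long gc.ttlseconds (standing in for a long-lived PTS record), then
// lower the TTL to release the backlog all at once.
package main

import (
	"database/sql"
	"fmt"
	"log"

	_ "github.com/lib/pq"
)

func main() {
	db, err := sql.Open("postgres",
		"postgresql://root@localhost:26257/defaultdb?sslmode=disable")
	if err != nil {
		log.Fatal(err)
	}
	defer db.Close()

	mustExec := func(stmt string, args ...interface{}) {
		if _, err := db.Exec(stmt, args...); err != nil {
			log.Fatal(err)
		}
	}

	mustExec(`CREATE TABLE IF NOT EXISTS outbox (
		id UUID DEFAULT gen_random_uuid() PRIMARY KEY,
		payload STRING)`)

	// Stand-in for a long-lived protection record: defer GC for a day so
	// MVCC garbage accumulates but cannot be collected.
	mustExec(`ALTER TABLE outbox CONFIGURE ZONE USING gc.ttlseconds = 86400`)

	// Outbox-style churn: every row is inserted and promptly deleted,
	// leaving behind MVCC garbage.
	for i := 0; i < 100000; i++ {
		var id string
		if err := db.QueryRow(
			`INSERT INTO outbox (payload) VALUES (repeat('x', 1024)) RETURNING id`,
		).Scan(&id); err != nil {
			log.Fatal(err)
		}
		mustExec(`DELETE FROM outbox WHERE id = $1`, id)
	}

	// "Drop the record": shorten the TTL so the accumulated backlog becomes
	// eligible for MVCC GC all at once, then watch foreground latency.
	mustExec(`ALTER TABLE outbox CONFIGURE ZONE USING gc.ttlseconds = 600`)
	fmt.Println("backlog released; MVCC GC should now kick in")
}
```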

irfansharif commented 1 year ago

@aadityasondhi is going to try to take a stab at this during Fridays. Assigning for now.

sumeerbhola commented 1 year ago

@bananabrick what is the state of this issue? Do we have a simple enough reproduction?

bananabrick commented 1 year ago

We didn't really approach this issue based on what's written in the issue description. Instead, we observed two modes of problems due to GC.

The first one was that GC could increase the size-compensated scores of levels in the LSM, which would lead to L0 compactions being starved out. That problem is fixed in https://github.com/cockroachdb/cockroach/issues/104862.
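To make that failure mode concrete, here is a toy score calculation (not Pebble's actual heuristic) showing how compensating a level's size for the garbage its deletions will reclaim can push its score above L0's, so L0 compactions lose out:

```go
// Toy illustration (not Pebble's actual scoring code) of how size
// compensation for garbage-dropping deletions can inflate a lower level's
// compaction score past L0's.
package main

import "fmt"

// levelScore divides a level's compensated size (live bytes plus an estimate
// of the bytes its tombstones would reclaim) by the level's target size.
func levelScore(liveBytes, estimatedReclaimedBytes, targetBytes float64) float64 {
	return (liveBytes + estimatedReclaimedBytes) / targetBytes
}

func main() {
	// L0: a modest backlog relative to its threshold.
	l0 := levelScore(64<<20, 0, 32<<20)
	// L5: a huge volume of MVCC GC deletions balloons its compensated size.
	l5 := levelScore(4<<30, 20<<30, 8<<30)
	// L5's score wins, so compactions pick L5 and L0 starves.
	fmt.Printf("L0 score=%.1f, L5 score=%.1f\n", l0, l5)
}
```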

The second problem is GC writing too much to the LSM and causing overload. We don't have a fast reproduction for that. This should help: https://github.com/cockroachdb/cockroach/commit/08cc9b17ce0a5a61745f1880fe7d627704344875. It doesn't entirely fix the problem, but the customer we saw run into this issue also had the capacity to increase compaction concurrency. For now, I vote that we table the second problem and use compaction concurrency as a knob to alleviate it if it occurs again in the future.

We don't have evidence of GC issues due to CPU utilization.

sumeerbhola commented 1 year ago