
admission,kv,bulk: unify (local) store overload protection via admission control #75066

Closed sumeerbhola closed 2 years ago

sumeerbhola commented 2 years ago

Consider a store with the capacity to accept writes at a rate of R bytes/s. This is a thought exercise in that R is not fixed, and is affected by various factors like disk provisioning (which can dynamically change), whether the write is a batch written via the WAL or an ingested sstable (and how many bytes are landing in L0), and compaction concurrency adjustment. We have two categories of mechanisms that attempt to prevent store overload:

This setup has multiple deficiencies:

We have existing issues for

We propose to unify all these overload protection mechanisms such that there is one source of byte tokens representing what can be admitted and one queue of requests waiting for admission.
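
To make the shape of this concrete, here is a minimal sketch of a single source of byte tokens feeding one FIFO queue of waiting work. The types and method names are hypothetical, not the actual admission package API; the real system derives the token amount from LSM health signals, whereas here tokens are refilled manually.

```go
package main

import (
	"fmt"
	"sync"
)

// waiter is a request queued for admission, with its declared byte size.
type waiter struct {
	bytes int64
	ch    chan struct{}
}

// storeGranter is a toy single source of byte tokens for one store, with a
// single FIFO queue of work waiting for admission.
type storeGranter struct {
	mu      sync.Mutex
	tokens  int64
	waiters []waiter
}

// admit returns a channel that is closed once the request may proceed. Work
// that fits in the available tokens (and has nothing queued ahead of it) is
// admitted immediately.
func (g *storeGranter) admit(bytes int64) <-chan struct{} {
	g.mu.Lock()
	defer g.mu.Unlock()
	ch := make(chan struct{})
	if len(g.waiters) == 0 && g.tokens >= bytes {
		g.tokens -= bytes
		close(ch)
		return ch
	}
	g.waiters = append(g.waiters, waiter{bytes: bytes, ch: ch})
	return ch
}

// refill adds tokens (e.g. on each recomputation interval) and grants queued
// work in FIFO order while tokens remain.
func (g *storeGranter) refill(bytes int64) {
	g.mu.Lock()
	defer g.mu.Unlock()
	g.tokens += bytes
	for len(g.waiters) > 0 && g.tokens >= g.waiters[0].bytes {
		g.tokens -= g.waiters[0].bytes
		close(g.waiters[0].ch)
		g.waiters = g.waiters[1:]
	}
}

func main() {
	g := &storeGranter{tokens: 1 << 20} // 1MiB of byte tokens
	<-g.admit(512 << 10)                // fits, admitted immediately
	fmt.Println("512KiB write admitted")
	done := g.admit(1 << 20) // exceeds remaining tokens, queues
	g.refill(1 << 20)        // a later recomputation grants the waiter
	<-done
	fmt.Println("queued 1MiB write admitted after refill")
}
```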

Deficiencies:

cc: @erikgrinaker @dt @nvanbenschoten

Jira issue: CRDB-12450

Epic CRDB-14607

sumeerbhola commented 2 years ago

I've created a WIP PR for the admission control changes (it lacks tests and I need to fix existing tests, hence WIP) for early opinions: https://github.com/cockroachdb/cockroach/pull/75120

The most familiar reviewers for the admission control code are @RaduBerinde and @ajwerner, but (a) it may be premature for them to review the PR while we aren't settled on the wider approach, and (b) if we go ahead with this we should probably build some expertise with that code base in KV and Storage, since those are the teams that will be diagnosing how this behaves in production settings.

erikgrinaker commented 2 years ago

Overall, I'm very supportive of this proposal -- I don't see how we can properly avoid overload while maximizing throughput without having all work pass through a common throttling mechanism. As you point out, we'll likely need prioritization here too.

We will need to prevent double counting: requests that went through admission control above raft at the leaseholder should not be subject to admission control below raft at the same node.

There can be a significant time delay between a request being admitted in Node.Batch and when it's actually applied, due to e.g. latching and other queues. Does admission control already take this into account, and/or would additional below-Raft or just-above-Raft throttling be beneficial to manage this?

Admission control logic will be enhanced to compute byte-based tokens for store writes. Those requests that provide their byte size (which should be all large requests) will consume these tokens.

Compression complicates this. Most request payloads (e.g. Put) are not compressed, but AddSSTable payloads are, so they will be underweighted compared to other traffic. Do you have any thoughts on how to address that?

they will not call the subsequent WorkQueue that throttles KV work based on CPU

This could actually come in handy for AddSSTable. To make it MVCC-compliant, we will in some cases be rewriting the MVCC timestamps in the SST, which is a CPU-bound operation. We would ideally like to be able to throttle the CPU-bound and IO-bound portions of AddSSTable requests separately. It sounds like this might be a useful mechanism for that. CC @dt.

sumeerbhola commented 2 years ago

There can be a significant time delay between a request being admitted in Node.Batch and when it's actually applied, due to e.g. latching and other queues. Does admission control already take this into account, and/or would additional below-Raft or just-above-Raft throttling be beneficial to manage this?

It doesn't take this into account. However, the token calculation is done at 15s intervals, so unless the time delay you refer to is many seconds we should have both the admission decision and the storage work happen in the same interval. Are there any queues other than latching and locking? We ideally don't want to delay after acquiring a shared resource (latches/locks) since then we are holding back other work.
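
As a toy illustration of that interval-based token calculation, reusing the storeGranter sketch from earlier (the 15s interval is real; the helper name, token amount, and the context/time imports it assumes are illustrative only):

```go
// runTokenRecomputation periodically grants a fresh batch of byte tokens.
// In the real system the amount is derived from LSM signals (L0 file and
// sub-level counts, flush and compaction throughput); here it is a constant.
func runTokenRecomputation(ctx context.Context, g *storeGranter) {
	ticker := time.NewTicker(15 * time.Second)
	defer ticker.Stop()
	for {
		select {
		case <-ticker.C:
			g.refill(64 << 20) // e.g. 64MiB per interval; purely illustrative
		case <-ctx.Done():
			return
		}
	}
}
```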

Compression complicates this. Most request payloads (e.g. Put) are not compressed, but AddSSTable payloads are, so they will be underweighted compared to other traffic. Do you have any thoughts on how to address that?

The estimate that admission control currently uses for each work item (which is also the case in the PR) is based on compressed bytes (https://github.com/cockroachdb/cockroach/blob/bed5b793f9ed223b8133fad72f401ae8d02e8771/pkg/util/admission/granter.go#L1590), so it is comparable with AddSSTableRequest. If there are going to be other heavyweight write requests (GCRequest?), then just summing the bytes in the keys is not comparable. We could either (a) live with this until we notice it is a problem (overcounting here will just cause the estimates for requests that don't provide any byte size to be smaller), or (b) have MVCCGarbageCollect make a tighter size estimate by applying key prefix compression (and use this to return tokens post-work).
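
For the post-work token return in (b), the bookkeeping could be as simple as the following sketch, written against the hypothetical storeGranter from above rather than the actual granter interface:

```go
// adjustAfterWork reconciles the byte size claimed at admission time with the
// bytes actually written once the work is done. Over-estimates (e.g. because
// key prefix compression shrank an MVCCGarbageCollect batch) return surplus
// tokens; under-estimates reduce what is available to subsequent admissions.
func (g *storeGranter) adjustAfterWork(estimatedBytes, actualBytes int64) {
	g.mu.Lock()
	defer g.mu.Unlock()
	g.tokens += estimatedBytes - actualBytes
}
```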

This could actually come in handy for AddSSTable. To make it MVCC-compliant, we will in some cases be rewriting the MVCC timestamps in the SST, which is a CPU-bound operation. We would ideally like to be able to throttle the CPU-bound and IO-bound portions of AddSSTable requests separately. It sounds like this might be a useful mechanism for that.

AddSSTable at the proposer, which is doing the rewriting, is already subject to admission control on master. This means it first goes through the store-specific WorkQueue and consumes the estimated bytes (which will be a wrong number, but which this issue aims to correct), and then the CPU-bound WorkQueue and consumes a slot (that is returned only when it finishes executing). Parallelization of AddSSTable is going to make the latter incorrect since it is actually using multiple slots. If we know the concurrency it is going to use at admission time, this could be fixed.
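
A sketch of that two-stage sequencing, with the CPU side made concurrency-aware as suggested. These are hypothetical helpers layered on the storeGranter sketch above (plus an assumed context import); the real WorkQueue API differs.

```go
// admitAddSSTable first consumes byte tokens from the store's queue, then one
// CPU slot per rewrite goroutine, so a parallelized rewrite is not
// under-counted. The returned release function gives the CPU slots back once
// evaluation finishes.
func admitAddSSTable(
	ctx context.Context, store *storeGranter, cpuSlots chan struct{},
	sstBytes int64, rewriteConcurrency int,
) (release func(), err error) {
	// Stage 1: store-specific byte tokens for the ingested bytes.
	select {
	case <-store.admit(sstBytes):
	case <-ctx.Done():
		return nil, ctx.Err()
	}
	// Stage 2: CPU slots sized to the rewrite concurrency.
	for i := 0; i < rewriteConcurrency; i++ {
		select {
		case cpuSlots <- struct{}{}:
		case <-ctx.Done():
			for j := 0; j < i; j++ {
				<-cpuSlots // give back slots acquired so far
			}
			return nil, ctx.Err()
		}
	}
	return func() {
		for i := 0; i < rewriteConcurrency; i++ {
			<-cpuSlots
		}
	}, nil
}
```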

erikgrinaker commented 2 years ago

There can be a significant time delay between a request being admitted in Node.Batch and when it's actually applied, due to e.g. latching and other queues. Does admission control already take this into account, and/or would additional below-Raft or just-above-Raft throttling be beneficial to manage this?

It doesn't take this into account. However, the token calculation is done at 15s intervals, so unless the time delay you refer to is many seconds we should have both the admission decision and the storage work happen in the same interval. Are there any queues other than latching and locking?

Latching/locking is a major one, and in the case of long-running transactions these can be blocked for a substantial amount of time (e.g. minutes). If admission control does not take this into account it may be vulnerable to thundering herds. But there are also others, e.g. AddSSTable currently uses a semaphore to limit concurrent requests below the AdminKVWork() call in Node.Batch, and this can wait for a long time as well.

We ideally don't want to delay after acquiring a shared resource (latches/locks) since then we are holding back other work.

Yeah, I've been wondering if we might want a scheme which checks for any throttling right before or after latches have been acquired -- if throttled, the request would release its latches (if acquired) and go back to wait for resources and latches again. Of course, this could be vulnerable to starvation (depending on the queue policy), but it would avoid thundering herds as well as throttling below latches.
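
A rough sketch of that scheme, with hypothetical latch and overload interfaces (and glossing over the starvation and token-accounting questions a real implementation would have to answer):

```go
type latchGuard interface{ Release() }

type latchManager interface {
	Acquire(ctx context.Context) (latchGuard, error)
}

// executeWithReadmission waits for admission before acquiring latches, then
// re-checks for overload once latches are held. If the store has become
// overloaded in the meantime, it drops the latches and requeues rather than
// blocking conflicting work while throttled.
func executeWithReadmission(
	ctx context.Context, latches latchManager, g *storeGranter, bytes int64,
	overloaded func() bool, apply func() error,
) error {
	for {
		select {
		case <-g.admit(bytes):
		case <-ctx.Done():
			return ctx.Err()
		}
		lg, err := latches.Acquire(ctx)
		if err != nil {
			return err
		}
		if overloaded() {
			lg.Release()
			continue // go back to waiting for resources and latches
		}
		err = apply()
		lg.Release()
		return err
	}
}
```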

Compression complicates this. Most request payloads (e.g. Put) are not compressed, but AddSSTable payloads are, so they will be underweighted compared to other traffic. Do you have any thoughts on how to address that?

The estimate that admission control currently uses for each work item (which is also the case in the PR) is based on compressed bytes (https://github.com/cockroachdb/cockroach/blob/bed5b793f9ed223b8133fad72f401ae8d02e8771/pkg/util/admission/granter.go#L1590), so it is comparable with AddSSTableRequest. If there are going to be other heavyweight write requests (GCRequest?), then just summing the bytes in the keys is not comparable. We could either (a) live with this until we notice it is a problem (overcounting here will just cause the estimates for requests that don't provide any byte size to be smaller), or (b) have MVCCGarbageCollect make a tighter size estimate by applying key prefix compression (and use this to return tokens post-work).

I don't think we necessarily need to do anything about this now, but we should ideally have a common, comparable estimate of the amount of storage work required for a given request.

This could actually come in handy for AddSSTable. To make it MVCC-compliant, we will in some cases be rewriting the MVCC timestamps in the SST, which is a CPU-bound operation. We would ideally like to be able to throttle the CPU-bound and IO-bound portions of AddSSTable requests separately. It sounds like this might be a useful mechanism for that.

AddSSTable at the proposer, which is doing the rewriting, is already subject to admission control on master. This means it first goes through the store-specific WorkQueue and consumes the estimated bytes (which will be a wrong number, but which this issue aims to correct), and then the CPU-bound WorkQueue and consumes a slot (that is returned only when it finishes executing). Parallelization of AddSSTable is going to make the latter incorrect since it is actually using multiple slots. If we know the concurrency it is going to use at admission time, this could be fixed.

Nice -- we'll know the concurrency, so we can adjust for it. However, these requests will be split into a CPU-bound evaluation phase (proposer only) and an IO-bound replication/application phase. I believe this is generally the case for most other requests too, as we only replicate/apply the resulting Pebble batch which is IO bound. Does the work queue release the slot after evaluation, or only after the request returns? CC @dt.

sumeerbhola commented 2 years ago

Does the work queue release the slot after evaluation, or only after the request returns?

It returns the slot after evaluation.

sumeerbhola commented 2 years ago

The discussion on https://github.com/cockroachlabs/support/issues/1374 overlaps with the one here -- copy-pasting some of it here, about below-raft throttling, which seems to be generally necessary (due to the high rate of applying raft entries when a node restarts and catches up via the log). There is a debate about whether admission control should be involved.

I'm not sure if admission control is the right tool here though. We're talking about catching up on work that has already been accepted, not throttling new work. Since the only thing we can really do here is to detect overload and slow down log replay, it seems like Pebble should be able to do this on its own by backpressuring writers.

I'm not sure Pebble is the right answer for the same reason that admission control for write proposals was placed in CockroachDB: even below raft, there are many replicas sharing the same store, and if there is one replica seeing a large number of low priority index backfills, while another replica is seeing normal user-facing writes, one can reorder them in an admission control WorkQueue below raft. Pebble does not have the knowledge to do such reordering.
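
For illustration, the reordering CockroachDB could do below raft but Pebble cannot might look like a priority-ordered queue over per-replica application work. These are toy types; the real WorkQueue also considers tenant fairness and arrival order.

```go
package main

import (
	"container/heap"
	"fmt"
)

// applyWork is a unit of below-raft application work for some replica.
type applyWork struct {
	rangeID  int64
	priority int // e.g. low for index backfill ingestion, high for user-facing writes
	bytes    int64
}

// workHeap orders pending application work so that higher-priority work is
// applied first, regardless of which replica it belongs to.
type workHeap []applyWork

func (h workHeap) Len() int            { return len(h) }
func (h workHeap) Less(i, j int) bool  { return h[i].priority > h[j].priority }
func (h workHeap) Swap(i, j int)       { h[i], h[j] = h[j], h[i] }
func (h *workHeap) Push(x interface{}) { *h = append(*h, x.(applyWork)) }
func (h *workHeap) Pop() interface{} {
	old := *h
	w := old[len(old)-1]
	*h = old[:len(old)-1]
	return w
}

func main() {
	h := &workHeap{}
	heap.Init(h)
	heap.Push(h, applyWork{rangeID: 1, priority: 0, bytes: 8 << 20})  // index backfill
	heap.Push(h, applyWork{rangeID: 2, priority: 10, bytes: 4 << 10}) // user-facing write
	for h.Len() > 0 {
		w := heap.Pop(h).(applyWork)
		fmt.Printf("apply r%d (pri=%d, %d bytes)\n", w.rangeID, w.priority, w.bytes)
	}
}
```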

(@erikgrinaker let's continue that discussion on this issue)

dt commented 2 years ago

Do we think that a request is the unit of work we want to choose to process now vs later?

I wonder if requests like RevertRange or ExportRequest or even just larger vs smaller ScanRequests complicate things?

Like say we get an ExportRequest with a size limit and a timestamp predicate, and we currently have capacity, so we start evaluating it. As it is evaluating, its iterator keeps finding and opening SSTs, scanning them (or just their indexes) to see what meets the timestamp predicate, and finding a key here or there that meets it, but is not quickly hitting its size limit. In the meantime, a higher priority Get() request comes in, for a query that just needs one key. If our ExportRequest is still churning away, opening and reading blocks and using up all our CPU/disk bandwidth/IOPS/whatever, that Get() request will be negatively impacted.

Do we want to try to somehow let that Get() ask for priority? Should the longer-running ExportRequest be asking for some sort of quota as it runs e.g. each time its underlying LSM iterator wants to open another SST, or it has used some amount of computation time, or something like that?
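
One shape the per-unit-of-work quota idea could take is a long-running request that paces itself every fixed chunk of reads, so higher-priority work can be interleaved. The pacer and iterator interfaces below are hypothetical, not an actual CockroachDB API, and assume a context import.

```go
// workPacer grants permission to do another chunk of work; a long-running
// request yields here so higher-priority requests (e.g. a single-key Get) are
// not starved behind it.
type workPacer interface {
	Pace(ctx context.Context, estimatedWork int64) error
}

// exportIter abstracts the scan described above.
type exportIter interface {
	Valid() bool
	Next()
	EmitIfMatches() int64 // bytes emitted if the key meets the predicate, else 0
	BytesRead() int64     // bytes read from storage by the last step
}

// exportScan paces itself roughly every 1MiB of data read, instead of running
// until its size limit is hit while holding on to CPU and disk bandwidth.
func exportScan(ctx context.Context, pacer workPacer, it exportIter, sizeLimit int64) error {
	const chunk = 1 << 20
	var readSincePace, emitted int64
	for it.Valid() && emitted < sizeLimit {
		if readSincePace >= chunk {
			if err := pacer.Pace(ctx, chunk); err != nil {
				return err
			}
			readSincePace = 0
		}
		emitted += it.EmitIfMatches()
		readSincePace += it.BytesRead()
		it.Next()
	}
	return nil
}
```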

erikgrinaker commented 2 years ago

I'm not sure Pebble is the right answer for the same reason that admission control for write proposals was placed in CockroachDB: even below raft, there are many replicas sharing the same store, and if there is one replica seeing a large number of low priority index backfills, while another replica is seeing normal user-facing writes, one can reorder them in an admission control WorkQueue below raft. Pebble does not have the knowledge to do such reordering.

Yeah, this does make sense when considering multiple ranges and stores, and work priorities between them. We'll need to avoid double counting above/below Raft though.

Do you have any thoughts on cross-node admission control? I.e. could it replace the Raft quota pool with knowledge about followers' store health and Raft log state, or would we need to combine/augment them?

Do we think that a request is the unit of work we want to choose to process now vs later?

We should try to be smarter than that.

We have a similar problem with range load metrics, where we currently consider QPS to be the rate of BatchRequests processed (#50620). This is obviously pretty bad -- both because the number of individual requests in a batch varies significantly, and also because the work done by any individual request varies significantly (e.g. Get vs Scan). This is what the allocator uses to decide e.g. replica placement and load-based splits, leading to some pretty bad decisions at times.

We now have multiple differing measures of "work" (QPS and WPS, RUs, admission control tokens), which could get confusing and hard to reason about. We should try to harmonize these somehow -- even though e.g. QPS has the luxury of being after-the-fact and can rely on actual measurements.

sumeerbhola commented 2 years ago

Do we think that a request is the unit of work we want to choose to process now vs later?

We should try to be smarter than that. We have a similar problem with range load metrics, where we currently consider QPS to be the rate of BatchRequests processed (#50620).

I have no objection to being smarter :), provided (a) we find that we need to be, based on experiments, and (b) we can figure out something effective.

erikgrinaker commented 2 years ago

We may want to include disk usage protection here as well, to avoid running the node out of disk with e.g. index backfills. Wrote up a proposal in #79210.

tbg commented 2 years ago

What is remaining in scope for this issue after https://github.com/cockroachdb/cockroach/issues/79092#issuecomment-1117372115 is addressed (i.e. at which point bulk requests are "just" throttled in admission control and nowhere else)? I am trying to clean up the many overlapping conversations we are having.

My understanding is that this issue only deals with unifying bulk requests with local admission control (i.e. remove any special casing outside of admission control for these requests). The other big issue we have is https://github.com/cockroachdb/cockroach/issues/79215, for which I just completed cleaning up the initial post and breaking out sub-issues. That issue tracks the short-term (i.e. this release) plan for dealing with appends and snapshots.

Then, there is also https://github.com/cockroachdb/cockroach/issues/79755 which is roughly about how to do "distributed admission control", i.e. taking follower health into account in a less ad-hoc way, and this is for now unplanned (past what's required as part of #79755).

In my understanding this all fits together then, but if what I think this issue is about isn't correct, please correct me.

sumeerbhola commented 2 years ago

What is remaining in scope for this issue after ...

This issue was meant to be an umbrella issue for removing all/most throttling knobs.

irfansharif commented 2 years ago

We should finally nail down what we want here. Read and write bytes from the store, per replica, are definitely within reach. CPU-seconds per replica has been the hard one, without golang changes.

Connecting some dots: see https://github.com/cockroachdb/cockroach/pull/82356 which proposes an instrumented runtime to get fine-grained CPU attribution. We're hoping to do something here for 22.2.

tbg commented 2 years ago

In https://github.com/cockroachdb/cockroach/issues/79215#issuecomment-1168031561 we are discussing an approach to protect followers from below-raft traffic when they're overloaded; we could examine whether this is enough to remove PreIngestDelay from addSSTablePreApply.

We should also consider removing writeFileSyncing from that method, which uses a limiter shared across the store, and which has the potential to severely slow down raft handling across the board (which has severe knock-on effects). Instead, we should stage these writes outside of raft (with proper rate limiting, going through admission control but keeping only a finite buffer & then dropping when over), but that is somewhat more involved.

Filed https://github.com/cockroachdb/cockroach/issues/83485 to add a reliable test for investigating AddSST overload.

irfansharif commented 2 years ago

This issue has a lot of really good discussion, and there were a lot of follow-on issues filed for specific takeaways. I've read through it a few more times to make sure nothing fell through the cracks.

Like say we get an ExportRequest with a size limit and a timestamp predicate, and we currently have capacity, so we start evaluating it. As it is evaluating, its iterator keeps finding and opening SSTs, scanning them (or just their indexes) to see what meets the timestamp predicate, and finding a key here or there that meets it, but is not quickly hitting its size limit. In the meantime, a higher priority Get() request comes in, for a query that just needs one key. If our ExportRequest is still churning away, opening and reading blocks and using up all our CPU/disk bandwidth/IOPS/whatever, that Get() request will be negatively impacted.

https://github.com/cockroachdb/cockroach/pull/86638 addresses the backup case using a form of cooperative scheduling, bounding how much work a single export request can do.

To make it MVCC-compliant, we will in some cases be rewriting the MVCC timestamps in the SST, which is a CPU-bound operation. We would ideally like to be able to throttle the CPU-bound and IO-bound portions of AddSSTable requests separately.

The detail around AddSST key-rewriting post MVCC-ification, and it being CPU-intensive, is also something that can integrate into the elastic CPU tokens machinery introduced above. There are other factors around disk use noted above and in the relevant issues (a bunch of which are being collected in https://github.com/orgs/cockroachdb/projects/32/views/1). The next set of things (till mid-Jan, after 22.2 stability and EOY holidays) we're planning to look at are focused on index backfills, which will include taking a better look at AddSSTs. We're not looking at using AC for disk storage control soon, or for snapshots (the hope is that https://github.com/cockroachdb/pebble/issues/1683 makes it less of a problem, and it is being worked on in 23.1). The outstanding follower writes throttling is being tracked on the Repl side for now. Throttling of the replica/MVCC GC queue is not in near-term scope (and is perhaps something https://github.com/cockroachdb/cockroach/pull/42514 can push further down the line). All the other discussion around replica load is interesting and can be continued elsewhere.

Closing this issue. We should continue removing throttling knobs, keeping them around just for escalations. Let's file specific issues for specific ones if we've missed any.

daniel-crlabs commented 1 year ago

@lancel66 (Lance Lierheimer) has asked the following question (affecting ZD 15188) about this GH issue:

Question on this: What is the effort to backport #75066 to v22.1? Customer will not be moving to v22.2 for another 6 months or so.

irfansharif commented 1 year ago

@daniel-crlabs, that issue talks about latency spikes during backups. That indeed was work that was done under this admission control issue, specifically this PR: https://github.com/cockroachdb/cockroach/pull/86638. It cannot be backported to 22.1 since it uses a patched Go runtime and a newer release of Go. Feel free to send them this blog post we wrote about this work specifically: https://www.cockroachlabs.com/blog/rubbing-control-theory/.

daniel-crlabs commented 1 year ago

Sounds great, thank you for the response, I'll pass this along to the CEA and the customer.