
admission,kv,bulk: unify (local) store overload protection via admission control #75066

Closed sumeerbhola closed 2 years ago

sumeerbhola commented 2 years ago

Consider a store with the capacity to accept writes at a rate of R bytes/s. This is a thought exercise in that R is not fixed, and is affected by various factors like disk provisioning (which can dynamically change), whether the write is a batch written via the WAL or an ingested sstable (and how many bytes are landing in L0), and compaction concurrency adjustment. We have two categories of mechanisms that attempt to prevent store overload:

This setup has multiple deficiencies:

We have existing issues for

We propose to unify all these overload protection mechanisms such that there is one source of byte tokens representing what can be admitted and one queue of requests waiting for admission.
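
To make the shape of this concrete, here is a minimal sketch of a single source of byte tokens feeding one FIFO queue of waiting work. The types and method names are hypothetical, not the actual admission package API; the real system derives the token amount from LSM health signals, whereas here tokens are refilled manually.

```go
package main

import (
	"fmt"
	"sync"
)

// waiter is a request queued for admission, with its declared byte size.
type waiter struct {
	bytes int64
	ch    chan struct{}
}

// storeGranter is a toy single source of byte tokens for one store, with a
// single FIFO queue of work waiting for admission.
type storeGranter struct {
	mu      sync.Mutex
	tokens  int64
	waiters []waiter
}

// admit returns a channel that is closed once the request may proceed. Work
// that fits in the available tokens (and has nothing queued ahead of it) is
// admitted immediately.
func (g *storeGranter) admit(bytes int64) <-chan struct{} {
	g.mu.Lock()
	defer g.mu.Unlock()
	ch := make(chan struct{})
	if len(g.waiters) == 0 && g.tokens >= bytes {
		g.tokens -= bytes
		close(ch)
		return ch
	}
	g.waiters = append(g.waiters, waiter{bytes: bytes, ch: ch})
	return ch
}

// refill adds tokens (e.g. on each recomputation interval) and grants queued
// work in FIFO order while tokens remain.
func (g *storeGranter) refill(bytes int64) {
	g.mu.Lock()
	defer g.mu.Unlock()
	g.tokens += bytes
	for len(g.waiters) > 0 && g.tokens >= g.waiters[0].bytes {
		g.tokens -= g.waiters[0].bytes
		close(g.waiters[0].ch)
		g.waiters = g.waiters[1:]
	}
}

func main() {
	g := &storeGranter{tokens: 1 << 20} // 1MiB of byte tokens
	<-g.admit(512 << 10)                // fits, admitted immediately
	fmt.Println("512KiB write admitted")
	done := g.admit(1 << 20) // exceeds remaining tokens, queues
	g.refill(1 << 20)        // a later recomputation grants the waiter
	<-done
	fmt.Println("queued 1MiB write admitted after refill")
}
```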

Deficiencies:

cc: @erikgrinaker @dt @nvanbenschoten

Jira issue: CRDB-12450

Epic CRDB-14607

sumeerbhola commented 2 years ago

I've created a WIP PR for the admission control changes (it lacks tests and I need to fix existing tests, hence WIP) for early opinions: https://github.com/cockroachdb/cockroach/pull/75120

The most familiar reviewers for the admission control code are @RaduBerinde and @ajwerner, but (a) it may be premature for them to review the PR while we aren't settled on the wider approach, and (b) if we go ahead with this we should probably build some expertise with that code base in KV and Storage, since those are the teams that will be diagnosing how this behaves in production settings.

erikgrinaker commented 2 years ago

Overall, I'm very supportive of this proposal -- I don't see how we can properly avoid overload while maximizing throughput without having all work pass through a common throttling mechanism. As you point out, we'll likely need prioritization here too.

We will need to prevent double counting: requests that went through admission control above raft at the leaseholder should not be subject to admission control below raft at the same node.

There can be a significant time delay between a request being admitted in Node.Batch and when it's actually applied, due to e.g. latching and other queues. Does admission control already take this into account, and/or would additional below-Raft or just-above-Raft throttling be beneficial to manage this?

Admission control logic will be enhanced to compute byte-based tokens for store writes. Those requests that provide their byte size (which should be all large requests) will consume these tokens.

Compression complicates this. Most request payloads (e.g. Put) are not compressed, but AddSSTable payloads are, so they will be underweighted compared to other traffic. Do you have any thoughts on how to address that?

they will not call the subsequent WorkQueue that throttles KV work based on CPU

This could actually come in handy for AddSSTable. To make it MVCC-compliant, we will in some cases be rewriting the MVCC timestamps in the SST, which is a CPU-bound operation. We would ideally like to be able to throttle the CPU-bound and IO-bound portions of AddSSTable requests separately. It sounds like this might be a useful mechanism for that. CC @dt.

sumeerbhola commented 2 years ago

There can be a significant time delay between a request being admitted in Node.Batch and when it's actually applied, due to e.g. latching and other queues. Does admission control already take this into account, and/or would additional below-Raft or just-above-Raft throttling be beneficial to manage this?

It doesn't take this into account. However, the token calculation is done at 15s intervals, so unless the time delay you refer to is many seconds we should have both the admission decision and the storage work happen in the same interval. Are there any queues other than latching and locking? We ideally don't want to delay after acquiring a shared resource (latches/locks) since then we are holding back other work.
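
As a toy illustration of that interval-based token calculation, reusing the storeGranter sketch from earlier (the 15s interval is real; the helper name, token amount, and the context/time imports it assumes are illustrative only):

```go
// runTokenRecomputation periodically grants a fresh batch of byte tokens.
// In the real system the amount is derived from LSM signals (L0 file and
// sub-level counts, flush and compaction throughput); here it is a constant.
func runTokenRecomputation(ctx context.Context, g *storeGranter) {
	ticker := time.NewTicker(15 * time.Second)
	defer ticker.Stop()
	for {
		select {
		case <-ticker.C:
			g.refill(64 << 20) // e.g. 64MiB per interval; purely illustrative
		case <-ctx.Done():
			return
		}
	}
}
```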

Compression complicates this. Most request payloads (e.g. Put) are not compressed, but AddSSTable payloads are, so they will be underweighted compared to other traffic. Do you have any thoughts on how to address that?

The estimate that admission control currently uses for each work item (which is also the case in the PR) is based on compressed bytes (https://github.com/cockroachdb/cockroach/blob/bed5b793f9ed223b8133fad72f401ae8d02e8771/pkg/util/admission/granter.go#L1590), so it is comparable with AddSSTableRequest. If there are going to be other heavyweight write requests (GCRequest?), then just summing the bytes in the keys is not comparable. We could either (a) live with this until we notice it is a problem (overcounting here will just cause the estimates for requests that don't provide any byte size to be smaller), or (b) have MVCCGarbageCollect make a tighter size estimate by applying key prefix compression (and use this to return tokens post-work).
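
For the post-work token return in (b), the bookkeeping could be as simple as the following sketch, written against the hypothetical storeGranter from above rather than the actual granter interface:

```go
// adjustAfterWork reconciles the byte size claimed at admission time with the
// bytes actually written once the work is done. Over-estimates (e.g. because
// key prefix compression shrank an MVCCGarbageCollect batch) return surplus
// tokens; under-estimates reduce what is available to subsequent admissions.
func (g *storeGranter) adjustAfterWork(estimatedBytes, actualBytes int64) {
	g.mu.Lock()
	defer g.mu.Unlock()
	g.tokens += estimatedBytes - actualBytes
}
```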

This could actually come in handy for AddSSTable. To make it MVCC-compliant, we will in some cases be rewriting the MVCC timestamps in the SST, which is a CPU-bound operation. We would ideally like to be able to throttle the CPU-bound and IO-bound portions of AddSSTable requests separately. It sounds like this might be a useful mechanism for that.

AddSSTable at the proposer, which is doing the rewriting, is already subject to admission control on master. This means it first goes through the store-specific WorkQueue and consumes the estimated bytes (which will be a wrong number, but which this issue aims to correct), and then the CPU-bound WorkQueue and consumes a slot (that is returned only when it finishes executing). Parallelization of AddSSTable is going to make the latter incorrect since it is actually using multiple slots. If we know the concurrency it is going to use at admission time, this could be fixed.
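
A sketch of that two-stage sequencing, with the CPU side made concurrency-aware as suggested. These are hypothetical helpers layered on the storeGranter sketch above (plus an assumed context import); the real WorkQueue API differs.

```go
// admitAddSSTable first consumes byte tokens from the store's queue, then one
// CPU slot per rewrite goroutine, so a parallelized rewrite is not
// under-counted. The returned release function gives the CPU slots back once
// evaluation finishes.
func admitAddSSTable(
	ctx context.Context, store *storeGranter, cpuSlots chan struct{},
	sstBytes int64, rewriteConcurrency int,
) (release func(), err error) {
	// Stage 1: store-specific byte tokens for the ingested bytes.
	select {
	case <-store.admit(sstBytes):
	case <-ctx.Done():
		return nil, ctx.Err()
	}
	// Stage 2: CPU slots sized to the rewrite concurrency.
	for i := 0; i < rewriteConcurrency; i++ {
		select {
		case cpuSlots <- struct{}{}:
		case <-ctx.Done():
			for j := 0; j < i; j++ {
				<-cpuSlots // give back slots acquired so far
			}
			return nil, ctx.Err()
		}
	}
	return func() {
		for i := 0; i < rewriteConcurrency; i++ {
			<-cpuSlots
		}
	}, nil
}
```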

erikgrinaker commented 2 years ago

There can be a significant time delay between a request being admitted in Node.Batch and when it's actually applied, due to e.g. latching and other queues. Does admission control already take this into account, and/or would additional below-Raft or just-above-Raft throttling be beneficial to manage this?

It doesn't take this into account. However, the token calculation is done at 15s intervals, so unless the time delay you refer to is many seconds we should have both the admission decision and the storage work happen in the same interval. Are there any queues other than latching and locking?

Latching/locking is a major one, and in the case of long-running transactions these can be blocked for a substantial amount of time (e.g. minutes). If admission control does not take this into account it may be vulnerable to thundering herds. But there are also others, e.g. AddSSTable currently uses a semaphore to limit concurrent requests below the AdminKVWork() call in Node.Batch, and this can wait for a long time as well.

We ideally don't want to delay after acquiring a shared resource (latches/locks) since then we are holding back other work.

Yeah, I've been wondering if we might want a scheme which checks for any throttling right before or after latches have been acquired -- if throttled, the request would release its latches (if acquired) and go back to wait for resources and latches again. Of course, this could be vulnerable to starvation (depending on the queue policy), but it would avoid thundering herds as well as throttling below latches.
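
A rough sketch of that scheme, with hypothetical latch and overload interfaces (and glossing over the starvation and token-accounting questions a real implementation would have to answer):

```go
type latchGuard interface{ Release() }

type latchManager interface {
	Acquire(ctx context.Context) (latchGuard, error)
}

// executeWithReadmission waits for admission before acquiring latches, then
// re-checks for overload once latches are held. If the store has become
// overloaded in the meantime, it drops the latches and requeues rather than
// blocking conflicting work while throttled.
func executeWithReadmission(
	ctx context.Context, latches latchManager, g *storeGranter, bytes int64,
	overloaded func() bool, apply func() error,
) error {
	for {
		select {
		case <-g.admit(bytes):
		case <-ctx.Done():
			return ctx.Err()
		}
		lg, err := latches.Acquire(ctx)
		if err != nil {
			return err
		}
		if overloaded() {
			lg.Release()
			continue // go back to waiting for resources and latches
		}
		err = apply()
		lg.Release()
		return err
	}
}
```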

Compression complicates this. Most request payloads (e.g. Put) are not compressed, but AddSSTable payloads are, so they will be underweighted compared to other traffic. Do you have any thoughts on how to address that?

The estimate that admission control currently uses for each work item (which is also the case in the PR) is based on compressed bytes (https://github.com/cockroachdb/cockroach/blob/bed5b793f9ed223b8133fad72f401ae8d02e8771/pkg/util/admission/granter.go#L1590), so it is comparable with AddSSTableRequest. If there are going to be other heavyweight write requests (GCRequest?), then just summing the bytes in the keys is not comparable. We could either (a) live with this until we notice it is a problem (overcounting here will just cause the estimates for requests that don't provide any byte size to be smaller), or (b) have MVCCGarbageCollect make a tighter size estimate by applying key prefix compression (and use this to return tokens post-work).

I don't think we necessarily need to do anything about this now, but we should ideally have a common, comparable estimate of the amount of storage work required for a given request.

This could actually come in handy for AddSSTable. To make it MVCC-compliant, we will in some cases be rewriting the MVCC timestamps in the SST, which is a CPU-bound operation. We would ideally like to be able to throttle the CPU-bound and IO-bound portions of AddSSTable requests separately. It sounds like this might be a useful mechanism for that.

AddSSTable at the proposer, which is doing the rewriting, is already subject to admission control on master. This means it first goes through the store-specific WorkQueue and consumes the estimated bytes (which will be a wrong number, but which this issue aims to correct), and then the CPU-bound WorkQueue and consumes a slot (that is returned only when it finishes executing). Parallelization of AddSSTable is going to make the latter incorrect since it is actually using multiple slots. If we know the concurrency it is going to use at admission time, this could be fixed.

Nice -- we'll know the concurrency, so we can adjust for it. However, these requests will be split into a CPU-bound evaluation phase (proposer only) and an IO-bound replication/application phase. I believe this is generally the case for most other requests too, as we only replicate/apply the resulting Pebble batch which is IO bound. Does the work queue release the slot after evaluation, or only after the request returns? CC @dt.

sumeerbhola commented 2 years ago

Does the work queue release the slot after evaluation, or only after the request returns?

It returns the slot after evaluation.

sumeerbhola commented 2 years ago

The discussion on https://github.com/cockroachlabs/support/issues/1374 overlaps with the one here -- copy-pasting some of it here, about below-raft throttling, which seems to be generally necessary (due to the high rate of applying raft entries when a node restarts and catches up via the log). There is a debate about whether admission control should be involved.

I'm not sure if admission control is the right tool here though. We're talking about catching up on work that has already been accepted, not throttling new work. Since the only thing we can really do here is to detect overload and slow down log replay, it seems like Pebble should be able to do this on its own by backpressuring writers.

I'm not sure Pebble is the right answer for the same reason that admission control for write proposals was placed in CockroachDB: even below raft, there are many replicas sharing the same store, and if there is one replica seeing a large number of low priority index backfills, while another replica is seeing normal user-facing writes, one can reorder them in an admission control WorkQueue below raft. Pebble does not have the knowledge to do such reordering.
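
For illustration, the reordering CockroachDB could do below raft but Pebble cannot might look like a priority-ordered queue over per-replica application work. These are toy types; the real WorkQueue also considers tenant fairness and arrival order.

```go
package main

import (
	"container/heap"
	"fmt"
)

// applyWork is a unit of below-raft application work for some replica.
type applyWork struct {
	rangeID  int64
	priority int // e.g. low for index backfill ingestion, high for user-facing writes
	bytes    int64
}

// workHeap orders pending application work so that higher-priority work is
// applied first, regardless of which replica it belongs to.
type workHeap []applyWork

func (h workHeap) Len() int            { return len(h) }
func (h workHeap) Less(i, j int) bool  { return h[i].priority > h[j].priority }
func (h workHeap) Swap(i, j int)       { h[i], h[j] = h[j], h[i] }
func (h *workHeap) Push(x interface{}) { *h = append(*h, x.(applyWork)) }
func (h *workHeap) Pop() interface{} {
	old := *h
	w := old[len(old)-1]
	*h = old[:len(old)-1]
	return w
}

func main() {
	h := &workHeap{}
	heap.Init(h)
	heap.Push(h, applyWork{rangeID: 1, priority: 0, bytes: 8 << 20})  // index backfill
	heap.Push(h, applyWork{rangeID: 2, priority: 10, bytes: 4 << 10}) // user-facing write
	for h.Len() > 0 {
		w := heap.Pop(h).(applyWork)
		fmt.Printf("apply r%d (pri=%d, %d bytes)\n", w.rangeID, w.priority, w.bytes)
	}
}
```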

(@erikgrinaker let's continue that discussion on this issue)

dt commented 2 years ago

Do we think that a request is the unit of work we want to choose to process now vs later?

I wonder if requests like RevertRange or ExportRequest or even just larger vs smaller ScanRequests complicate things?

Like say we get an ExportRequest with a size limit and a timestamp predicate, and we currently have capacity, so we start evaluating it. As it is evaluating, its iterator keeps finding and opening SSTs, scanning them (or just their indexes) to see what meets the timestamp predicate, and finding a key here or there that meets it, but is not quickly hitting its size limit. In the meantime, a higher priority Get() request comes in, for a query that just needs one key. If our ExportRequest is still churning away, opening and reading blocks and using up all our CPU/disk bandwidth/IOPS/whatever, that Get() request will be negatively impacted.

Do we want to try to somehow let that Get() ask for priority? Should the longer-running ExportRequest be asking for some sort of quota as it runs e.g. each time its underlying LSM iterator wants to open another SST, or it has used some amount of computation time, or something like that?
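
One shape the per-unit-of-work quota idea could take is a long-running request that paces itself every fixed chunk of reads, so higher-priority work can be interleaved. The pacer and iterator interfaces below are hypothetical, not an actual CockroachDB API, and assume a context import.

```go
// workPacer grants permission to do another chunk of work; a long-running
// request yields here so higher-priority requests (e.g. a single-key Get) are
// not starved behind it.
type workPacer interface {
	Pace(ctx context.Context, estimatedWork int64) error
}

// exportIter abstracts the scan described above.
type exportIter interface {
	Valid() bool
	Next()
	EmitIfMatches() int64 // bytes emitted if the key meets the predicate, else 0
	BytesRead() int64     // bytes read from storage by the last step
}

// exportScan paces itself roughly every 1MiB of data read, instead of running
// until its size limit is hit while holding on to CPU and disk bandwidth.
func exportScan(ctx context.Context, pacer workPacer, it exportIter, sizeLimit int64) error {
	const chunk = 1 << 20
	var readSincePace, emitted int64
	for it.Valid() && emitted < sizeLimit {
		if readSincePace >= chunk {
			if err := pacer.Pace(ctx, chunk); err != nil {
				return err
			}
			readSincePace = 0
		}
		emitted += it.EmitIfMatches()
		readSincePace += it.BytesRead()
		it.Next()
	}
	return nil
}
```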

erikgrinaker commented 2 years ago

I'm not sure Pebble is the right answer for the same reason that admission control for write proposals was placed in CockroachDB: even below raft, there are many replicas sharing the same store, and if there is one replica seeing a large number of low priority index backfills, while another replica is seeing normal user-facing writes, one can reorder them in an admission control WorkQueue below raft. Pebble does not have the knowledge to do such reordering.

Yeah, this does make sense when considering multiple ranges and stores, and work priorities between them. We'll need to avoid double counting above/below Raft though.

Do you have any thoughts on cross-node admission control? I.e. could it replace the Raft quota pool with knowledge about followers' store health and Raft log state, or would we need to combine/augment them?

Do we think that a request is the unit of work we want to choose to process now vs later?

We should try to be smarter than that.

We have a similar problem with range load metrics, where we currently consider QPS to be the rate of BatchRequests processed (#50620). This is obviously pretty bad -- both because the number of individual requests in a batch varies significantly, and also because the work done by any individual request varies significantly (e.g. Get vs Scan). This is what the allocator uses to decide e.g. replica placement and load-based splits, leading to some pretty bad decisions at times.

We now have multiple differing measures of "work" (QPS and WPS, RUs, admission control tokens), which could get confusing and hard to reason about. We should try to harmonize these somehow -- even though e.g. QPS has the luxury of being after-the-fact and can rely on actual measurements.

sumeerbhola commented 2 years ago

Do we think that a request is the unit of work we want to choose to process now vs later?

We should try to be smarter than that. We have a similar problem with range load metrics, where we currently consider QPS to be the rate of BatchRequests processed (#50620).

I have no objection to being smarter :), provided (a) we find that we need to be, based on experiments, and (b) we can figure out something effective.

erikgrinaker commented 2 years ago

We may want to include disk usage protection here as well, to avoid running the node out of disk with e.g. index backfills. Wrote up a proposal in #79210.

tbg commented 2 years ago

What is remaining in scope for this issue after https://github.com/cockroachdb/cockroach/issues/79092#issuecomment-1117372115 is addressed (i.e. at which point bulk requests are "just" throttled in admission control and nowhere else)? I am trying to clean up the many overlapping conversations we are having.

My understanding is that this issue only deals with unifying bulk requests with local admission control (i.e. remove any special casing outside of admission control for these requests). The other big issue we have is https://github.com/cockroachdb/cockroach/issues/79215, for which I just completed cleaning up the initial post and breaking out sub-issues. That issue tracks the short-term (i.e. this release) plan for dealing with appends and snapshots.

Then, there is also https://github.com/cockroachdb/cockroach/issues/79755 which is roughly about how to do "distributed admission control", i.e. taking follower health into account in a less ad-hoc way, and this is for now unplanned (past what's required as part of #79755).

In my understanding this all fits together then, but if what I think this issue is about isn't correct, please correct me.

sumeerbhola commented 2 years ago

What is remaining in scope for this issue after ...

This issue was meant to be an umbrella issue for removing all/most throttling knobs.

irfansharif commented 2 years ago

We should finally nail down what we want here. Read and write bytes from the store, per replica, are definitely within reach. CPU-seconds per replica has been the hard one, without golang changes.

Connecting some dots: see https://github.com/cockroachdb/cockroach/pull/82356 which proposes an instrumented runtime to get fine-grained CPU attribution. We're hoping to do something here for 22.2.

tbg commented 2 years ago

In https://github.com/cockroachdb/cockroach/issues/79215#issuecomment-1168031561 we are discussing an approach to protect followers from below-raft traffic when they're overloaded; we could examine whether this is enough to remove PreIngestDelay from addSSTablePreApply.

We should also consider removing writeFileSyncing from that method, which uses a limiter shared across the store, and which has the potential to severely slow down raft handling across the board (which has severe knock-on effects). Instead, we should stage these writes outside of raft (with proper rate limiting, going through admission control but keeping only a finite buffer & then dropping when over), but that is somewhat more involved.

Filed https://github.com/cockroachdb/cockroach/issues/83485 to add a reliable test for investigating AddSST overload.

irfansharif commented 2 years ago

This issue has a lot of really good discussion, and there were a lot of follow-on issues filed for specific takeaways. I've read through it a few more times to make sure nothing fell through the cracks.

Like say we get an ExportRequest with a size limit and a timestamp predicate, and we currently have capacity, so we start evaluating it. As it is evaluating, its iterator keeps finding and opening SSTs, scanning them (or just their indexes) to see what meets the timestamp predicate, and finding a key here or there that meets it, but is not quickly hitting its size limit. In the meantime, a higher priority Get() request comes in, for a query that just needs one key. If our ExportRequest is still churning away, opening and reading blocks and using up all our CPU/disk bandwidth/IOPS/whatever, that Get() request will be negatively impacted.

https://github.com/cockroachdb/cockroach/pull/86638 addresses the backup case using a form of cooperative scheduling, bounding how much work a single export request can do.

To make it MVCC-compliant, we will in some cases be rewriting the MVCC timestamps in the SST, which is a CPU-bound operation. We would ideally like to be able to throttle the CPU-bound and IO-bound portions of AddSSTable requests separately.

The detail around AddSST key-rewriting post MVCC-ification, and it being CPU-intensive, is also something that can integrate into the elastic CPU tokens machinery introduced above. There are other factors around disk use noted above and in the relevant issues (a bunch of which are being collected in https://github.com/orgs/cockroachdb/projects/32/views/1). The next set of things (till mid-Jan, after 22.2 stability and EOY holidays) we're planning to look at are focused on index backfills, which will include taking a better look at AddSSTs. We're not looking at using AC for disk storage control soon, or for snapshots (the hope is that https://github.com/cockroachdb/pebble/issues/1683 makes it less of a problem, and it is being worked on in 23.1). The outstanding follower writes throttling is being tracked on the Repl side for now. Throttling of the replica/MVCC GC queue is not in near-term scope (and is perhaps something https://github.com/cockroachdb/cockroach/pull/42514 can push further down the line). All the other discussion around replica load is interesting and can be continued elsewhere.

Closing this issue. We should continue removing throttling knobs, keeping them around just for escalations. Let's file specific issues for specific ones if we've missed any.

daniel-crlabs commented 1 year ago

@lancel66 (Lance Lierheimer) has asked the following question (affecting ZD 15188) about this GH issue:

Question on this: What is the effort to backport #75066 to v22.1? Customer will not be moving to v22.2 for another 6 months or so.

irfansharif commented 1 year ago

@daniel-crlabs, that issue talks about latency spikes during backups. That indeed was work that was done under this admission control issue, specifically this PR: https://github.com/cockroachdb/cockroach/pull/86638. It cannot be backported to 22.1 since it uses a patched Go runtime and a newer release of Go. Feel free to send them this blog post we wrote about this work specifically: https://www.cockroachlabs.com/blog/rubbing-control-theory/.

daniel-crlabs commented 1 year ago

Sounds great, thank you for the response, I'll pass this along to the CEA and the customer.