cockroachdb / cockroach

CockroachDB — the cloud native, distributed SQL database designed for high availability, effortless scale, and control over data placement.
https://www.cockroachlabs.com

allocator,admission: consider resource utilization + throttling signals directly #83490

Open irfansharif opened 2 years ago

irfansharif commented 2 years ago

Is your feature request related to a problem? Please describe.

Allocation in CRDB is expressed in terms of an abstract '# of batch requests' unit, which as a measure can be fairly divorced from actual hardware consumption. It's difficult to tune (impossible to normalize to capacity), hard to reason about, and lends itself to awkward calibration in practice (https://github.com/cockroachdb/cockroach/pull/76252).

Describe the solution you'd like

Model allocation directly in terms of resource utilization, without collapsing the different resource dimensions (disk bandwidth, IOPS, CPU) into a single unit. Allocation should also be thought of as operating on a layer "above" admission control: AC introduces artificial delays to prevent node overload, and ignoring this throttling would leave us unable to distinguish between two replicas with an identical rate of resource use where one of them could be pushing a much higher rate were it placed elsewhere with headroom. At a high level, we should develop and use measures for:

It's worth exploring approaches for what allocation could look like when juggling discrete resource dimensions.
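To make the idea above concrete, here's a minimal sketch (not CRDB code; all type and function names are hypothetical) of what throttling-aware, per-dimension load signals could look like: demand is kept separate per resource dimension and normalized to node capacity, and the observed rate is inflated by the fraction of time requests spent queued in admission control, so a throttled replica's latent demand becomes visible.

```go
package main

import "fmt"

// ReplicaLoad is a hypothetical per-replica load signal, one entry per
// resource dimension, kept separate rather than collapsed into a single
// abstract unit.
type ReplicaLoad struct {
	CPUSecsPerSec  float64 // CPU-seconds consumed per wall second
	DiskBandwidth  float64 // bytes/sec read+written
	IOPS           float64
	ACWaitFraction float64 // fraction of wall time spent queued in AC, in [0, 1)
}

// NodeCapacity is the provisioned capacity along each dimension.
type NodeCapacity struct {
	CPUSecsPerSec float64
	DiskBandwidth float64
	IOPS          float64
}

// EffectiveDemand inflates the observed rate by the time spent throttled:
// a replica queued 50% of the time could roughly double its rate if placed
// on a node with headroom. This is a crude latent-demand model.
func EffectiveDemand(observedRate, acWaitFraction float64) float64 {
	if acWaitFraction >= 1 {
		acWaitFraction = 0.99 // clamp to avoid division by zero
	}
	return observedRate / (1 - acWaitFraction)
}

// NormalizedLoad expresses a replica's throttling-adjusted demand as a
// fraction of node capacity per dimension, so dimensions stay comparable
// without being summed into one unit.
func NormalizedLoad(l ReplicaLoad, c NodeCapacity) (cpu, disk, iops float64) {
	cpu = EffectiveDemand(l.CPUSecsPerSec, l.ACWaitFraction) / c.CPUSecsPerSec
	disk = EffectiveDemand(l.DiskBandwidth, l.ACWaitFraction) / c.DiskBandwidth
	iops = EffectiveDemand(l.IOPS, l.ACWaitFraction) / c.IOPS
	return cpu, disk, iops
}

func main() {
	cap := NodeCapacity{CPUSecsPerSec: 8, DiskBandwidth: 200e6, IOPS: 5000}
	// Two replicas with identical observed rates; only b is throttled.
	a := ReplicaLoad{CPUSecsPerSec: 2, DiskBandwidth: 50e6, IOPS: 1000, ACWaitFraction: 0}
	b := ReplicaLoad{CPUSecsPerSec: 2, DiskBandwidth: 50e6, IOPS: 1000, ACWaitFraction: 0.5}
	aCPU, _, _ := NormalizedLoad(a, cap)
	bCPU, _, _ := NormalizedLoad(b, cap)
	fmt.Printf("a cpu=%.2f b cpu=%.2f\n", aCPU, bCPU) // b's latent demand is 2x a's
}
```

With a signal of this shape, the allocator can see that the two replicas are not interchangeable even though their observed rates match, which is exactly the distinction the raw '# of batch requests' unit loses.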

Additional context

This issue is a resuscitation of https://github.com/cockroachdb/cockroach/issues/34590 with more words.

Jira issue: CRDB-17098

sumeerbhola commented 2 years ago

In full agreement here -- this is similar to how I've been thinking about the problem in various previous in-person discussions with @kvoli et al. Just some slight elaboration:

irfansharif commented 2 years ago

Copying over some (edited) internal notes on what a litmus test for this system could be. We don't need to start off thinking about "how quickly the allocator responds", since that's always tunable, and work we can defer. It's more important to ensure that it's doing the right thing at all, independent of the time horizon. So the framing should be purely in terms of steady-state response and resource awareness. For a fixed workload (TPC-C) against a statically sized cluster (5 nodes, perhaps) where data isn't pinned anywhere (so the allocator has the utmost flexibility), we'll be able to find a warehouse count such that CPU and disk-bandwidth use is a substantial percentage of what's provisioned on each node or cluster-wide. The litmus test is then as follows:

This framing tells us a few things:

How quickly it does such things is good to think about after we have this basic structure in place; we just don't have it today. If we can nail this steady-state response, it becomes easier to talk about "workload shifts" and the like. Those things are harder to do unless we have these fundamental building blocks. I understand our concerns around "multi-dimensional rebalancing", so if we need to keep it to a single dimension to start (say, just CPU), then reword the KR to introduce the resource limit on just CPU. If you need it in terms of disk bandwidth, do the same there. But it needs to be a fundamental resource at some level.
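As a sketch of what the steady-state litmus test could check (hypothetical helper names; the tolerance and per-dimension inputs are assumptions, not anything the allocator exposes today): for each fundamental dimension, compute each node's utilization as a fraction of provisioned capacity, and ask whether the spread across nodes stays within a small tolerance.

```go
package main

import "fmt"

// utilizationSpread returns max-min of per-node utilization fractions for
// one resource dimension. In a well-balanced steady state we'd expect this
// spread to stay within some small tolerance.
func utilizationSpread(perNode []float64) float64 {
	if len(perNode) == 0 {
		return 0
	}
	lo, hi := perNode[0], perNode[0]
	for _, u := range perNode[1:] {
		if u < lo {
			lo = u
		}
		if u > hi {
			hi = u
		}
	}
	return hi - lo
}

// balanced reports whether every dimension's spread is within tol. Checking
// each dimension independently keeps the test meaningful even if we start
// with a single dimension (say, just CPU) and add others later.
func balanced(dims map[string][]float64, tol float64) bool {
	for _, perNode := range dims {
		if utilizationSpread(perNode) > tol {
			return false
		}
	}
	return true
}

func main() {
	// Hypothetical 5-node cluster: CPU and disk-bandwidth utilization as
	// fractions of provisioned capacity, per node.
	dims := map[string][]float64{
		"cpu":     {0.62, 0.65, 0.61, 0.64, 0.63},
		"disk-bw": {0.55, 0.58, 0.54, 0.57, 0.56},
	}
	fmt.Println(balanced(dims, 0.10)) // spreads ≈ 0.04, within tolerance
}
```

The point of framing the test this way is that it only constrains the steady state, not the convergence rate, which matches the "independent of the time horizon" framing above.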