cockroachdb / cockroach

CockroachDB — the cloud native, distributed SQL database designed for high availability, effortless scale, and control over data placement.
https://www.cockroachlabs.com

allocator,admission: consider resource utilization + throttling signals directly #83490

Open irfansharif opened 2 years ago

irfansharif commented 2 years ago

Is your feature request related to a problem? Please describe.

Allocation in CRDB is expressed in terms of an abstract '# of batch requests' unit, which as a measure can be fairly divorced from actual hardware consumption. It's difficult to tune (impossible to normalize to capacity), hard to reason about, and lends itself to awkward calibration in practice (https://github.com/cockroachdb/cockroach/pull/76252).

Describe the solution you'd like

Model allocation directly in terms of resource utilization, without collapsing the different resource dimensions (disk bandwidth, IOPS, CPU) into a single unit. Allocation should also be thought of as operating on a layer "above" admission control: AC introduces artificial delays to prevent node overload, and ignoring this throttling would leave us unable to distinguish between two replicas with an identical rate of resource use where one of them could be pushing a much higher rate were it placed elsewhere with headroom. At a high level, we should develop and use measures for:

It's worth exploring approaches for what allocation could look like when juggling discrete resource dimensions.
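To make the idea above concrete, here's a minimal sketch (not CRDB code; all type and function names are hypothetical) of what throttling-aware, per-dimension load signals could look like: demand is kept separate per resource dimension and normalized to node capacity, and the observed rate is inflated by the fraction of time requests spent queued in admission control, so a throttled replica's latent demand becomes visible.

```go
package main

import "fmt"

// ReplicaLoad is a hypothetical per-replica load signal, one entry per
// resource dimension, kept separate rather than collapsed into a single
// abstract unit.
type ReplicaLoad struct {
	CPUSecsPerSec  float64 // CPU-seconds consumed per wall second
	DiskBandwidth  float64 // bytes/sec read+written
	IOPS           float64
	ACWaitFraction float64 // fraction of wall time spent queued in AC, in [0, 1)
}

// NodeCapacity is the provisioned capacity along each dimension.
type NodeCapacity struct {
	CPUSecsPerSec float64
	DiskBandwidth float64
	IOPS          float64
}

// EffectiveDemand inflates the observed rate by the time spent throttled:
// a replica queued 50% of the time could roughly double its rate if placed
// on a node with headroom. This is a crude latent-demand model.
func EffectiveDemand(observedRate, acWaitFraction float64) float64 {
	if acWaitFraction >= 1 {
		acWaitFraction = 0.99 // clamp to avoid division by zero
	}
	return observedRate / (1 - acWaitFraction)
}

// NormalizedLoad expresses a replica's throttling-adjusted demand as a
// fraction of node capacity per dimension, so dimensions stay comparable
// without being summed into one unit.
func NormalizedLoad(l ReplicaLoad, c NodeCapacity) (cpu, disk, iops float64) {
	cpu = EffectiveDemand(l.CPUSecsPerSec, l.ACWaitFraction) / c.CPUSecsPerSec
	disk = EffectiveDemand(l.DiskBandwidth, l.ACWaitFraction) / c.DiskBandwidth
	iops = EffectiveDemand(l.IOPS, l.ACWaitFraction) / c.IOPS
	return cpu, disk, iops
}

func main() {
	cap := NodeCapacity{CPUSecsPerSec: 8, DiskBandwidth: 200e6, IOPS: 5000}
	// Two replicas with identical observed rates; only b is throttled.
	a := ReplicaLoad{CPUSecsPerSec: 2, DiskBandwidth: 50e6, IOPS: 1000, ACWaitFraction: 0}
	b := ReplicaLoad{CPUSecsPerSec: 2, DiskBandwidth: 50e6, IOPS: 1000, ACWaitFraction: 0.5}
	aCPU, _, _ := NormalizedLoad(a, cap)
	bCPU, _, _ := NormalizedLoad(b, cap)
	fmt.Printf("a cpu=%.2f b cpu=%.2f\n", aCPU, bCPU) // b's latent demand is 2x a's
}
```

With a signal of this shape, the allocator can see that the two replicas are not interchangeable even though their observed rates match, which is exactly the distinction the raw '# of batch requests' unit loses.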

Additional context

This issue is a resuscitation of https://github.com/cockroachdb/cockroach/issues/34590 with more words.

Jira issue: CRDB-17098

sumeerbhola commented 2 years ago

In full agreement here -- this is similar to how I've been thinking about the problem in various previous in-person discussions with @kvoli et al. Just some slight elaboration:

irfansharif commented 2 years ago

Copying over some (edited) internal notes on what a litmus test for this system could be. We don't need to start off thinking about "how quickly the allocator responds", since that's always tunable, and work we can defer. It's more important to ensure that it's doing the right thing at all, independent of the time horizon. So the framing should be purely in terms of steady-state response and resource awareness. For a fixed workload (TPC-C) against a statically sized cluster (5 nodes, perhaps) where data isn't pinned anywhere (so the allocator has the utmost flexibility), we'll be able to find a warehouse count such that CPU and disk-bandwidth use is a substantial percentage of what's provisioned on each node or cluster-wide. The litmus test is then as follows:

This framing tells us a few things:

How quickly it does such things is good to think about after we have this basic structure in place; we just don't have it today. If we can nail this steady-state response, it becomes easier to talk about "workload shifts" and the like. Those things are harder to do unless we have these fundamental building blocks. I understand our concerns around "multi-dimensional rebalancing", so if we need to keep it to a single dimension to start (say, just CPU), then reword the KR to introduce the resource limit on just CPU. If you need it in terms of disk bandwidth, do the same there. But it needs to be a fundamental resource at some level.
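As a sketch of what the steady-state litmus test could check (hypothetical helper names; the tolerance and per-dimension inputs are assumptions, not anything the allocator exposes today): for each fundamental dimension, compute each node's utilization as a fraction of provisioned capacity, and ask whether the spread across nodes stays within a small tolerance.

```go
package main

import "fmt"

// utilizationSpread returns max-min of per-node utilization fractions for
// one resource dimension. In a well-balanced steady state we'd expect this
// spread to stay within some small tolerance.
func utilizationSpread(perNode []float64) float64 {
	if len(perNode) == 0 {
		return 0
	}
	lo, hi := perNode[0], perNode[0]
	for _, u := range perNode[1:] {
		if u < lo {
			lo = u
		}
		if u > hi {
			hi = u
		}
	}
	return hi - lo
}

// balanced reports whether every dimension's spread is within tol. Checking
// each dimension independently keeps the test meaningful even if we start
// with a single dimension (say, just CPU) and add others later.
func balanced(dims map[string][]float64, tol float64) bool {
	for _, perNode := range dims {
		if utilizationSpread(perNode) > tol {
			return false
		}
	}
	return true
}

func main() {
	// Hypothetical 5-node cluster: CPU and disk-bandwidth utilization as
	// fractions of provisioned capacity, per node.
	dims := map[string][]float64{
		"cpu":     {0.62, 0.65, 0.61, 0.64, 0.63},
		"disk-bw": {0.55, 0.58, 0.54, 0.57, 0.56},
	}
	fmt.Println(balanced(dims, 0.10)) // spreads ≈ 0.04, within tolerance
}
```

The point of framing the test this way is that it only constrains the steady state, not the convergence rate, which matches the "independent of the time horizon" framing above.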