Open irfansharif opened 2 years ago
In full agreement here -- this is similar to how I've been thinking about the problem in previous in-person discussions with @kvoli et al. Just some slight elaboration:
Copying over some (edited) internal notes on what a litmus test for this system could be. We don't need to start off thinking about “how quickly the allocator responds”, since that's always tunable and work we can defer. It's more important to ensure it's doing the right thing at all, independent of the time horizon, so I think we should frame this purely in terms of steady-state response and resource awareness. For a fixed workload (TPC-C) against a statically sized cluster (5 nodes, perhaps) where data isn't pinned anywhere (so the allocator has the utmost flexibility), we'll be able to find a warehouse count such that CPU and disk bandwidth use is a substantial % of what's provisioned on each node or cluster-wide. The litmus test is then as follows:
This framing tells us a few things:
How quickly it does such things is good to think about after we have this basic structure in place, which we just don't today. If we can nail this steady-state response, it becomes easier to talk about “workload shifts” and the like; those things are harder to do unless we have these fundamental building blocks. I understand our concerns around "multi-dimensional rebalancing", so if we need to keep it to a single dimension to start off (say just CPU), then reword the KR to introduce the resource limit on just CPU. If you need it in terms of disk bandwidth, then do the same for that. But it needs to be a fundamental resource at some level.
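To make the single-dimension starting point concrete, here's a minimal sketch (in Go, since that's what CRDB is written in) of what a capacity-normalized CPU signal could look like. The names (`NodeCPU`, `ShouldRebalanceCPU`) and the threshold heuristic are hypothetical, not from the codebase; the point is only that the signal is a fraction of what's provisioned rather than an abstract unit.

```go
package allocatorsketch

// NodeCPU is a hypothetical per-node CPU signal, expressed against what the
// node is actually provisioned with rather than in an abstract load unit.
type NodeCPU struct {
	UsedCores        float64 // measured CPU use, in cores
	ProvisionedCores float64 // CPU the node is provisioned with, in cores
}

// Utilization normalizes use to capacity, so "80% of a 4-core node" and
// "80% of a 32-core node" compare sensibly when deciding where headroom is.
func (n NodeCPU) Utilization() float64 {
	return n.UsedCores / n.ProvisionedCores
}

// ShouldRebalanceCPU reports whether moving load from src to dst looks
// worthwhile, based on a threshold on the utilization gap. The threshold is
// exactly the kind of "how quickly do we respond" knob that can be tuned
// later, independently of this basic structure.
func ShouldRebalanceCPU(src, dst NodeCPU, minGap float64) bool {
	return src.Utilization()-dst.Utilization() > minGap
}
```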
Is your feature request related to a problem? Please describe.
Allocation in CRDB is expressed in an abstract '# of batch requests' unit, a measure that can be fairly divorced from actual hardware consumption. It's difficult to tune (impossible to normalize to capacity), difficult to reason about, and lends itself to awkward calibration in practice (https://github.com/cockroachdb/cockroach/pull/76252).
Describe the solution you'd like
Model allocation directly in terms of resource utilization, without collapsing the different resource dimensions (disk bandwidth, IOPS, CPU) into a single unit. Allocation should also be thought of as operating on a layer "above" admission control -- AC introduces artificial delays to prevent node overload, and ignoring this throttling would prevent us from distinguishing between two replicas with an identical observed rate of resource use where one of them could be pushing a much higher rate were it placed elsewhere with headroom. At a high level, we should develop and use measures for:
It's worth exploring approaches for what allocation could look like when juggling discrete resource dimensions.
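As one possible shape for that exploration, the sketch below keeps CPU, disk bandwidth, and IOPS as separate, capacity-normalized dimensions and only ever compares like with like. The types and the "most constrained dimension" heuristic are assumptions for illustration, not a proposed implementation; per-replica demand measured "above" AC would additionally need an estimate of the unthrottled rate, which is left out here.

```go
package allocatorsketch

// Dimension is a fundamental resource, kept distinct rather than collapsed
// into a single abstract unit.
type Dimension int

const (
	CPU Dimension = iota
	DiskBandwidth
	IOPS
	numDimensions
)

// Load is a per-node (or per-replica) usage vector alongside the provisioned
// capacity in each dimension.
type Load struct {
	Used        [numDimensions]float64
	Provisioned [numDimensions]float64
}

// Utilization reports usage as a fraction of capacity for one dimension.
func (l Load) Utilization(d Dimension) float64 {
	return l.Used[d] / l.Provisioned[d]
}

// MostConstrained returns the dimension closest to its provisioned limit --
// one simple way to pick which resource a rebalance should relieve first,
// without summing incomparable units.
func (l Load) MostConstrained() Dimension {
	best, bestUtil := CPU, l.Utilization(CPU)
	for d := CPU + 1; d < numDimensions; d++ {
		if u := l.Utilization(d); u > bestUtil {
			best, bestUtil = d, u
		}
	}
	return best
}
```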
Additional context
This issue is a resuscitation of https://github.com/cockroachdb/cockroach/issues/34590 with more words.
Jira issue: CRDB-17098