allocator: Enhance allocator to evenly distribute ranges for a given tenant

andy-kimball commented 2 years ago

Today, the allocator is not aware of multi-tenancy when distributing ranges across KV nodes. This means that ranges for a given tenant can "bunch up" on a single node, or a small number of nodes. That, in turn, can lead to performance bottlenecks, since we try to limit the maximum utilization of a single tenant to 20% of a KV node. An individual tenant may have hit the 20% limit, but the allocator takes no action, because the node is under-utilized from a macro point of view.

Ideally, the allocator would try to distribute each tenant's ranges evenly across KV nodes, just as it tries to evenly distribute ranges across available zones and regions.
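To make the "bunching up" concrete, here is a minimal sketch of how a per-tenant spread could be measured. It is not the allocator's actual code: the `(tenant_id, node_id)` input shape, the function names, and the `tolerance` parameter are all assumptions made for illustration.

```python
from collections import Counter


def tenant_range_counts(replicas, nodes):
    """Count ranges per node for each tenant.

    replicas: iterable of (tenant_id, node_id) pairs (hypothetical input shape).
    nodes: all KV node ids, so nodes holding zero ranges are still counted.
    """
    counts = {}
    for tenant_id, node_id in replicas:
        if tenant_id not in counts:
            counts[tenant_id] = Counter({n: 0 for n in nodes})
        counts[tenant_id][node_id] += 1
    return counts


def rebalance_hint(per_node, tolerance=1):
    """Return (source, target) if the tenant's spread exceeds tolerance, else None.

    tolerance is an assumed knob: the allowed gap in range count between
    the fullest and emptiest node for this tenant.
    """
    source = max(per_node, key=per_node.get)
    target = min(per_node, key=per_node.get)
    if per_node[source] - per_node[target] > tolerance:
        return source, target
    return None
```

For a tenant with four ranges on one node, one on a second, and none on a third, `rebalance_hint` would suggest moving a range from the first node to the third, even if the first node looks under-utilized overall.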

Jira issue: CRDB-13805

andy-kimball commented 2 years ago

CC @lunevalex

andy-kimball commented 2 years ago

There may be other solutions to this problem that we should consider. This issue is intended to be a place where we can discuss them further.

sumeerbhola commented 1 year ago

> Ideally, the allocator would try to distribute each tenant's ranges evenly across KV nodes, just as it tries to evenly distribute ranges across available zones and regions.

I think the above is insufficient. Even if we spread a tenant's ranges evenly across the nodes, the ranges that see a load spike can still be concentrated on a few nodes. If the tenant_rate_limiter starts throttling on those nodes and the allocator does nothing, that is a problem.
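A small sketch of this point, with made-up numbers: the tenant below has the same number of ranges on every node, yet almost all of its load lands on one node. The `(node_id, load)` input shape and the load units are assumptions for illustration only.

```python
def per_node_totals(ranges):
    """Sum range count and load per node for a single tenant.

    ranges: list of (node_id, load) pairs, one per range (hypothetical shape;
    load could be QPS or any other per-range load signal).
    """
    counts, load = {}, {}
    for node_id, range_load in ranges:
        counts[node_id] = counts.get(node_id, 0) + 1
        load[node_id] = load.get(node_id, 0) + range_load
    return counts, load


# Two ranges per node, but both hot ranges happen to live on n1.
ranges = [("n1", 900), ("n1", 800), ("n2", 10), ("n2", 20), ("n3", 15), ("n3", 5)]
counts, load = per_node_totals(ranges)
```

Here `counts` is perfectly even while `load` is heavily skewed toward `n1`, so balancing on range count alone would report nothing to do.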

I suppose this will usually work out, since the allocator tries to achieve even CPU usage (though other resources, such as store write bandwidth, are not considered). But we could get unlucky: the node may have been recently commissioned and so was at 10% utilization, the surging tenant has now increased that to 30%, and the tenant is being throttled. Since the mean across the nodes is 50%, the allocator will not shed load from this node. I think we have 2 options:

1. The allocator explicitly takes tenant_rate_limiter signals into account and moves the affected tenant's ranges.
2. The tenant_rate_limiter does not throttle the tenant when the node's resource utilization is not high (or not much higher than the cluster mean). Then the natural allocator rebalancing mechanism, along with provisioning that keeps the mean at a reasonable value, is sufficient.
   - Admission control is an alternative to the tenant_rate_limiter, but AC thresholds are tuned to permit resource saturation, so they don't provide good latency isolation between tenants. They could be tuned lower, though, and potentially still achieve ~70% resource utilization.

IMO, the second option is preferable.
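The unlucky scenario and the second option can be sketched as follows. This is a simplified model, not the real allocator or tenant_rate_limiter logic: the function names and the `margin`, `high_util`, and `mean_slack` thresholds are assumptions chosen for illustration. Only the 20% per-tenant limit and the 10%/30%/50% figures come from the discussion above.

```python
TENANT_LIMIT = 0.20  # per-tenant cap on a KV node, from the issue description


def allocator_would_shed(node_util, cluster_mean, margin=0.05):
    """Simplified model: shed load only if the node is above the cluster mean.

    margin is an assumed threshold, not a real allocator setting.
    """
    return node_util > cluster_mean + margin


def throttle_today(tenant_util):
    """Throttle whenever the tenant reaches its per-node limit."""
    return tenant_util >= TENANT_LIMIT


def throttle_option2(tenant_util, node_util, cluster_mean,
                     high_util=0.70, mean_slack=0.10):
    """Option 2: throttle only if the node is busy or well above the mean.

    high_util and mean_slack are assumed values for illustration.
    """
    if tenant_util < TENANT_LIMIT:
        return False
    return node_util >= high_util or node_util > cluster_mean + mean_slack


# Recently commissioned node: was at 10%, the surging tenant adds 20%.
node_util, cluster_mean, tenant_util = 0.30, 0.50, 0.20
stuck = throttle_today(tenant_util) and not allocator_would_shed(node_util, cluster_mean)
relaxed = throttle_option2(tenant_util, node_util, cluster_mean)
```

In this model `stuck` is true (the tenant is throttled and the allocator does nothing), while `relaxed` is false, so under option 2 the tenant keeps running until the node itself becomes busy, at which point the allocator's normal mean-based rebalancing also applies.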

cockroachdb / cockroach

allocator: Enhance allocator to evenly distribute ranges for a given tenant #77869