cockroachdb / cockroach

CockroachDB — the cloud native, distributed SQL database designed for high availability, effortless scale, and control over data placement.
https://www.cockroachlabs.com

kv,bulkio: throttle per-store column/index backfill requests #82556

Closed by irfansharif 1 year ago

irfansharif commented 2 years ago

Is your feature request related to a problem? Please describe.

Using this issue to track the general case of index/column backfill induced performance impact.

In support escalations (https://github.com/cockroachlabs/support/issues/1628) we've observed that a column backfill for a large table was able to consume all available disk write bandwidth on stores (capping out at 150 MB/s in the graph below, which is what the store was provisioned with), starving foreground requests on those stores. The bandwidth saturation led to log commit p99 latencies on the order of seconds (see graph below).

[Graphs omitted: store disk write bandwidth and log commit p99 latency]

In internal experimentation (#admission-control) we've also observed throughput/latency effects due to aggressive follower write activity.

Describe the solution you'd like

Disbursing byte-sized IO tokens over time for requests serving these large, long-running bulk operations, which controls how much bandwidth background operations use and ensures foreground traffic has available capacity. Or something simpler (+backportable) shorter term that aims for a bandwidth target and paces incoming batch requests accordingly. Or perhaps introducing simpler client-side knobs to control the rate at which we issue these requests to KV.

Additional context

Relates broadly to https://github.com/cockroachdb/cockroach/issues/75066 + https://github.com/cockroachdb/cockroach/issues/79092. Unclear if addressed by https://github.com/cockroachdb/cockroach/pull/82440, need a repro.

Jira issue: CRDB-16542

nvanbenschoten commented 2 years ago

> Or something simpler (+backportable) shorter term that aims for a bandwidth target and paces incoming batch requests accordingly.

@dt suggested a fairly straightforward short-term protection here in https://cockroachlabs.slack.com/archives/C03JUKU58F3/p1654640595999019. We may want to add such a throttling knob for each admissionpb.WorkPriority, as others like TTLLowPri might benefit from it.

@andrewbaptist made the point that these kinds of throttling mechanisms are good to have even after more sophisticated pacing has landed. They can default to an unlimited rate, but they're often a valuable tool when things go wrong and an operator wants to control the system more directly.

irfansharif commented 2 years ago

> @dt suggested a fairly straightforward short-term protection here in

For posterity:

func (s *Store) maybeThrottleBatch(
    ctx context.Context, ba roachpb.BatchRequest,
) (limit.Reservation, error) {
    ...
        // Pace bulk-priority batches against a rate limiter, charging each
        // batch its total size.
        if ba.AdmissionHeader.Priority == int32(admissionpb.BulkNormalPri) {
            before := timeutil.Now()
            if err := s.limiters.BulkBatchRate.WaitN(ctx, ba.TotalSize()); err != nil {
                return nil, err
            }
            _ = timeutil.Since(before)
            // TODO: record the wait duration as a metric and log long waits.
        }
    ...
}

dt commented 2 years ago

> We may want to add such a throttling knob for each admissionpb.WorkPriority, as others like TTLLowPri might benefit from it.

Maybe one setting for "what is the priority threshold that should be subject to the low-pri limiter?" and then anything at/below that gets tossed at a single low-pri batch size limiter? I could see lots of separate individual limits being very hard to set correctly to add up to just the right number.
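Roughly, the single-threshold idea could be sketched like this. The names are hypothetical (`lowPriThrottle`, `threshold`, and the local `workPriority` constants are stand-ins, not CockroachDB identifiers), and a fixed per-interval byte budget stands in for whatever limiter would actually be used.

```go
package main

import "fmt"

// workPriority is a local stand-in for admissionpb.WorkPriority; the values
// below are illustrative only.
type workPriority int8

const (
	ttlLowPri     workPriority = -100
	bulkNormalPri workPriority = -30
	normalPri     workPriority = 0
)

// lowPriThrottle applies one shared byte budget to every priority at or
// below threshold. Work above the threshold bypasses it entirely.
type lowPriThrottle struct {
	threshold workPriority // hypothetical setting: highest priority still throttled
	budget    int64        // bytes left in the current interval, shared by all throttled work
}

// admit reports whether a batch may proceed now, deducting from the shared
// budget when the batch is subject to throttling.
func (t *lowPriThrottle) admit(pri workPriority, batchBytes int64) bool {
	if pri > t.threshold {
		return true
	}
	if batchBytes > t.budget {
		return false
	}
	t.budget -= batchBytes
	return true
}

// refill resets the shared budget at the start of a new interval.
func (t *lowPriThrottle) refill(bytes int64) { t.budget = bytes }

func main() {
	t := &lowPriThrottle{threshold: bulkNormalPri}
	t.refill(10 << 20) // illustrative: 10 MiB per interval for all low-pri work
	fmt.Println(t.admit(normalPri, 64<<20))    // above the threshold, bypasses the limiter
	fmt.Println(t.admit(ttlLowPri, 6<<20))     // draws from the shared budget
	fmt.Println(t.admit(bulkNormalPri, 6<<20)) // competes for the same budget
}
```

Because all throttled priorities draw from one budget, there is only one number to tune, at the cost of not being able to favor one low-priority class over another.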

irfansharif commented 2 years ago

Lowering bulkio.column_backfill.batch_size works pretty effectively as a short-term mitigation. I'm going to leave this issue open for the more general per-store pacing of backfills through admission control (+cc https://github.com/cockroachdb/cockroach/pull/82813).

irfansharif commented 2 years ago

x-linking https://github.com/cockroachdb/cockroach/issues/85641 + https://github.com/cockroachdb/cockroach/issues/83826 + #73979.

irfansharif commented 1 year ago

https://github.com/cockroachdb/cockroach/issues/95563 is the tracking issue for admission control work.

irfansharif commented 1 year ago

https://github.com/cockroachdb/cockroach/issues/95563 is done.