Closed irfansharif closed 1 year ago
> Or something simpler (+backportable) shorter term that aims for a bandwidth target and paces incoming batch requests accordingly.
@dt suggested a fairly straightforward short-term protection here in https://cockroachlabs.slack.com/archives/C03JUKU58F3/p1654640595999019. We may want to add such a throttling knob for each admissionpb.WorkPriority, as others like TTLLowPri might benefit from it.
@andrewbaptist made the point that these kinds of throttling mechanisms are good to have even after more sophisticated pacing has landed. They can default to an unlimited rate, but they're often a valuable tool when things go wrong and an operator wants to control the system more directly.
For posterity:
```go
func (s *Store) maybeThrottleBatch(
	ctx context.Context, ba roachpb.BatchRequest,
) (limit.Reservation, error) {
	...
	if ba.AdmissionHeader.Priority == int32(admissionpb.BulkNormalPri) {
		before := timeutil.Now()
		if err := s.limiters.BulkBatchRate.WaitN(ctx, ba.TotalSize()); err != nil {
			return nil, err
		}
		_ = timeutil.Since(before)
		// TODO: collect a metric, log long waits.
	}
	...
}
```
Maybe one setting for "what is the priority threshold that should be subject to the low-pri limiter?" and then anything at/below that gets tossed at a single low-pri batch size limiter? I could see lots of separate individual limits being very hard to set correctly to add up to just the right number.
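To make the single-threshold idea concrete, here's a minimal sketch. All names (`WorkPriority` values, `throttleConfig`, `subjectToLowPriLimiter`) are illustrative stand-ins, not CockroachDB's actual constants or API: one configured priority threshold decides which requests share the single low-pri limiter, instead of a separate limit per `admissionpb.WorkPriority`.

```go
package main

import "fmt"

// WorkPriority mirrors the shape of admissionpb.WorkPriority for
// illustration; the constant values below are made up.
type WorkPriority int32

const (
	TTLLowPri     WorkPriority = -30
	BulkNormalPri WorkPriority = -20
	NormalPri     WorkPriority = 0
)

// throttleConfig sketches the single-setting approach: one priority
// threshold plus one shared low-priority limiter behind it.
type throttleConfig struct {
	// lowPriThreshold: work at or below this priority is subject to
	// the shared low-pri batch size limiter.
	lowPriThreshold WorkPriority
}

// subjectToLowPriLimiter reports whether a batch at the given priority
// should wait on the shared low-priority limiter.
func (c throttleConfig) subjectToLowPriLimiter(pri WorkPriority) bool {
	return pri <= c.lowPriThreshold
}

func main() {
	cfg := throttleConfig{lowPriThreshold: BulkNormalPri}
	for _, pri := range []WorkPriority{TTLLowPri, BulkNormalPri, NormalPri} {
		fmt.Printf("pri=%d throttled=%t\n", pri, cfg.subjectToLowPriLimiter(pri))
	}
}
```

With this shape there is exactly one knob to size (the shared limiter's rate) and one to aim (the threshold), which sidesteps the problem of many per-priority limits having to sum to the right number.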
Lowering bulkio.column_backfill.batch_size works pretty effectively as a short-term mitigation. I'm going to leave this issue open for the more general per-store pacing of backfills through admission control (+cc https://github.com/cockroachdb/cockroach/pull/82813).
https://github.com/cockroachdb/cockroach/issues/95563 is the tracking issue for admission control work.
Is your feature request related to a problem? Please describe.
Using this issue to track the general case of index/column backfill induced performance impact.
In support escalations (https://github.com/cockroachlabs/support/issues/1628) we've observed that column backfills for a large table were able to consume all available disk write bandwidth on stores (capping out at 150 MB/s in the graph below, which is what the store was provisioned with), resulting in starvation for foreground requests on those stores. The bandwidth saturation led to log commit p99s on the order of seconds (see graph below).
In internal experimentation (#admission-control) we've also observed throughput/latency effects due to aggressive follower write activity.
Describe the solution you'd like
Disbursing byte-sized IO tokens over time for requests serving these large, long-running bulk operations, controlling how much bandwidth background operations use and ensuring foreground traffic has capacity available. Or something simpler (+backportable) shorter term that aims for a bandwidth target and paces incoming batch requests accordingly. Or perhaps introducing simpler client-side knobs to control the rates at which we issue these requests to KV.
Additional context
Relates broadly to https://github.com/cockroachdb/cockroach/issues/75066 + https://github.com/cockroachdb/cockroach/issues/79092. Unclear if addressed by https://github.com/cockroachdb/cockroach/pull/82440, need a repro.
Jira issue: CRDB-16542