Open irfansharif opened 1 year ago
@irfansharif The graphs below are runnable goroutines, `used_slots`, and `total_slots`, in that order. AC does initially increase `total_slots`, but when the runnable count starts becoming too high, it reduces `total_slots`. Even after the reduction, `total_slots` is comparable to that of other nodes. The shape of the runnable spike matches the `used_slots` spike, which suggests that it is the bypassed work that is runnable -- meaning some amount of the bypassed work is not stuck on latches.
We could potentially change the AC code such that bypassed work does not consume slots. It will still result in high runnable counts, and have a secondary effect on `total_slots`. This can be bad since we do decrement `total_slots` when `usedSlots <= total`:

https://github.com/cockroachdb/cockroach/blob/f9a99a9fa018a20fa7d8a66f1d670610495d1fb2/pkg/util/admission/kv_slot_adjuster.go#L71-L82

and currently that effect is contained since `usedSlots` is high. One possibility is to have `usedSlots` count only non-bypassed work and use that to grant, but for the `total_slots` adjustment in `kvSlotAdjuster` use `usedSlots + bypassedSlots <= total`. This should be an easy (and back-portable) change; see the sketch below.
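To make the proposal concrete, here is a minimal Go sketch of the suggested accounting. It assumes a hypothetical `bypassedSlots` counter alongside `usedSlots`; none of these names are the actual `kvSlotAdjuster` fields:

```go
package admissionsketch

// slotAdjuster is an illustrative stand-in for kvSlotAdjuster, not the
// real type. usedSlots would count only admitted (non-bypassed) work;
// bypassedSlots is the hypothetical counter for AC-bypassing work.
type slotAdjuster struct {
	totalSlots    int
	usedSlots     int // slots held by admitted (non-bypassed) work
	bypassedSlots int // hypothetical: slots held by AC-bypassing work
	minSlots      int
}

// canGrant consults only non-bypassed work, so a burst of bypassed work
// no longer blocks admission directly.
func (a *slotAdjuster) canGrant() bool {
	return a.usedSlots < a.totalSlots
}

// maybeDecreaseSlots mirrors the existing decrement condition
// (usedSlots <= total) but adds bypassedSlots, so that CPU pressure
// from bypassed work still shrinks totalSlots when the runnable count
// is too high.
func (a *slotAdjuster) maybeDecreaseSlots(runnableTooHigh bool) {
	if runnableTooHigh &&
		a.usedSlots+a.bypassedSlots <= a.totalSlots &&
		a.totalSlots > a.minSlots {
		a.totalSlots--
	}
}
```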
A reproduction should probably precede such a change, and should be repeated after we merge https://github.com/cockroachdb/cockroach/pull/96511.
> We could potentially change the AC code such that bypassed work does not consume slots.
Note that this suggestion (ignoring the resource consumption of bypassing work) conflicts with the idea in https://github.com/cockroachlabs/support/issues/2102 to subject intent resolution to admission control (in that case it was IO admission control, but the idea generally applies to any resource). It would be easier to make a decision if we knew whether this bypassing work was potentially expensive or not. With IO admission control we are shaping the rate, so we know if the work turned out to be expensive. With CPU slots, we do have CPU consumption measurements now, but don't use them; a sketch of such measurement-based accounting follows below.
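For illustration, a hypothetical sketch of what using those CPU measurements could look like: a tracker that accumulates the measured CPU time of bypassed work, so a policy could check whether bypassed work is in fact cheap before deciding to ignore it in slot accounting. All names here are assumptions, not existing code:

```go
package admissionsketch

import "time"

// bypassTracker is hypothetical: it accumulates the measured CPU time
// of AC-bypassing work items.
type bypassTracker struct {
	cpuNanos int64 // cumulative CPU time consumed by bypassed work
	count    int64 // number of bypassed work items
}

// record is called when a bypassed work item completes; cpuTime is
// assumed to come from a goroutine running-time measurement (the kind
// the comment above says exists for CPU slots but is unused here).
func (t *bypassTracker) record(cpuTime time.Duration) {
	t.cpuNanos += cpuTime.Nanoseconds()
	t.count++
}

// meanCPU reports the average per-item cost, which a policy could
// compare against a threshold to decide whether bypassed work is cheap
// enough to keep ignoring in slot accounting.
func (t *bypassTracker) meanCPU() time.Duration {
	if t.count == 0 {
		return 0
	}
	return time.Duration(t.cpuNanos / t.count)
}
```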
**Describe the problem**
During txn conflicts/deadlock detection, we issue PushTxns^1 and QueryTxns^2 as requests that bypass AC (note the missing AC headers in the references below). Bypassing work still consumes AC slots/tokens^3, so it can completely starve out all other work that integrates with AC. This can contribute to amplified latencies during txn contention due to excessive AC queueing time. We saw this in an internal support case, where disabling AC got rid of the ~1s mean latencies we were adding to each request: https://github.com/cockroachlabs/support/issues/2160. There are other possible causes for AC to introduce high latencies during txn contention, especially when a large txn contends with many smaller ones. A simplified sketch of the starvation mechanism follows.
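The sketch below uses made-up types, not the actual `admission` package code: bypassed work skips queueing entirely but still consumes a slot, so a burst of it can pin used slots at the total and leave admitted work waiting behind it:

```go
package admissionsketch

import "context"

// Illustrative stand-ins, not the actual admission package's types.
type granter struct{ used, total int }

func (g *granter) takeSlot() { g.used++ }

type workInfo struct {
	bypass bool // e.g. PushTxn/QueryTxn issued without AC headers
}

type workQueue struct{ g *granter }

// Admit sketches the starvation mechanism: bypassed work is never
// queued, yet it still takes a slot on the way through.
func (q *workQueue) Admit(ctx context.Context, info workInfo) error {
	if info.bypass {
		q.g.takeSlot() // no waiting, but the slot is still consumed
		return nil
	}
	// Non-bypassed work must wait for a free slot; none frees up while
	// bypassed work holds them all, inflating queueing latencies.
	if err := q.waitForGrant(ctx); err != nil {
		return err
	}
	q.g.takeSlot()
	return nil
}

// waitForGrant stands in for the real priority queue: it just blocks
// until a slot is available or the context is canceled.
func (q *workQueue) waitForGrant(ctx context.Context) error {
	for q.g.used >= q.g.total {
		select {
		case <-ctx.Done():
			return ctx.Err()
		default: // busy-wait purely for illustration
		}
	}
	return nil
}
```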
From internal discussion: Was AC making these txn deadlocks worse by (a) slowing down txns which increases the possibility of a timing-related locking deadlock, and (b) slowing down distributed deadlock detection itself?
**To Reproduce**
Run a small workload with inverted lock ordering to run into this deadlock scenario, and observe AC-bypassed QueryTxn/PushTxn work starving out foreground work. A minimal sketch of such a workload follows.
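The sketch below assumes a local single-node cluster and a simple `kv` table; the connection string, table, and key choices are all illustrative. Half the goroutines update row 1 then row 2, the other half the reverse, so some pairs of txns deadlock and must be resolved via PushTxn/QueryTxn:

```go
package main

import (
	"database/sql"
	"log"
	"sync"

	_ "github.com/lib/pq"
)

// update runs one txn that writes the two rows in the given order.
func update(db *sql.DB, first, second int) {
	tx, err := db.Begin()
	if err != nil {
		log.Print(err)
		return
	}
	defer tx.Rollback() // no-op if Commit succeeds
	for _, k := range []int{first, second} {
		if _, err := tx.Exec(`UPDATE kv SET v = v + 1 WHERE k = $1`, k); err != nil {
			log.Print(err) // deadlock victims surface as txn retry errors
			return
		}
	}
	_ = tx.Commit()
}

func main() {
	// Connection string is an assumption (local single-node cluster).
	db, err := sql.Open("postgres",
		"postgresql://root@localhost:26257/defaultdb?sslmode=disable")
	if err != nil {
		log.Fatal(err)
	}
	if _, err := db.Exec(`CREATE TABLE IF NOT EXISTS kv (k INT PRIMARY KEY, v INT)`); err != nil {
		log.Fatal(err)
	}
	if _, err := db.Exec(`UPSERT INTO kv VALUES (1, 0), (2, 0)`); err != nil {
		log.Fatal(err)
	}
	var wg sync.WaitGroup
	for i := 0; i < 64; i++ {
		forward := i%2 == 0
		wg.Add(1)
		go func() {
			defer wg.Done()
			for j := 0; j < 1000; j++ {
				if forward {
					update(db, 1, 2) // row 1, then row 2
				} else {
					update(db, 2, 1) // inverted: row 2, then row 1
				}
			}
		}()
	}
	wg.Wait()
}
```

Running this while watching runnable goroutines, `used_slots`, and AC queueing delay should show the bypassed PushTxn/QueryTxn traffic crowding out the admitted statements.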
Jira issue: CRDB-25786
Epic: CRDB-25469