Open irfansharif opened 1 year ago
@irfansharif The graphs below are runnable goroutines, `used_slots`, and `total_slots`, in that order. AC does initially increase `total_slots`, but when the runnable count starts becoming too high, it reduces `total_slots`. Even after the reduction, `total_slots` is comparable to that of other nodes. The shape of the runnable spike matches the `used_slots` spike, which suggests that it is the bypassed work that is runnable -- meaning some amount of the bypassed work is not stuck on latches.
We could potentially change the AC code such that bypassed work does not consume slots. It will still result in high runnable counts, and have a secondary effect on `total_slots`. This can be bad since we do decrement `total_slots` when `usedSlots <= total`:

https://github.com/cockroachdb/cockroach/blob/f9a99a9fa018a20fa7d8a66f1d670610495d1fb2/pkg/util/admission/kv_slot_adjuster.go#L71-L82

and currently that effect is contained since `usedSlots` is high. One possibility is to have `usedSlots` count only non-bypassed work and use that to grant, but for the `total_slots` adjustment in `kvSlotAdjuster` use `usedSlots + bypassedSlots <= total`. This should be an easy (and back-portable) change; see the sketch below.
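To make the proposal concrete, here is a minimal Go sketch of the suggested accounting. It assumes a hypothetical `bypassedSlots` counter alongside `usedSlots`; none of these names are the actual `kvSlotAdjuster` fields:

```go
package admissionsketch

// slotAdjuster is an illustrative stand-in for kvSlotAdjuster, not the
// real type. usedSlots would count only admitted (non-bypassed) work;
// bypassedSlots is the hypothetical counter for AC-bypassing work.
type slotAdjuster struct {
	totalSlots    int
	usedSlots     int // slots held by admitted (non-bypassed) work
	bypassedSlots int // hypothetical: slots held by AC-bypassing work
	minSlots      int
}

// canGrant consults only non-bypassed work, so a burst of bypassed work
// no longer blocks admission directly.
func (a *slotAdjuster) canGrant() bool {
	return a.usedSlots < a.totalSlots
}

// maybeDecreaseSlots mirrors the existing decrement condition
// (usedSlots <= total) but adds bypassedSlots, so that CPU pressure
// from bypassed work still shrinks totalSlots when the runnable count
// is too high.
func (a *slotAdjuster) maybeDecreaseSlots(runnableTooHigh bool) {
	if runnableTooHigh &&
		a.usedSlots+a.bypassedSlots <= a.totalSlots &&
		a.totalSlots > a.minSlots {
		a.totalSlots--
	}
}
```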
A reproduction should probably precede such a change, and should be repeated after we merge https://github.com/cockroachdb/cockroach/pull/96511.
> We could potentially change the AC code such that bypassed work does not consume slots.
Note that this suggestion (ignoring the resource consumption of bypassing work) conflicts with the idea in https://github.com/cockroachlabs/support/issues/2102 to subject intent resolution to admission control (in that case it was IO admission control, but the idea generally applies to any resource). It would be easier to make a decision if we knew whether this bypassing work was potentially expensive or not. With IO admission control we are shaping the rate, so we know if the work turned out to be expensive. With CPU slots, we do have CPU consumption measurements now, but don't use them; a sketch of such measurement-based accounting follows below.
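For illustration, a hypothetical sketch of what using those CPU measurements could look like: a tracker that accumulates the measured CPU time of bypassed work, so a policy could check whether bypassed work is in fact cheap before deciding to ignore it in slot accounting. All names here are assumptions, not existing code:

```go
package admissionsketch

import "time"

// bypassTracker is hypothetical: it accumulates the measured CPU time
// of AC-bypassing work items.
type bypassTracker struct {
	cpuNanos int64 // cumulative CPU time consumed by bypassed work
	count    int64 // number of bypassed work items
}

// record is called when a bypassed work item completes; cpuTime is
// assumed to come from a goroutine running-time measurement (the kind
// the comment above says exists for CPU slots but is unused here).
func (t *bypassTracker) record(cpuTime time.Duration) {
	t.cpuNanos += cpuTime.Nanoseconds()
	t.count++
}

// meanCPU reports the average per-item cost, which a policy could
// compare against a threshold to decide whether bypassed work is cheap
// enough to keep ignoring in slot accounting.
func (t *bypassTracker) meanCPU() time.Duration {
	if t.count == 0 {
		return 0
	}
	return time.Duration(t.cpuNanos / t.count)
}
```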
**Describe the problem**
During txn conflicts/deadlock detection, we issue PushTxns^1 and QueryTxns^2 as requests that bypass AC (note the missing AC headers in the references below). Bypassing work still consumes AC slots/tokens^3, so it can completely starve out all other work that integrates with AC. This can contribute to amplified latencies during txn contention due to excessive AC queueing time. We saw this in an internal support case, where disabling AC got rid of the ~1s mean latencies we were adding to each request: https://github.com/cockroachlabs/support/issues/2160. There are other possible causes for AC to introduce high latencies during txn contention, especially when a large txn contends with many smaller ones. A simplified sketch of the starvation mechanism follows.
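The sketch below uses made-up types, not the actual `admission` package code: bypassed work skips queueing entirely but still consumes a slot, so a burst of it can pin used slots at the total and leave admitted work waiting behind it:

```go
package admissionsketch

import "context"

// Illustrative stand-ins, not the actual admission package's types.
type granter struct{ used, total int }

func (g *granter) takeSlot() { g.used++ }

type workInfo struct {
	bypass bool // e.g. PushTxn/QueryTxn issued without AC headers
}

type workQueue struct{ g *granter }

// Admit sketches the starvation mechanism: bypassed work is never
// queued, yet it still takes a slot on the way through.
func (q *workQueue) Admit(ctx context.Context, info workInfo) error {
	if info.bypass {
		q.g.takeSlot() // no waiting, but the slot is still consumed
		return nil
	}
	// Non-bypassed work must wait for a free slot; none frees up while
	// bypassed work holds them all, inflating queueing latencies.
	if err := q.waitForGrant(ctx); err != nil {
		return err
	}
	q.g.takeSlot()
	return nil
}

// waitForGrant stands in for the real priority queue: it just blocks
// until a slot is available or the context is canceled.
func (q *workQueue) waitForGrant(ctx context.Context) error {
	for q.g.used >= q.g.total {
		select {
		case <-ctx.Done():
			return ctx.Err()
		default: // busy-wait purely for illustration
		}
	}
	return nil
}
```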
From internal discussion: Was AC making these txn deadlocks worse by (a) slowing down txns which increases the possibility of a timing-related locking deadlock, and (b) slowing down distributed deadlock detection itself?
**To Reproduce**
Run a small workload with inverted lock ordering to run into this deadlock scenario, and observe AC-bypassed QueryTxn/PushTxn work starving out foreground work. A minimal sketch of such a workload follows.
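The sketch below assumes a local single-node cluster and a simple `kv` table; the connection string, table, and key choices are all illustrative. Half the goroutines update row 1 then row 2, the other half the reverse, so some pairs of txns deadlock and must be resolved via PushTxn/QueryTxn:

```go
package main

import (
	"database/sql"
	"log"
	"sync"

	_ "github.com/lib/pq"
)

// update runs one txn that writes the two rows in the given order.
func update(db *sql.DB, first, second int) {
	tx, err := db.Begin()
	if err != nil {
		log.Print(err)
		return
	}
	defer tx.Rollback() // no-op if Commit succeeds
	for _, k := range []int{first, second} {
		if _, err := tx.Exec(`UPDATE kv SET v = v + 1 WHERE k = $1`, k); err != nil {
			log.Print(err) // deadlock victims surface as txn retry errors
			return
		}
	}
	_ = tx.Commit()
}

func main() {
	// Connection string is an assumption (local single-node cluster).
	db, err := sql.Open("postgres",
		"postgresql://root@localhost:26257/defaultdb?sslmode=disable")
	if err != nil {
		log.Fatal(err)
	}
	if _, err := db.Exec(`CREATE TABLE IF NOT EXISTS kv (k INT PRIMARY KEY, v INT)`); err != nil {
		log.Fatal(err)
	}
	if _, err := db.Exec(`UPSERT INTO kv VALUES (1, 0), (2, 0)`); err != nil {
		log.Fatal(err)
	}
	var wg sync.WaitGroup
	for i := 0; i < 64; i++ {
		forward := i%2 == 0
		wg.Add(1)
		go func() {
			defer wg.Done()
			for j := 0; j < 1000; j++ {
				if forward {
					update(db, 1, 2) // row 1, then row 2
				} else {
					update(db, 2, 1) // inverted: row 2, then row 1
				}
			}
		}()
	}
	wg.Wait()
}
```

Running this while watching runnable goroutines, `used_slots`, and AC queueing delay should show the bypassed PushTxn/QueryTxn traffic crowding out the admitted statements.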
Jira issue: CRDB-25786
Epic: CRDB-25469