Currently, a caller sending KV requests to the KV server more or less only knows how long it waits for the response. If 30% of that time was spent waiting because admission control (AC) queued the request for CPU or IO resources, that is not visible to the caller. When a caller like BACKUP or LDR issues many such requests, it may appear slow to the user who ran it or is observing it, but the job currently has no mechanism to determine that it is being slowed by a specific resource capacity constraint, or to communicate that to a user who could act on it.
The user can observe the overall cluster overload metrics, which may hint that something is being delayed by CPU or IO overload, but these metrics describe all work across all nodes, which makes the effect on any specific job harder to determine. These metrics are also grouped by the node delaying the work, rather than by the work being delayed, further obscuring the relationship between the metrics and any specific job.
Of course, we can also collect traces of specific requests to follow where they spend time, including in queues, in detail. However, this too is not a good fit for an operation like a job that sends thousands of requests: tracing the entire execution of every request just to learn the aggregate impact of CPU or IO limiting is prohibitively expensive and produces a vast amount of trace information that is not useful or relevant to the job unless a user is actively tracing it.
Instead, pulling out two or perhaps three broad categories in which a request may be delayed by user-controllable resources, such as CPU, IO, and perhaps contention/latching, and passing each back to the caller as a simple duration to aggregate, could allow jobs and other user-facing operations to present clear, user-actionable messages directly to the user when they interact with the job or operation.
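The sketch below illustrates what this could look like on the caller side, assuming the KV response carried a small per-category delay breakdown. All names here (DelayBreakdown, JobDelayAggregator, etc.) are hypothetical illustrations, not existing APIs; the point is only that a few aggregated durations are enough to produce a user-facing message.

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

// DelayBreakdown is a hypothetical per-request record of time spent
// waiting on user-controllable resources on the server side.
type DelayBreakdown struct {
	CPUQueueWait   time.Duration // time queued in admission control for CPU
	IOQueueWait    time.Duration // time queued in admission control for store IO
	ContentionWait time.Duration // time waiting on latches/locks (possible third category)
}

// JobDelayAggregator accumulates breakdowns across the many requests a
// job (e.g. BACKUP or LDR) issues, so the job can report in aggregate.
type JobDelayAggregator struct {
	mu    sync.Mutex
	total DelayBreakdown
	n     int
}

// Record adds the breakdown returned with one response to the running totals.
func (a *JobDelayAggregator) Record(d DelayBreakdown) {
	a.mu.Lock()
	defer a.mu.Unlock()
	a.total.CPUQueueWait += d.CPUQueueWait
	a.total.IOQueueWait += d.IOQueueWait
	a.total.ContentionWait += d.ContentionWait
	a.n++
}

// Status renders a user-facing message, e.g. something a job could surface
// when a meaningful share of elapsed time was spent waiting on a resource.
func (a *JobDelayAggregator) Status(elapsed time.Duration) string {
	a.mu.Lock()
	defer a.mu.Unlock()
	if elapsed <= 0 || a.n == 0 {
		return "no delay data"
	}
	cpuPct := float64(a.total.CPUQueueWait) / float64(elapsed) * 100
	ioPct := float64(a.total.IOQueueWait) / float64(elapsed) * 100
	switch {
	case ioPct >= 30:
		return fmt.Sprintf("slowed by IO admission queuing (%.0f%% of elapsed time)", ioPct)
	case cpuPct >= 30:
		return fmt.Sprintf("slowed by CPU admission queuing (%.0f%% of elapsed time)", cpuPct)
	default:
		return "not significantly resource-constrained"
	}
}

func main() {
	agg := &JobDelayAggregator{}
	// Simulate breakdowns returned with two responses; in reality these
	// would arrive with each KV response the job receives.
	agg.Record(DelayBreakdown{IOQueueWait: 400 * time.Millisecond})
	agg.Record(DelayBreakdown{IOQueueWait: 300 * time.Millisecond, CPUQueueWait: 50 * time.Millisecond})
	fmt.Println(agg.Status(2 * time.Second)) // slowed by IO admission queuing (35% of elapsed time)
}
```

The caller-side cost is intentionally tiny: a handful of counters per job, rather than per-request traces, which is what makes it viable to keep enabled all the time.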
Jira issue: CRDB-44335