Open jeffswenson opened 7 months ago
cc @sumeerbhola as FYI
External process deployments can work around this by disabling the KV-SQL and SQL-SQL queues for the system virtual cluster.
Is this convenient enough?
Without fixing https://github.com/cockroachdb/cockroach/issues/85471 (which is hard, though there is a promising approach in https://github.com/cockroachdb/cockroach/issues/91536#issuecomment-1753633497), any "fix" on the admission control (AC) side would also amount to the system tenant bypassing AC for SQLKVResponseWork and SQLSQLResponseWork.
Is this convenient enough?
Yeah, I think this is fine for Serverless. We basically need to run the following commands on each host cluster.
-- set overrides so that admission control remains enabled within the tenant sql servers when
-- we disable it on the host.
ALTER TENANT ALL SET CLUSTER SETTING admission.sql_kv_response.enabled = true;
ALTER TENANT ALL SET CLUSTER SETTING admission.sql_sql_response.enabled = true;
-- disable sql admission control for the host cluster.
SET CLUSTER SETTING admission.sql_kv_response.enabled = false;
SET CLUSTER SETTING admission.sql_sql_response.enabled = false;
Here's a task to track rolling out the mitigation to Cockroach Cloud: https://cockroachlabs.atlassian.net/browse/CRDB-36207.
Describe the problem
Admission control is managed via several distinct queues:
KV: work admitted at the KV layer
KV-SQL: response transferred from the KV layer to the SQL layer
SQL-SQL: response transferred from a distsql leaf to a distsql root
When the system is overloaded, tokens are allocated with preference to the lowest-level queue. That is, if the KV queue contains pending work, no tokens are distributed to the KV-SQL or SQL-SQL queues. This works because starving the SQL queues eventually reduces the submission rate to the KV queue, which in turn allows the system to distribute tokens to the SQL queues again.
In an external process deployment of CRDB (e.g. serverless), the tenants' KV-SQL and SQL-SQL queues live inside the external process SQL server. This means that, from the perspective of the system cluster, tenants only submit traffic to the KV queue. Because there is no back pressure to stop tenants from submitting work to the KV queue, they can starve the system cluster's own KV-SQL and SQL-SQL queues.
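The starvation described above can be illustrated with a toy model. This is only a sketch under simplifying assumptions, not CockroachDB's actual admission control code: the names `grant_tokens` and `simulate`, the per-tick token budget, and the arrival rates are all hypothetical, and real admission control is considerably more nuanced than a single shared token pool.

```python
# Queue names in strict priority order, lowest layer first.
PRIORITY = ("KV", "KV-SQL", "SQL-SQL")


def grant_tokens(tokens, pending):
    """Hand out `tokens` in strict priority order.

    `pending` maps queue name -> number of waiting work items.
    Returns a dict of queue name -> items admitted in this tick.
    """
    admitted = {}
    for name in PRIORITY:
        n = min(tokens, pending.get(name, 0))
        admitted[name] = n
        tokens -= n
    return admitted


def simulate(ticks, tokens_per_tick, tenant_kv_per_tick, system_sql_per_tick):
    """Model the system cluster's queues under external tenant load.

    Tenant work arrives directly in the KV queue (its SQL queues live in
    the external SQL server, so nothing throttles it here). The system
    cluster's own SQL responses arrive in the KV-SQL queue.
    Returns the total number of items admitted per queue.
    """
    pending = {name: 0 for name in PRIORITY}
    total = {name: 0 for name in PRIORITY}
    for _ in range(ticks):
        pending["KV"] += tenant_kv_per_tick
        pending["KV-SQL"] += system_sql_per_tick
        admitted = grant_tokens(tokens_per_tick, pending)
        for name, n in admitted.items():
            pending[name] -= n
            total[name] += n
    return total


# Tenant KV arrivals exceed the token budget: the system KV-SQL queue gets nothing.
overloaded = simulate(ticks=100, tokens_per_tick=10,
                      tenant_kv_per_tick=15, system_sql_per_tick=2)

# Tenant KV arrivals fit within the budget: leftover tokens reach KV-SQL.
healthy = simulate(ticks=100, tokens_per_tick=10,
                   tenant_kv_per_tick=5, system_sql_per_tick=2)
```

In the overloaded run the KV queue always has pending work, so the system cluster's KV-SQL queue is never admitted, and nothing in the model slows the tenants down. In a single-process deployment the starved SQL queues would themselves be the source of the KV work, which is what eventually relieves the pressure.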
To Reproduce
What was the impact?
When a physical cluster is overloaded by an application virtual cluster, the system cluster is unable to process any SQL. It's unclear what the exact impact is, but here are a few user-facing consequences:
Workarounds
External process deployments can work around this by disabling the KV-SQL and SQL-SQL queues for the system virtual cluster.
Jira issue: CRDB-36184