cockroachdb / cockroach

CockroachDB — the cloud native, distributed SQL database designed for high availability, effortless scale, and control over data placement.
https://www.cockroachlabs.com

Application Clusters Can Starve the System Cluster #119417

Open jeffswenson opened 7 months ago

jeffswenson commented 7 months ago

Describe the problem

Admission control is managed via several distinct queues:

  - KV: work admitted at the KV layer
  - KV-SQL: response transferred from the KV layer to the SQL layer
  - SQL-SQL: response transferred from a distsql leaf to a distsql root

When the system is overloaded, tokens are allocated with preference to the lowest-level queue. That is, if the KV queue contains pending work, no tokens are distributed to the KV-SQL or SQL-SQL queues. This works because starving the SQL queues eventually reduces the submission rate to the KV queue, which in turn allows the system to distribute tokens to the SQL queues.
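To make the preference order concrete, here is a minimal sketch of strict-priority token allocation. It is a toy model written for this issue, not CockroachDB's admission control code: the function name, the per-tick token budget, and the dict-based queues are all assumptions.

```python
# Queue names in priority order, lowest level first.
# Hypothetical model; not CockroachDB's actual data structures.
PRIORITY_ORDER = ["KV", "KV-SQL", "SQL-SQL"]


def allocate_tokens(pending, tokens):
    """Grant tokens to queues in strict priority order.

    pending: dict mapping queue name -> number of waiting work items.
    tokens:  number of work items that can be admitted this tick.
    Returns a dict mapping queue name -> number of items admitted.
    """
    admitted = {}
    for name in PRIORITY_ORDER:
        grant = min(pending.get(name, 0), tokens)
        admitted[name] = grant
        tokens -= grant
    return admitted


if __name__ == "__main__":
    # KV demand exceeds the budget, so the SQL queues receive nothing.
    print(allocate_tokens({"KV": 10, "KV-SQL": 4, "SQL-SQL": 2}, tokens=8))
    # KV demand fits in the budget, so leftover tokens reach the SQL queues.
    print(allocate_tokens({"KV": 3, "KV-SQL": 4, "SQL-SQL": 2}, tokens=8))
```

In this model the SQL queues only make progress once the KV backlog is smaller than the token budget, which is the property the rest of this issue depends on.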

In an external process deployment of CRDB (e.g. serverless), the KV-SQL and SQL-SQL queues live inside the external process SQL server. This means that, from the perspective of the system cluster, tenants only submit traffic to the KV queue. Tenants can therefore starve the system cluster's KV-SQL and SQL-SQL queues, because there is no back pressure to prevent them from submitting work to the KV queue.
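The missing back pressure can be illustrated with a small closed-loop simulation. This is a sketch under several assumptions that do not come from CockroachDB itself: a fixed token budget per tick, a fixed pool of tenant workers that each keep one request outstanding, five pending system-tenant responses, and leftover tokens going to the system tenant before the application tenant.

```python
def simulate(ticks, tokens_per_tick, tenant_workers, external_process):
    """Return how many system-tenant SQL responses are admitted in `ticks` ticks.

    Toy model: all names and numbers here are illustrative assumptions.
    """
    kv_queue = tenant_workers  # every tenant worker starts with one KV request queued
    tenant_sql_queue = 0       # tenant responses waiting in the host's KV-SQL queue
    system_sql_queue = 5       # system-tenant responses waiting in the host's KV-SQL queue
    system_sql_admitted = 0

    for _ in range(ticks):
        tokens = tokens_per_tick

        # KV work is always served first.
        kv_admitted = min(kv_queue, tokens)
        kv_queue -= kv_admitted
        tokens -= kv_admitted

        if external_process:
            # Responses leave the host without queuing there, so the
            # workers resubmit KV work immediately: no back pressure.
            kv_queue += kv_admitted
        else:
            # Responses must pass through the host's KV-SQL queue before
            # the worker can issue its next request.
            tenant_sql_queue += kv_admitted

        # Leftover tokens go to the SQL queue (system first: an assumption).
        grant = min(system_sql_queue, tokens)
        system_sql_queue -= grant
        system_sql_admitted += grant
        tokens -= grant

        grant = min(tenant_sql_queue, tokens)
        tenant_sql_queue -= grant
        kv_queue += grant  # each admitted response lets a worker send its next request

    return system_sql_admitted


if __name__ == "__main__":
    print("shared process:  ", simulate(10, 10, 20, external_process=False))
    print("external process:", simulate(10, 10, 20, external_process=True))
```

With 20 workers and 10 tokens per tick, the shared-process run drains the KV queue after two ticks and then admits the system tenant's responses, while the external-process run keeps the KV queue full and never admits any of them.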

To Reproduce

  1. Create a roachprod cluster with at least one external process application cluster.
  2. Run a workload on the application cluster that generates a large amount of CPU admission control queuing at the KV layer. An easy way to achieve this is a write-only KV workload with batching.
  3. Pick one of the overloaded KV servers and attempt to connect to the system cluster using a SQL shell. The SQL shell will hang since authorization depends on the KV-SQL queue.
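Step 3 can be checked without waiting on an interactive shell by running the connection attempt under a timeout. The following is a rough sketch: the `probe` helper, the connection URL, and the 10 second timeout are all placeholders, and it assumes the `cockroach` binary is on the `PATH`.

```python
import subprocess
import sys


def probe(command, timeout_seconds):
    """Run a command and classify the result as 'ok', 'error', or 'hang'."""
    try:
        result = subprocess.run(
            command,
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
            timeout=timeout_seconds,
        )
    except subprocess.TimeoutExpired:
        # The child is killed on timeout; treat this as the hang described above.
        return "hang"
    except OSError:
        # For example, the binary is not installed on this machine.
        return "error"
    return "ok" if result.returncode == 0 else "error"


if __name__ == "__main__":
    # Hypothetical URL; replace with the address of an overloaded KV server.
    url = "postgresql://root@localhost:26257/defaultdb?sslmode=disable"
    print(probe(["cockroach", "sql", "--url", url, "-e", "SELECT 1"], 10))
```

A result of `hang` on an overloaded node, together with `ok` on an idle node, would be consistent with the behavior described in step 3.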

What was the impact?

When a physical cluster is overloaded by an application virtual cluster, the system cluster is unable to process any SQL. The full impact is unclear, but here are a few user-facing consequences:

  1. Tenant RU accounting stalls.
  2. Tenant creation fails.
  3. Backups stall.

Workarounds

External process deployments can work around this by disabling the KV-SQL and SQL-SQL queues for the system virtual cluster.

Jira issue: CRDB-36184

jeffswenson commented 7 months ago

cc @sumeerbhola as FYI

sumeerbhola commented 7 months ago

> External process deployments can work around this by disabling the KV-SQL and SQL-SQL queues for the system virtual cluster.

Is this convenient enough?

Without fixing https://github.com/cockroachdb/cockroach/issues/85471 (which is hard, though there is a promising approach in https://github.com/cockroachdb/cockroach/issues/91536#issuecomment-1753633497), any "fix" on the admission control (AC) side would also amount to the system tenant bypassing AC for SQLKVResponseWork and SQLSQLResponseWork.

jeffswenson commented 7 months ago

> Is this convenient enough?

Yeah, I think this is fine for Serverless. We basically need to run the following commands on each host cluster.

-- Set overrides so that admission control remains enabled within the tenant SQL servers
-- when we disable it on the host.
ALTER TENANT ALL SET CLUSTER SETTING admission.sql_kv_response.enabled = true;
ALTER TENANT ALL SET CLUSTER SETTING admission.sql_sql_response.enabled = true;

-- Disable SQL admission control for the host cluster.
SET CLUSTER SETTING admission.sql_kv_response.enabled = false;
SET CLUSTER SETTING admission.sql_sql_response.enabled = false;

jeffswenson commented 7 months ago

Here's a task to track rolling out the mitigation to Cockroach Cloud: https://cockroachlabs.atlassian.net/browse/CRDB-36207