cockroachdb / cockroach

CockroachDB - the open source, cloud-native distributed SQL database.
https://www.cockroachlabs.com

multitenant: Throttled queries do not make progress once distributed token bucket is refilled #101813

Open alyshanjahani-crl opened 1 year ago

alyshanjahani-crl commented 1 year ago

Describe the problem

A SQL pod/process that becomes throttled remains throttled even after the global/distributed token bucket (system.tenant_usage) has been refilled with plenty of tokens.

To Reproduce

Set up a multi-tenant CRDB cluster, create a secondary tenant (in this example, tenant id=2), and create a user/password for the tenant. On the system tenant, set the resource limits for tenant 2 so that it has a small number of RUs and no refill rate:

-- for tenant with id=2, set 10000 RUs, 0 refill rate
select crdb_internal.update_tenant_resource_limits(2, 10000, 0, 0, now(), 0);

Initialize a schema-heavy workload (tpcc). This consumes more than 10000 RUs, so the SQL pod/process gets throttled and the schema loading hangs:

alyshanjahani@crlMBP-C02ZP0C2MD6TMTIy ~ % cockroach workload init tpcc "postgresql://<user>:<password>@localhost:26257/tpcc?sslmode=require"
I230418 17:14:20.448375 1 workload/workloadsql/dataload.go:146  [-] 1  imported warehouse (0s, 1 rows)
I230418 17:14:20.526387 1 workload/workloadsql/dataload.go:146  [-] 2  imported district (0s, 10 rows)
// at this point the command is hanging

On the system tenant, the token bucket for the tenant is now in debt (negative value):

root@localhost:26257/defaultdb> select ru_current from system.tenant_usage where instance_id=0 and tenant_id=2;
-[ RECORD 1 ]
ru_current        | -37985.90773410235

Refill the token bucket with a large value, such as 100M RUs:

select crdb_internal.update_tenant_resource_limits(2, 100000000, 0, 0, now(), 0);

The command is still hanging, and other connections and queries fail/hang as well. However, if we spin up a new SQL pod for the tenant, connections and queries on that pod will succeed.

Expected behavior

The queries on the SQL process should be able to make progress again once tokens have been filled in the global token bucket for that tenant.

It seems like the SQL process is not reaching out to KV / the global token bucket to refill its local token bucket with the newly added tokens as described here
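To make the expected behavior concrete, here is a minimal sketch of a local bucket that asks the global bucket again whenever it is short on tokens. This is a toy model, not CockroachDB's tenantcostclient code: the names `GlobalBucket`, `LocalBucket`, `request`, `try_consume`, and `refill_chunk` are all hypothetical, and the model ignores debt repayment, refill rates, and RPC failures.

```python
from dataclasses import dataclass


@dataclass
class GlobalBucket:
    """Hypothetical stand-in for the per-tenant bucket in system.tenant_usage."""
    tokens: float

    def request(self, amount: float) -> float:
        """Grant up to `amount` tokens; grant nothing while empty or in debt."""
        granted = max(0.0, min(amount, self.tokens))
        self.tokens -= granted
        return granted


@dataclass
class LocalBucket:
    """Hypothetical stand-in for the bucket held by a single SQL pod."""
    source: GlobalBucket
    tokens: float = 0.0
    refill_chunk: float = 1000.0  # assumed minimum request size

    def try_consume(self, cost: float) -> bool:
        """Return True if the operation may proceed, False if it is throttled."""
        if self.tokens < cost:
            # Ask the global bucket on every shortfall, so that a refill made
            # there is noticed by a pod that is already throttled.
            self.tokens += self.source.request(max(cost, self.refill_chunk))
        if self.tokens < cost:
            return False
        self.tokens -= cost
        return True


if __name__ == "__main__":
    shared = GlobalBucket(tokens=10_000)
    pod = LocalBucket(source=shared)
    print(pod.try_consume(8_000))   # enough tokens available
    print(pod.try_consume(8_000))   # global bucket exhausted, pod is throttled
    shared.tokens = 100_000_000     # operator refills the global bucket
    print(pod.try_consume(8_000))   # pod should make progress again
```

In this model the third call succeeds because the throttled pod re-requests tokens after the refill. The behavior reported above corresponds to that re-request either not happening or not succeeding.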

Environment:

Jira issue: CRDB-27133

JeffSwenson commented 9 months ago

I ran into something that may be related. I was testing RU/s limits and token bucket exhaustion. When I started draining one of the KV nodes in the cluster for a configuration change, a SQL server got stuck and kept logging the following:

W230928 20:27:05.192839 8329 ccl/multitenantccl/tenantcostclient/tenant_side.go:552 ⋮ [T7,nsql2] 774  TokenBucket RPC error: ‹×›: cannot acquire lease when draining
W230928 20:27:06.192972 8350 ccl/multitenantccl/tenantcostclient/tenant_side.go:552 ⋮ [T7,nsql2] 775  TokenBucket RPC error: ‹×›: cannot acquire lease when draining
W230928 20:27:07.194779 8356 ccl/multitenantccl/tenantcostclient/tenant_side.go:552 ⋮ [T7,nsql2] 776  TokenBucket RPC error: ‹×›: cannot acquire lease when draining
W230928 20:27:08.193944 8357 ccl/multitenantccl/tenantcostclient/tenant_side.go:552 ⋮ [T7,nsql2] 777  TokenBucket RPC error: ‹×›: cannot acquire lease when draining
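
For anyone comparing their own logs against the excerpt above, a small helper like the following can measure how often the TokenBucket RPC error repeats. It is only a sketch: `retry_intervals` is a hypothetical name, and the regular expression assumes the line prefix looks like the lines shown here (severity letter, yymmdd date, then a time with microseconds).

```python
import re
from datetime import datetime

# Severity letter, date (yymmdd) and time at the start of a log line,
# as seen in the excerpt above. Other log formats are not handled.
LOG_PREFIX = re.compile(r"^[IWEF](\d{6}) (\d{2}:\d{2}:\d{2}\.\d{6}) ")


def retry_intervals(lines):
    """Return the seconds between consecutive 'TokenBucket RPC error' lines."""
    stamps = []
    for line in lines:
        match = LOG_PREFIX.match(line)
        if match and "TokenBucket RPC error" in line:
            text = f"{match.group(1)} {match.group(2)}"
            stamps.append(datetime.strptime(text, "%y%m%d %H:%M:%S.%f"))
    return [(b - a).total_seconds() for a, b in zip(stamps, stamps[1:])]
```

Applied to the four lines above, the intervals come out at roughly one second each, which suggests the SQL server keeps retrying the request on a fixed cadence while the error persists.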