Open alyshanjahani-crl opened 1 year ago
I ran into something that may be related. I was testing RU/s limits and token bucket exhaustion. When I started draining one of the KV nodes in the cluster for a configuration change, a SQL server got stuck, repeatedly logging the following lines:
W230928 20:27:05.192839 8329 ccl/multitenantccl/tenantcostclient/tenant_side.go:552 ⋮ [T7,nsql2] 774 TokenBucket RPC error: ‹×›: cannot acquire lease when draining
W230928 20:27:06.192972 8350 ccl/multitenantccl/tenantcostclient/tenant_side.go:552 ⋮ [T7,nsql2] 775 TokenBucket RPC error: ‹×›: cannot acquire lease when draining
W230928 20:27:07.194779 8356 ccl/multitenantccl/tenantcostclient/tenant_side.go:552 ⋮ [T7,nsql2] 776 TokenBucket RPC error: ‹×›: cannot acquire lease when draining
W230928 20:27:08.193944 8357 ccl/multitenantccl/tenantcostclient/tenant_side.go:552 ⋮ [T7,nsql2] 777 TokenBucket RPC error: ‹×›: cannot acquire lease when draining
Describe the problem
A SQL pod/process that becomes throttled remains throttled even once plenty of tokens are filled in the global/distributed token bucket (system.tenant_usage).
To Reproduce
Set up a multi-tenant CRDB cluster, create a secondary tenant (in this example, tenant id=2), and create a user/password for the tenant. On the system tenant, set the resource limits for tenant 2 so that it has a small amount of RUs and no refill rate.
Initialize a schema-heavy workload (tpcc). This consumes more than 10000 RUs, so the SQL pod/process gets throttled (the loading of schemas hangs).
On the system tenant, you'll notice that the token bucket for the tenant is in debt (negative value).
Set the token bucket to be full again with a large value like 100M RUs.
The command is still hanging, and other connections and queries fail/hang as well. However, if we spin up a new SQL pod for the tenant, connections and queries on that pod will succeed.
Expected behavior
The queries on the SQL process should be able to make progress again once tokens have been filled in the global token bucket for that tenant.
It seems like the SQL process is not reaching out to KV / the global token bucket to refill its local token bucket with the newly added tokens as described here
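To make the expected behavior concrete, here is a minimal Go sketch of a local bucket that asks the global bucket again whenever it cannot cover a request. This is an illustrative model only, not CockroachDB's tenantcostclient code: the GlobalBucket and LocalBucket types and their methods are hypothetical names invented for this example, and the real implementation talks to KV over the TokenBucket RPC.

```go
package main

import "fmt"

// GlobalBucket is a hypothetical stand-in for the tenant's shared
// token bucket. Tokens may be negative when the tenant is in debt.
type GlobalBucket struct {
	Tokens float64
}

// Request grants up to n tokens, never more than are available.
// It grants nothing while the bucket is empty or in debt.
func (g *GlobalBucket) Request(n float64) float64 {
	if g.Tokens <= 0 {
		return 0
	}
	granted := n
	if g.Tokens < granted {
		granted = g.Tokens
	}
	g.Tokens -= granted
	return granted
}

// LocalBucket is a hypothetical per-SQL-process bucket.
type LocalBucket struct {
	Tokens float64
	Global *GlobalBucket
}

// TryConsume reports whether a request costing `cost` tokens may run.
// When local tokens are insufficient it asks the global bucket again,
// instead of staying throttled on a stale view of the global state.
func (l *LocalBucket) TryConsume(cost float64) bool {
	if l.Tokens < cost {
		l.Tokens += l.Global.Request(cost - l.Tokens)
	}
	if l.Tokens < cost {
		return false // still throttled
	}
	l.Tokens -= cost
	return true
}

func main() {
	global := &GlobalBucket{Tokens: -500} // tenant is in debt
	local := &LocalBucket{Global: global}

	fmt.Println(local.TryConsume(10)) // throttled: global has nothing to grant

	global.Tokens = 1e8               // operator refills the bucket (100M RUs)
	fmt.Println(local.TryConsume(10)) // proceeds: local bucket re-requests tokens
}
```

In this model the second call succeeds because the local bucket contacts the global bucket on every shortfall. The bug described above behaves as if that second request never happens, or never succeeds, on the already-throttled SQL process.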
Environment:
Jira issue: CRDB-27133