Open jeffswenson opened 11 months ago
Another angle: why crash with `exiting heartbeat loop with error: session record deleted`? Is crashing a useful thing to do here? It likely increases the load from the tenant due to retries, plus the temporary loss of availability of the SQL pods. FWIW, KV doesn't crash when KV liveness is lost. Is it possible to follow that pattern here?

BTW, this issue increases alerting noise on the SRE pages, since many causes of SQL pod restarts are actionable for CRL -- that is, they are not a workload issue.
The crash occurs because SQL servers can't change their instance_id, and the instance_id is leased by the SQL server.
Since this issue was originally written, we have increased how much kv time each tenant is able to consume per kv server, so it is difficult for tenants to trigger this in production. As admission control improves, I suspect we will eventually remove the kv rate limiter altogether and rely on admission control + rebalancing to deal with hot spots.
Multi-tenant external process SQL server deployments of CRDB use a KV-side tenant rate limiter to control how much of a single kv server a tenant can consume. The rate limiter is distinct from admission control because it tries to limit a tenant to a percentage of node capacity even if there is slack capacity. Admission control, on the other hand, will only kick in if the node is observing side effects of overload, like increased goroutine queue depth or LSM inversion.
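To make the distinction concrete, here is a minimal sketch of the idea, not the actual kvserver tenant rate limiter: each tenant gets a token bucket sized to a fixed fraction of the node's capacity, so it is throttled at that share even when the rest of the node is idle. The capacity numbers, the 10% share, and the "KV units" cost model are all illustrative assumptions.

```go
package main

import (
	"context"
	"fmt"
	"time"

	"golang.org/x/time/rate"
)

// tenantLimiter caps a single tenant at a fixed share of one node's KV
// capacity, measured here in abstract "KV units" per second. Unlike
// admission control, the cap applies even when the node has slack.
type tenantLimiter struct {
	limiter *rate.Limiter
}

func newTenantLimiter(nodeCapacityPerSec, tenantShare float64) *tenantLimiter {
	r := nodeCapacityPerSec * tenantShare
	// Allow a small burst so short spikes are not rejected outright.
	return &tenantLimiter{limiter: rate.NewLimiter(rate.Limit(r), int(r))}
}

// wait blocks until the tenant has budget for `cost` KV units, or returns
// an error if the context expires first. A workload that sustains more than
// its share ends up queueing here, which is how sqlliveness and lease
// traffic can get starved today.
func (t *tenantLimiter) wait(ctx context.Context, cost int) error {
	return t.limiter.WaitN(ctx, cost)
}

func main() {
	// A tenant limited to 10% of a node that can serve 1000 KV units/sec.
	lim := newTenantLimiter(1000, 0.10)
	ctx, cancel := context.WithTimeout(context.Background(), time.Second)
	defer cancel()
	if err := lim.wait(ctx, 50); err != nil {
		fmt.Println("request throttled:", err)
		return
	}
	fmt.Println("request admitted")
}
```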
If a tenant is running into the kv-side rate limits, it can starve critical traffic like sql liveness and sql leases. Starving sql liveness causes sql servers to crash with the following error log: "exiting heartbeat loop with error: session record deleted". This is not the only cause of the "exiting heartbeat loop" error; it occurs whenever a sql server is unable to communicate with the kv layer for more than 30 seconds.
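Below is an illustrative sketch of that failure mode, assuming hypothetical names (`heartbeatLoop`, `extendSession`) rather than the real sqlliveness implementation: if every heartbeat write is throttled or fails for longer than the outage budget, the session record expires and the server gives up, which is what surfaces as the log line above. The 5s/30s constants are stand-ins, not the actual defaults.

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"time"
)

// heartbeatLoop is a sketch (not the real sqlliveness code) of why the SQL
// server ends up exiting. extendSession stands in for whatever KV write
// keeps the session record alive; when KV is unreachable or throttled long
// enough for the record to expire and be deleted, the loop returns and the
// server shuts down, since it cannot keep its instance_id lease.
func heartbeatLoop(ctx context.Context, extendSession func(context.Context) error) error {
	const (
		heartbeatInterval = 5 * time.Second  // illustrative interval
		maxKVOutage       = 30 * time.Second // mirrors the ~30s window described above
	)
	lastSuccess := time.Now()
	ticker := time.NewTicker(heartbeatInterval)
	defer ticker.Stop()
	for {
		select {
		case <-ctx.Done():
			return ctx.Err()
		case <-ticker.C:
		}
		if err := extendSession(ctx); err != nil {
			if time.Since(lastSuccess) > maxKVOutage {
				// Surfaces as: "exiting heartbeat loop with error: session record deleted"
				return errors.New("session record deleted")
			}
			continue // retry on the next tick while within the outage budget
		}
		lastSuccess = time.Now()
	}
}

func main() {
	// Simulate a kv layer that never responds (e.g. the tenant is fully
	// rate limited), so every heartbeat fails until the budget runs out.
	err := heartbeatLoop(context.Background(), func(context.Context) error {
		return errors.New("kv unavailable")
	})
	fmt.Println("exiting heartbeat loop with error:", err)
}
```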
Critical internal traffic should be exempt from the tenant rate limiter.
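A rough sketch of what such an exemption could look like, again using Go's `golang.org/x/time/rate` package; the `criticalInternal` tag and `admit` helper are hypothetical, not the existing tenant rate limiter API. Requests marked as critical internal traffic skip the limiter entirely, while ordinary tenant traffic still waits for tokens, so a saturated workload can no longer starve the requests the SQL server needs to stay alive.

```go
package main

import (
	"context"
	"fmt"

	"golang.org/x/time/rate"
)

// kvRequest tags the per-request metadata the limiter would need. The
// criticalInternal flag is hypothetical: the real change would need a way
// to identify sqlliveness and sql lease traffic when the request is issued.
type kvRequest struct {
	cost             int
	criticalInternal bool // e.g. sqlliveness heartbeat, sql lease renewal
}

// admit applies the tenant rate limit to ordinary traffic but lets critical
// internal traffic through untouched.
func admit(ctx context.Context, lim *rate.Limiter, req kvRequest) error {
	if req.criticalInternal {
		return nil
	}
	return lim.WaitN(ctx, req.cost)
}

func main() {
	// A deliberately tiny budget so ordinary traffic is throttled hard.
	lim := rate.NewLimiter(1, 1)
	ctx := context.Background()

	// Ordinary tenant traffic waits for tokens...
	_ = admit(ctx, lim, kvRequest{cost: 1})
	// ...but a sqlliveness heartbeat is admitted immediately regardless.
	if err := admit(ctx, lim, kvRequest{cost: 1, criticalInternal: true}); err == nil {
		fmt.Println("critical internal request admitted without throttling")
	}
}
```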
Jira issue: CRDB-33986
gz#21258