aadityasondhi opened 3 months ago
Just to clarify, TTL does not do a full table scan, per se. Each leaseholder node of the table scans only the ranges it holds the lease for. There are two levels of concurrency: GOMAXPROCS goroutines working concurrently. This parallelism is probably one of the reasons why read bandwidth can get saturated.
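To make the fan-out concrete, here is a minimal sketch of GOMAXPROCS-bounded range scanning. The `scanRanges` function, the slice-of-slices range representation, and the expiration-timestamp encoding are all illustrative assumptions, not the actual TTL job code:

```go
package main

import (
	"fmt"
	"runtime"
	"sync"
)

// scanRanges fans out one goroutine per range held by this leaseholder,
// bounded by GOMAXPROCS via a semaphore channel, and returns the number
// of expired rows found. Hypothetical sketch: each range is modeled as a
// slice of per-row expiration timestamps (nanoseconds).
func scanRanges(ranges [][]int64, nowNanos int64) int {
	sem := make(chan struct{}, runtime.GOMAXPROCS(0)) // concurrency bound
	var wg sync.WaitGroup
	var mu sync.Mutex
	expired := 0
	for _, r := range ranges {
		wg.Add(1)
		sem <- struct{}{} // acquire a worker slot
		go func(rows []int64) {
			defer wg.Done()
			defer func() { <-sem }() // release the slot
			n := 0
			for _, expiresAt := range rows {
				if expiresAt <= nowNanos {
					n++
				}
			}
			mu.Lock()
			expired += n
			mu.Unlock()
		}(r)
	}
	wg.Wait()
	return expired
}

func main() {
	ranges := [][]int64{{1, 5, 9}, {2, 8}, {10, 11}}
	fmt.Println(scanRanges(ranges, 8)) // rows with expiresAt <= 8
}
```

With up to GOMAXPROCS goroutines each issuing range scans at once, it is easy to see how aggregate read bandwidth on a node can be saturated.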
Also, each of the goroutines mentioned above deletes the keys from a range immediately after scanning it. Does the TTL code also need to integrate with the disk write traffic limiter?
We have seen full table scans that occur as part of row-level TTL cause overload by saturating the read bandwidth on a node.
Note: as of https://github.com/cockroachdb/cockroach/pull/124184, the default ttl_delete_rate_limit is now 100 (it used to be unlimited). To resolve this issue, we may want to set that cluster setting back to unlimited in the tests that cover it.
Jira issue: CRDB-37530
Epic: CRDB-37479