cockroachdb / cockroach

CockroachDB - the open source, cloud-native distributed SQL database.
https://www.cockroachlabs.com
Other
29.5k stars 3.7k forks source link

sql: cancel duplicate stats requests once unblocked by other long-running stats requests #123588

Open mgartner opened 2 months ago

mgartner commented 2 months ago

When a gateway node makes the probabilistic decision to request stats for a table, it will schedule a retry for that request if there is already another stats job running. The retry is scheduled here by adding a mutation with rowsAffected=math.MaxInt32 to the mutations channel:

https://github.com/cockroachdb/cockroach/blob/3794f0e7511765473fe38efb612a46cd8fb564aa/pkg/sql/stats/automatic_stats.go#L808-L813

That retry is always guaranteed to eventually collect stats successfully, by this conditional:

https://github.com/cockroachdb/cockroach/blob/3794f0e7511765473fe38efb612a46cd8fb564aa/pkg/sql/stats/automatic_stats.go#L789-L792

Assume there is a long-running stats job that blocks stats requests on table1. If the job runs long enough, it is possible, even probably in some cases, that more than on gateway node will request stats on another table, table2. Once the long-running stats job for table1 completes, each node requesting stats on table2 will race to start a new job, and although one will win, the other nodes' requests will not be canceled once stats have been collected on table2. As a result, stats will be collected for every node that decided to request stats, even if stats were just collected.

I could imagine situations in which many stats would be collected for the same table, one right after the other.

To fix this, I believe we need the retry mechanism to consider whether stats for the table were recently collected. Or perhaps there is another suitable heuristic.

Jira issue: CRDB-38411

mgartner commented 2 months ago

This might actually be mitigated by:

https://github.com/cockroachdb/cockroach/blob/3794f0e7511765473fe38efb612a46cd8fb564aa/pkg/sql/stats/automatic_stats.go#L753-L773

and

https://github.com/cockroachdb/cockroach/blob/3794f0e7511765473fe38efb612a46cd8fb564aa/pkg/sql/stats/automatic_stats.go#L799-L806

But I'm not 100% sure. I believe this only mitigates the problem if the long-running job took longer than table2's 2 * avgFullRefreshTime(table2) + r.extraTime, where r.extraTime is a random interval in the range [0, 1hr). Given that avgFullRefreshTime could be quite long, this may be unlikely to occur.