Open mgartner opened 2 months ago
This might actually be mitigated by:
and
But I'm not 100% sure. I believe this only mitigates the problem if the long-running job took longer than table2
's 2 * avgFullRefreshTime(table2) + r.extraTime
, where r.extraTime
is a random interval in the range [0, 1hr)
. Given that avgFullRefreshTime could be quite long, this may be unlikely to occur.
When a gateway node makes the probabilistic decision to request stats for a table, it will schedule a retry for that request if there is already another stats job running. The retry is scheduled here by adding a
mutation
withrowsAffected=math.MaxInt32
to themutations
channel:https://github.com/cockroachdb/cockroach/blob/3794f0e7511765473fe38efb612a46cd8fb564aa/pkg/sql/stats/automatic_stats.go#L808-L813
That retry is always guaranteed to eventually collect stats successfully, by this conditional:
https://github.com/cockroachdb/cockroach/blob/3794f0e7511765473fe38efb612a46cd8fb564aa/pkg/sql/stats/automatic_stats.go#L789-L792
Assume there is a long-running stats job that blocks stats requests on
table1
. If the job runs long enough, it is possible, even probably in some cases, that more than on gateway node will request stats on another table,table2
. Once the long-running stats job fortable1
completes, each node requesting stats ontable2
will race to start a new job, and although one will win, the other nodes' requests will not be canceled once stats have been collected ontable2
. As a result, stats will be collected for every node that decided to request stats, even if stats were just collected.I could imagine situations in which many stats would be collected for the same table, one right after the other.
To fix this, I believe we need the retry mechanism to consider whether stats for the table were recently collected. Or perhaps there is another suitable heuristic.
Jira issue: CRDB-38411