sql: stats collection doesn't get re-tried after a server shutdown

knz commented 1 year ago

Discovered by @rytaft

Describe the problem

The automatic stats collection (used by SQL query planning) is triggered lazily when tables encounter mutations.

This lazy mechanism is based on an in-memory FIFO (a Go channel).

If a server gets shut down after a SQL mutation completes but before the mutation was popped from this FIFO (and thus before the job to compute the stats has been created), there will never be any re-computation.

This can result in bad query plans for tables that had a bunch of changes just before a node shuts down and then never change again.

To Reproduce

Issue a large number of mutations to a table and issue a server shutdown in the middle of these mutations. Then restart the server and do not modify the table again. Notice that the stats remain out of date forever.

Expected behavior

Either persist the in-flight FIFO queue of mutated tables so they get a chance to see their jobs created when the server starts again.
Wait during graceful shutdown until all the jobs are created, but without running them. (This could delay the drain in an unacceptable way if there were many tables. Testing should evaluate this overhead)

Environment:

crdb v23.1 and master branch

Jira issue: CRDB-26459

rytaft commented 1 year ago

Thanks for filing, @knz!

I think this potential problem has existed since auto stats were first introduced, but we haven't seen it become a major problem in practice. Putting this on the backlog, but we should keep an eye out in case this does become a bigger problem.

knz commented 1 year ago

One concrete example of this bug is the risk to leave stale entries for dropped tables in system.table_statistics. See #108813.

michae2 commented 1 year ago

One concrete example of this bug is the risk to leave stale entries for dropped tables in system.table_statistics. See #108813.

Cleaning up after dropped tables was recently fixed by https://github.com/cockroachdb/cockroach/pull/105364.

yuzefovich commented 1 year ago

One concrete example of this bug is the risk to leave stale entries for dropped tables in system.table_statistics.

This seems orthogonal to the original issue. Mutations don't currently drive whether the stats for dropped tables are removed (and that has been fixed by #105364 as Michael mentioned). The original issue indicating that FIFO of mutations' notifications are not persisted across node restarts, and as a result we might not collect new table stats even though we should, that issue is still present.

cockroachdb / cockroach

sql: stats collection doesn't get re-tried after a server shutdown #100482