rytaft commented 3 years ago

We have recently had reports of the optimizer choosing a bad plan due to inaccurate or misleading statistics on small tables. This can happen for a couple of reasons:

The table is brand new. We always collect statistics when a table is created, so unless it was created with data already in the table using CREATE TABLE AS ..., the first statistics collection will show that there are 0 rows. To avoid constantly refreshing small tables, we have a setting that prevents refreshing until at least 500 rows have changed. So we won't refresh the stats for this table again until at least 500 rows have been inserted. Therefore, the table could have 499 rows, but the optimizer will still think it has 0 rows. (We actually assume that all tables have at least one row to prevent creating really bad plans, but an efficient plan for 1 row could be very different from an efficient plan for 499 rows).
The size of the table is reflected correctly by the statistics, but scanning the table is much more expensive than we estimate it to be because there are many tombstones in the storage layer. This can happen when a small table is very frequently updated.

To fix the first issue, we have discussed a couple of ideas:

Always assume that every table has at least 500 rows. This has the advantage of being simple, and will probably do the right thing most of the time. It would also help ensure plan stability for small tables (regardless of whether they were created recently or not), and might help fix the second issue above related to tombstones. It could have some unintended consequences, though (e.g., not using a lookup/index join when we should)
We could disable the 500-row check when triggering automatic statistics refreshes until we have at least 4-5 stats refreshes completed. This would give us more accurate stats for brand new tables, but might not help in the case where a table has been around for a while. It would not help with the tombstone issue above either.

To fix the second issue, there are some additional options:

We could actually include the number of tombstones per table as an additional statistic that we collect, and/or collect the average number of keys/bytes scanned or skipped over per table row
We could have some notion of "plan stability", which we could get by modifying the plan cache somehow. For example, instead of just invalidating the plan cache when stats change, we could save the plan for comparison against a new plan, and select the old plan if it's within some small cost factor
We could also just increase the cost of unconstrained, unlimited scans, but that seems like it could cause other problems and lead us to pick worse plans in cases where the table doesn't actually have a lot of tombstones

To deal effectively with this issue, we'll probably want to implement some combination of the above ideas.

cc @RaduBerinde, @nvanbenschoten

Epic: CRDB-16930

Jira issue: CRDB-2921

Jira issue: CRDB-13904

rytaft commented 3 years ago

Always assuming 500 rows seems like a nice and simple solution, but it does have a few complications/drawbacks in addition to the ones mentioned above. In particular, we need to figure out what to do about the column statistics. Column statistics consist of histograms, distinct counts, and null counts, and we use them in the optimizer to estimate the selectivity of predicates, estimate the cardinality of GROUP BY operations, and more. How will we change the histograms, distinct counts, and null counts to match up with the row count of 500? If the stats show that we have X rows, do we simply multiply everything by 500/X?

Presumably we would want to ignore any existing column statistics if the row count is very small (e.g., 0 or 1 rows), but at what point should we actually use those stats? When there are at least 5 rows? 10 rows?

This complication could be one reason to prefer the alternative idea listed above of allowing more frequent refreshes of small tables if there are fewer than 4-5 recent refreshes.

nvanbenschoten commented 3 years ago

It could have some unintended consequences, though (e.g., not using a lookup/index join when we should)

Could you explain why always assuming a table has at least 500 rows would result in not using a lookup join when we should? If anything, I would expect it to have the opposite effect – using a lookup join when we shouldn't.

We could also just increase the cost of unconstrained, unlimited scans, but that seems like it could cause other problems and lead us to pick worse plans in cases where the table doesn't actually have a lot of tombstones

Are there cases that we know about where this would result in worse plans?

rytaft commented 3 years ago

Could you explain why always assuming a table has at least 500 rows would result in not using a lookup join when we should? If anything, I would expect it to have the opposite effect – using a lookup join when we shouldn't.

That could happen if the small table is the input to the lookup join. We might choose a hash or merge join because we think the lookup join will be too expensive, when in fact that would be the better plan.

Are there cases that we know about where this would result in worse plans?

I don't know of a particular issue/example, but I would think that if we had to choose between doing a full table scan of a small table v. doing a bunch of index/lookup joins so that we could do a constrained scan of a larger table, that would not be worth it.

There is definitely some tuning of the overall cost model that we can and should do, but it's dangerous to change the costs of different operators in isolation since it changes the relative costs of everything else. I'm not saying that increasing the cost of unconstrained scans is a bad idea, I just don't think we should do it without considering all the plan changes that will inevitably result.

RaduBerinde commented 3 years ago

Could you explain why always assuming a table has at least 500 rows would result in not using a lookup join when we should? If anything, I would expect it to have the opposite effect – using a lookup join when we shouldn't.

That could happen if the small table is the input to the lookup join. We might choose a hash or merge join because we think the lookup join will be too expensive, when in fact that would be the better plan.

I see, like joining a 10 row table with a 10000 row table. 10 lookup joins is clearly better than a full scan of the big table, but 500 lookup joins could be too expensive. Maybe 500 is too much, maybe it should be around the smallest number where we prefer index joins for single-key accesses (over full table scans).

We could actually include the number of tombstones per table as an additional statistic that we collect, and/or collect the average number of keys/bytes scanned or skipped over per table row

One problem with this is that the tombsones can expire without any mutations to the table, and we won't know to refresh the stats. Maybe it should be a historical average of what we've seen rather than what was there when the stats last ran.

Increasing the cost of unconstrained scans seems too arbitrary to me. A constrained scan can have the same problem (especially something like (/NULL - ])

rytaft commented 3 years ago

Maybe 500 is too much, maybe it should be around the smallest number where we prefer index joins for single-key accesses (over full table scans).

That could work -- we could do a hybrid solution for the first issue: pick a minimum value that's larger than 1 but smaller than 500, and also trigger stats refreshes more frequently for small tables

Maybe it should be a historical average of what we've seen rather than what was there when the stats last ran.

That makes sense. We're already keeping the last 4-5 stats refreshes, so we could calculate the average over all of those refreshes.

To add another data point about how it's difficult to find the correct relative cost between operators: Here's an example where a full table scan is a better plan than doing a constrained scan + index join: https://github.com/cockroachdb/cockroach/issues/46677. This PR was an attempt to fix similar issues by making index joins more expensive: https://github.com/cockroachdb/cockroach/pull/54768.

rytaft commented 3 years ago

Another idea from for the problem of no stats right when a new cluster starts up: determine some crude stats after data is first loaded based on the sizes of the tables on disk. The import process should also know how many rows were created.

RaduBerinde commented 3 years ago

I have been playing around with various changes. Increasing the row count artificially is problematic - you have to figure out a way to do that to the distinct counts as well somehow, otherwise it doesn't work in many cases (e.g. it won't help with choosing lookup joins). A change like this causes many plan changes and seems pretty risky. I am also worried that it would lead to bad plans when tables are legitimately small.

A less promising change seems to be adding a small penalty for full table scans. It is enough to dissuade unnecessary full scans on small tables (e.g. for FKs). It still changes a few plans that will need to be checked.

rytaft commented 3 years ago

A change like this causes many plan changes and seems pretty risky.

This is what I was finding as well when I was working on the fix to #60493 (I initially tried to fix it by artificially inflating the row count).

I think increasing the cost of full scans could help.

I also want to explore the option of triggering more frequent stats refreshes for small tables, perhaps until there are at least 4 stats refreshes (which is how much history we keep around). I might have time to try that out next week..

RaduBerinde commented 3 years ago

61680 is the change that we wanted in 21.1 for this issue. Moving out of the 21.1 bucket.

cockroachdb / cockroach

opt, sql: improve automatic statistics for small tables #56615

61680 is the change that we wanted in 21.1 for this issue. Moving out of the 21.1 bucket.