Open RaduBerinde opened 3 years ago
Given that statistics are always out of date, and they are an estimate, would it be reasonable to put a limit on how low statistics can reduce an estimated row count? For example, what if statistics could lower the row count estimate to no less than 1; only contradictions are guaranteed to make a row count 0. In this case it might have to be a lower limit of 1 plus some epsilon, so that the cost doesn't match the primary index lookup.
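The proposed floor could look something like the following sketch. This is a hypothetical illustration, not the optimizer's actual code; the helper name `clamp_row_estimate` and the epsilon value are made up for this example.

```python
# Hypothetical sketch of the proposed lower bound on estimated row counts.
# Only a contradiction (a provably empty result) may produce an estimate of 0;
# otherwise the estimate is floored at 1 plus a small epsilon so the plan can
# never cost lower than a guaranteed single-row primary index lookup.

EPSILON = 1e-5  # illustrative value, not from the source


def clamp_row_estimate(estimated_rows, is_contradiction):
    if is_contradiction:
        return 0.0
    return max(estimated_rows, 1.0 + EPSILON)
```

With this floor, stale stats can still shrink an estimate dramatically, but never all the way to a value that ties or beats a cardinality-guaranteed plan.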
It would be nice to get to this in the 21.2 release. cc @kevin-v-ngo @awoods187 for visibility.
I'm pro doing this in 21.2 provided the opportunity cost isn't too high. Let's discuss the level of effort during the milestone planning meeting.
Saw another instance of this in the wild. The use case involved a "status" column and a partial index that restricts the status to "pending". The vast majority of rows have "done" status, with occasionally a few thousand rows showing as "pending". The automatic stats can happen to see zero "pending" rows or see a few thousand of them, depending on timing. When stats show zero rows, the partial index is estimated to be empty and becomes an enticing index for the optimizer. In this case, the reliable plan was to use the PK which guaranteed that we scan at most one row.
Unfortunately it seems like the fix with the cost penalty is not robust enough to handle some variations on this issue. For example, if the alternative plan has cardinality 10, we may not choose it.
This is going to require some more thought, so I'll reopen this issue for now.
Just saw another variation on this. It's similar to the example from Radu above:
> Saw another instance of this in the wild. The use case involved a "status" column and a partial index that restricts the status to "pending". The vast majority of rows have "done" status, with occasionally a few thousand rows showing as "pending". The automatic stats can happen to see zero "pending" rows or see a few thousand of them, depending on timing. When stats show zero rows, the partial index is estimated to be empty and becomes an enticing index for the optimizer. In this case, the reliable plan was to use the PK which guaranteed that we scan at most one row.
However, it's slightly different. There is an `updated_at` column and a FK-like column `other_id`. The query fetches recently updated rows with a specific `other_id`, e.g. `WHERE other_id = 1234 AND updated_at > now() - '1 hour'`. There's a 12-bucket hash-sharded index on `updated_at` and a regular secondary index on `other_id`.
The histograms for the `updated_at` column do not include the most recently updated values, so the optimizer prefers to scan the index on `updated_at`. It estimates that it will scan only 6 rows, but in reality it scans over 100k.
The better plan would be to scan the index on `other_id`, which is estimated to scan only about 150 rows and in reality scans only 2 rows. Unfortunately, cardinality cannot help us here because `other_id` is not unique.
I've run into problems like this running Postgres in the past. Our solution was to make automated stats collections much more aggressive. If a table gets very large, automatic stats collection is very unlikely to run if it is triggered by some % of rows being mutated.
Imagine a table that gets 100k inserts per day. It's been around for 1,000 days, so it now has 100m rows. With our default `sql.stats.automatic_collection.fraction_stale_rows` at 0.2, 25m rows need to be inserted to trigger automatic stats (25m stale rows / 125m total rows = 0.2), which won't happen for another 250 days. That's 25m rows and 250 days' worth of values missing from histograms.
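The arithmetic above can be reproduced with a short calculation. The helper name `days_until_auto_stats` is hypothetical, made up for this illustration:

```python
# Reproducing the staleness arithmetic: with fraction_stale_rows = f, a table
# with N rows needs s stale rows such that s / (N + s) = f, i.e.
# s = f * N / (1 - f). At a fixed insert rate, that translates to days.

def days_until_auto_stats(current_rows, inserts_per_day, fraction_stale):
    stale_needed = fraction_stale * current_rows / (1 - fraction_stale)
    return stale_needed / inserts_per_day


# 100m rows, 100k inserts/day, fraction_stale_rows = 0.2 -> 250 days.
print(days_until_auto_stats(100_000_000, 100_000, 0.2))
```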
To make automatic stats much more aggressive, you can set `sql.stats.automatic_collection.fraction_stale_rows` to 0 and set `sql.stats.automatic_collection.min_stale_rows` to some number, say 10k. This ensures that no matter how big the table is, you're guaranteed to collect stats for the most recent rows every so often.
Postgres has the ability to adjust these stats knobs at the table level. I don't believe we have that ability yet, but it would be useful here; a user needs to be able to tune these knobs per table based on the workload.
Here's how to tune auto-stats collection in Postgres for a specific table:
```sql
ALTER TABLE t SET (autovacuum_analyze_scale_factor = 0);
ALTER TABLE t SET (autovacuum_analyze_threshold = 10000);
```
Another idea from Radu is to ignore histograms when building statistics if they show a uniform distribution. Histograms are most helpful if values are skewed. When they are not skewed, using distinct counts, null counts, and row counts might be just as good or better, and it could mitigate the case I last mentioned.
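A rough sketch of that idea follows. The uniformity test, its tolerance, and the function names are all hypothetical, chosen just to illustrate the fallback:

```python
# Hypothetical sketch: if the histogram buckets are roughly uniform, ignore
# the histogram and estimate an equality predicate from distinct counts
# instead (selectivity ~= 1 / distinct_count).

def is_roughly_uniform(bucket_counts, tolerance=0.25):
    mean = sum(bucket_counts) / len(bucket_counts)
    if mean == 0:
        return True
    return all(abs(count - mean) <= tolerance * mean for count in bucket_counts)


def equality_estimate(row_count, distinct_count, bucket_counts, histogram_estimate):
    if is_roughly_uniform(bucket_counts):
        # Uniform histogram adds no information; average frequency is as good.
        return row_count / distinct_count
    return histogram_estimate
```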
I think the most robust solution would be to monitor a sample of updates and incrementally update histograms to account for newly added values since the last stats collection. I found a paper on the topic: Fast Incremental Maintenance of Approximate Histograms.
> The only approach used to date for histogram updates, which is followed in nearly all commercial systems, is to recompute histograms periodically (e.g., every night). This approach has two disadvantages. First, any significant updates to the data between two recomputations could cause poor estimations in the optimizer. Second, since the histograms are recomputed from scratch by discarding the old histograms, the recomputation phase for the entire database is computationally very intensive.
To extend @mgartner's idea, I wonder if we could fit the histogram to a data distribution (normal, or if we go the extra mile, logistic/Poisson for example) and then use the fitted parameters (mean/std for a normal distribution) to estimate the row count. There are classical statistical methods we could borrow to fit a distribution.
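For the normal case, the estimate would come from the fitted CDF rather than from histogram buckets. The sketch below is hypothetical; it only shows how a fitted mean/std would translate into a row count for a range predicate:

```python
import math

# Hypothetical sketch: given a normal distribution fitted to a column
# (mean, std), estimate how many of row_count rows satisfy lo < x < hi
# using the normal CDF (expressed via the error function).

def normal_cdf(x, mean, std):
    return 0.5 * (1 + math.erf((x - mean) / (std * math.sqrt(2))))


def estimated_rows_in_range(row_count, mean, std, lo, hi):
    return row_count * (normal_cdf(hi, mean, std) - normal_cdf(lo, mean, std))
```

For example, a range covering one standard deviation on each side of the mean should capture roughly 68% of the rows.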
We already have an issue to track per-table automatic statistics collection: https://github.com/cockroachdb/cockroach/issues/40989
We ran into this with another customer, very similar to the example I posted above.
One idea from the customer was to introduce a "max age" setting for statistics, which would guarantee that stats run once the previous stats hit this max age, ignoring the other settings `sql.stats.automatic_collection.fraction_stale_rows` and `sql.stats.automatic_collection.min_stale_rows`.
cc @vy-ton for visibility
As the customer mentioned in the original post, max age would be good. Also having a command to force refresh the automatic statistics (without having to make new ones) would be great too. Other databases have this as well.
> Also having a command to force refresh the automatic statistics (without having to make new ones) would be great too. Other databases have this as well.

This is possible with `ANALYZE table;`. Or is the behavior different from what you are describing?
@mgartner Didn't know about this. From the documentation, it's not clear to me what this does. Is it updating statistics or creating new ones?
Also, I ran it and it failed.
```
batch timestamp 1628283892.315459871,0 must be after replica GC threshold 1628283892.326056946,0
```
@bladefist there is no difference. We always create new statistics, there is no separate mechanism to update existing statistics.
Do you have a very low TTL on that table? ANALYZE by default reads as of a timestamp 30 seconds in the past (I believe). The smallest supported TTL is on the order of minutes.
You can also run CREATE STATISTICS directly, the end result is the same.
900 seconds. I'll increase it. I'm guessing creating statistics is taking longer than the TTL of the table.
What version are you running? A while ago we made it so that the read timestamp increases if it takes a long time.
@RaduBerinde 21.1.6
Can you file a separate issue for that error? I think we may only be using that mechanism when stats are triggered automatically, but we should do it for ANALYZE too.
@RaduBerinde #68590 and #68591
Manually synced with Jira
I've created https://github.com/cockroachdb/cockroach/issues/83431 specifically for the ascending key problem. There are other problems described in this issue that won't necessarily be fixed by fixing the ascending key problem, like this one.
https://github.com/cockroachdb/cockroach/issues/84461 is in a similar vein.
Consider the hypothesis that unreliable stats usually produce estimated row counts of exactly zero. If this were true, simply treating zero-row estimates as a special case might improve things. For example, if we got a zero-row estimate using histograms, then we could recalculate the estimate with histograms disabled (as mentioned above).
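The special case described could be sketched as follows. This is a hypothetical illustration of the fallback, not existing optimizer code:

```python
# Hypothetical sketch: when the histogram-based estimate is exactly zero but
# the predicate is not a contradiction, fall back to a histogram-free estimate
# based on row and distinct counts (assume an average-frequency value).

def row_estimate(histogram_estimate, row_count, distinct_count,
                 is_contradiction=False):
    if is_contradiction:
        return 0.0  # only contradictions may legitimately estimate zero rows
    if histogram_estimate == 0.0:
        return row_count / distinct_count
    return histogram_estimate
```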
So tonight I attempted to use telemetry data to prove or disprove this hypothesis. Using a rough criterion for query-executions-with-unreliable-stats, I grouped all such executions from the past month by log-base-2 of estimated rows:
estimated rows | percent of query-executions-with-unreliable-stats | percent of distinct-queries-with-unreliable-stats |
---|---|---|
0 | 36.8% | 7.4% |
1 | 1.9% | 3.9% |
2 to 3 | 4.6% | 1.5% |
4 to 7 | 4.0% | 1.9% |
8 to 15 | 0.8% | 6.4% |
16 to 31 | 0.9% | 1.9% |
32 to 63 | 13.9% | 2.1% |
64 to 127 | 17.7% | 3.9% |
128 to 255 | 1.3% | 4.0% |
256 to 511 | 1.1% | 5.4% |
512 to 1023 | 2.4% | 8.4% |
1024 to 2047 | 2.2% | 12.1% |
2048 to 4095 | 1.5% | 11.2% |
4096 to 8191 | 1.8% | 10.0% |
8192 to 16383 | 1.4% | 12.0% |
16384 to 32767 | 1.1% | 17.2% |
32768 to 65535 | 0.9% | 15.6% |
65536 to 131071 | 2.7% | 9.0% |
131072 to 262143 | 0.5% | 4.0% |
(rest of results omitted)
An estimate of zero rows is a strong signal, but special casing this alone won't fix the general problem.
Nice analysis! Out of curiosity, what was the definition you used for "unreliable" stats?
> Nice analysis! Out of curiosity, what was the definition you used for "unreliable" stats?
I was trying to copy https://github.com/cockroachdb/cockroach/blob/6b6a8fb68fcdb008e5be095d5aa82d9462eaf6a1/pkg/sql/opt/exec/explain/emit.go#L473-L476 so the conditions I used were:

```sql
...
AND STATS_AVAILABLE
AND TOTAL_SCAN_ROWS_ESTIMATE IS NOT NULL
AND ROWS_READ IS NOT NULL
AND ROWS_READ > 2048
AND ((ROWS_READ * 2 + 100) < TOTAL_SCAN_ROWS_ESTIMATE OR (TOTAL_SCAN_ROWS_ESTIMATE * 2 + 100) < ROWS_READ)
...
```
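Restated as a standalone predicate (a sketch mirroring those SQL conditions; the function name is made up):

```python
# Stats are treated as "unreliable" when the scan estimate and the actual rows
# read differ by more than 2x plus a fixed slop of 100 rows, and the scan is
# large enough to matter (> 2048 rows read).

def stats_unreliable(rows_read, estimate):
    if rows_read <= 2048:
        return False
    return (rows_read * 2 + 100) < estimate or (estimate * 2 + 100) < rows_read
```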
I just saw another case of this which is very similar to the one described in my comment above. The gist of it is:
```sql
CREATE TABLE t (
  k INT PRIMARY KEY,
  a STRING,
  b STRING,
  c TIMESTAMP,
  UNIQUE INDEX (a, b),
  INDEX (a, c)
);

SELECT * FROM t WHERE a = 'value outside max value in histogram' AND b = 'foo';
```
There are many rows in the table where `a = 'value outside max value in histogram'`, but the histograms don't reflect that because they are not up to date.
The optimizer considers two plans. A fast one:
```
index-join t
 ├── columns: k a b c
 └── scan order@t_a_b_idx
      ├── columns: k a b
      ├── constraint: /a/b: [/'value outside max value in histogram'/'foo' - /'value outside max value in histogram'/'foo']
      └── stats: [rows=1]
```
And a slow one:
```
select
 ├── columns: k a b c
 ├── index-join t
 │    ├── columns: k a b c
 │    └── scan order@t_a_c_idx
 │         ├── columns: k a c
 │         ├── constraint: /a/c: [/'value outside max value in histogram' - /'value outside max value in histogram']
 │         └── stats: [rows=0.147284]
 └── filters
      └── b = 'foo'
```
The index joins dominate the cost of these plans, so the slow one is picked because performing an index join on 0.147284 rows is costed cheaper than performing an index join on 1 row.
It seems like we run into more trouble when we underestimate the row count than when we overestimate it.
Shouldn't it be safer to assume that filtering by a value outside of the histogram range produces some rows rather than zero rows? What if we consider the ranges `(-infinity, min)` and `(max, +infinity)` to be "buckets" that have the same `distinct_range` and `num_range` as the other buckets in the histogram? I believe that would cover a lot of the cases in this ticket.
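The synthetic outer buckets could be derived from the real ones as in this sketch. The averaging scheme and function names here are hypothetical, chosen only to make the proposal concrete:

```python
# Hypothetical sketch: treat values outside the histogram bounds as falling
# into synthetic buckets whose num_range / distinct_range equal the averages
# of the real buckets, instead of estimating zero rows.

def out_of_range_rows(bucket_num_ranges):
    # average rows per existing bucket, used for each synthetic outer bucket
    return sum(bucket_num_ranges) / len(bucket_num_ranges)


def equality_estimate_outside_bounds(bucket_num_ranges, bucket_distinct_ranges):
    # rows attributed to one distinct value in a synthetic outer bucket
    avg_rows = out_of_range_rows(bucket_num_ranges)
    avg_distinct = sum(bucket_distinct_ranges) / len(bucket_distinct_ranges)
    return avg_rows / max(avg_distinct, 1.0)
```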
I've noticed that the problem is more likely to appear when there are many columns or columns with a high `avg_size`. The cost of an index join is heavily swayed by the sum of the lookup columns' `avg_size`s. Perhaps this is a clue that we need to lessen the impact of `avg_size` on costing?
> Shouldn't it be safer to assume that filtering by a value outside of the histogram range produces some rows rather than zero rows? What if we consider the ranges (-infinity - min) and (max - infinity) to be "buckets" that have the same distinct_range and num_range as the other buckets in the histogram? I believe that would cover a lot of these cases in this ticket.
Here's a POC that shows this idea: https://github.com/cockroachdb/cockroach/pull/97999
A distinct advantage of this over something like incremental stats collection is that it will always work. Incremental stats can be collected more frequently, but they'll still be stale some of the time, so slow query plans of this form remain possible. This approach is also far simpler to implement.
So what are the downsides? In theory, over-estimating the number of matching rows outside of the histogram bounds could lead to slower query plans in some cases. I need to spend more time trying to concoct an example that proves such a regression is possible. I'm (maybe over-optimistically) doubtful that this will actually happen in practice. Like I mentioned above, it seems that under-estimating row counts is a far greater problem than over-estimating them.
Hi, how do we estimate row counts for auto-incrementing columns, like an INT that is part of the PK? For example, the PK is (col1 INT, col2 INT). I just opened a support case on behalf of a customer, and support may open an escalation later. Since col1 is auto-incrementing, a new value in col1 will always be outside the col1 histogram bounds. I wonder if we should use an average row count for col1 instead of 0 in this case. Thank you!
> Hi how do we estimate row counts for auto-incremental columns like int as part of PK? For example PK is (col1 int, col2, int). I just opened a support case on behalf of customer and support may open an escalation later. Since col1 is auto-incremental, a new value in col1 will always be out of col1 boundary. I wonder if we should use average row counts for col1, instead of 0 in this case. Thank you!
To help with auto-increment columns, and other columns with predictable histogram changes, we added statistics forecasting in v22.2 (as described in #79872).
(If the problem is on v22.1, one way to check whether forecasting will help is to capture a statement bundle, then use `cockroach debug statement-bundle recreate` with a v22.2 binary and `EXPLAIN` the query again to see if forecasting is used.)
This is great to hear! @michae2 we'll validate it
@michae2 Statistics forecasting doesn't work for this unfortunately, if one of the columns doesn't increase consistently.
@vbabenkoru Could you share a bit more on how the column is used? Perhaps on slack channel so we can discuss specifics on your use case?
@mgartner's proposal would work in our case I believe, since forecasting isn't always possible or reliable, and the default behavior of assuming 0 rows whenever it encounters a new value is the main issue here.
@vbabenkoru You mean the proposal here, correct? I have a prototype of it that I'm hoping to find time to finish soon.
Ran into this on the DRT `drt-chaos` cluster, so adding `O-testcluster`. Internal conversation
A customer ran into a case where they were doing a single row UPSERT. Instead of choosing the primary index (which would scan at most 1 row), the optimizer is choosing a secondary index. The index is chosen because according to the histogram, the relevant value would have no rows. But the stats are stale and the query actually reads through 100k+ rows from the index.
We have discussed augmenting the cost value with an "uncertainty range", which would address this problem (primary index has <=1 row with 100% certainty, the secondary index has expected 0 rows but with no upper bound). This would be a big change; but I believe we can also consider a more targeted fix, e.g. we could give a heavy cost "discount" to scan operators which have a known cardinality bound (or a penalty to scans with no cardinality bound).
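The targeted fix could be sketched like this. The penalty multiplier and cost model here are purely illustrative, not the optimizer's real cost function:

```python
# Hypothetical sketch: penalize scans with no hard cardinality bound so that a
# primary-index plan guaranteed to read <= 1 row beats a secondary-index plan
# whose estimate is ~0 rows but whose true row count is unbounded.

UNBOUNDED_SCAN_PENALTY = 4.0  # illustrative multiplier, not from the source


def scan_cost(estimated_rows, per_row_cost, has_cardinality_bound):
    base = estimated_rows * per_row_cost + 1.0  # +1.0 fixed startup cost
    if not has_cardinality_bound:
        base *= UNBOUNDED_SCAN_PENALTY
    return base


# PK lookup: at most 1 row, bounded. Secondary index: estimated 0 rows, but
# no cardinality bound, so the penalty makes it lose despite the lower estimate.
pk_cost = scan_cost(1.0, 1.0, True)
secondary_cost = scan_cost(0.0, 1.0, False)
```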
Below is an illustration. The t_y_v_idx index could in principle return any number of rows.
The correct plan would be:
gz#9150
Jira issue: CRDB-7134
Jira issue: CRDB-13889
gz#16142
gz#18109