cockroachdb / cockroach


opt: avoid choosing index with unreliable stats #64570

Open RaduBerinde opened 3 years ago

RaduBerinde commented 3 years ago

A customer ran into a case where they were doing a single row UPSERT. Instead of choosing the primary index (which would scan at most 1 row), the optimizer is choosing a secondary index. The index is chosen because according to the histogram, the relevant value would have no rows. But the stats are stale and the query actually reads through 100k+ rows from the index.

We have discussed augmenting the cost value with an "uncertainty range", which would address this problem (primary index has <=1 row with 100% certainty, the secondary index has expected 0 rows but with no upper bound). This would be a big change; but I believe we can also consider a more targeted fix, e.g. we could give a heavy cost "discount" to scan operators which have a known cardinality bound (or a penalty to scans with no cardinality bound).

Below is an illustration. The t_y_v_idx index could in principle return any number of rows.

> SET CLUSTER SETTING sql.stats.automatic_collection.enabled = false;
> create table t (x int, y int, v int, index (y, v), primary key (x,y));
> insert into t values (1, 1, 1), (2, 2, 2), (3, 3, 3);
> create statistics foo from t;
> explain upsert into t values (10, 10, 10);
                                             info
----------------------------------------------------------------------------------------------
  distribution: local
  vectorized: true

  • upsert
  │ into: t(x, y, v)
  │ auto commit
  │ arbiter indexes: primary
  │
  └── • cross join (left outer)
      │ estimated row count: 1
      │
      ├── • values
      │     size: 3 columns, 1 row
      │
      └── • filter
          │ estimated row count: 1
          │ filter: x = 10
          │
          └── • scan
                estimated row count: 0 (<0.01% of the table; stats collected 32 seconds ago)
                table: t@t_y_v_idx
                spans: [/10 - /10]

The correct plan would be:

demo@127.0.0.1:26257/defaultdb> explain upsert into t values (1, 1, 1);
                                         info
---------------------------------------------------------------------------------------
  distribution: local
  vectorized: true

  • upsert
  │ into: t(x, y, v)
  │ auto commit
  │ arbiter indexes: primary
  │
  └── • cross join (left outer)
      │ estimated row count: 1
      │
      ├── • values
      │     size: 3 columns, 1 row
      │
      └── • scan
            estimated row count: 1 (31% of the table; stats collected 15 seconds ago)
            table: t@primary
            spans: [/1/1 - /1/1]
            locking strength: for update

gz#9150

Jira issue: CRDB-7134

Jira issue: CRDB-13889

gz#16142

gz#18109

mgartner commented 3 years ago

Given that statistics are always out of date, and they are an estimate, would it be reasonable to put a limit on how low statistics can reduce an estimated row count? For example, what if statistics could lower the row count estimate to no less than 1; only contradictions are guaranteed to make a row count 0. In this case it might have to be a lower limit of 1 plus some epsilon, so that the cost doesn't match the primary index lookup.

mgartner commented 3 years ago

It would be nice to get to this in the 21.2 release. cc @kevin-v-ngo @awoods187 for visibility.

awoods187 commented 3 years ago

I'm pro doing this in 21.2 provided the opportunity cost isn't too high. Let's discuss the level of effort during the milestone planning meeting.

RaduBerinde commented 3 years ago

Saw another instance of this in the wild. The use case involved a "status" column and a partial index that restricts the status to "pending". The vast majority of rows have "done" status, with occasionally a few thousand rows showing as "pending". The automatic stats can happen to see zero "pending" rows or see a few thousand of them, depending on timing. When stats show zero rows, the partial index is estimated to be empty and becomes an enticing index for the optimizer. In this case, the reliable plan was to use the PK which guaranteed that we scan at most one row.
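For concreteness, here is a minimal sketch of that shape (table, column, and index names are invented for illustration, not the customer's schema):

CREATE TABLE tasks (
  id INT PRIMARY KEY,
  status STRING NOT NULL,
  created_at TIMESTAMP
);
CREATE INDEX pending_idx ON tasks (created_at) WHERE status = 'pending';

-- If stats happen to be collected at a moment when there are no 'pending'
-- rows, pending_idx is estimated to be empty, and a full scan of it (plus a
-- filter on id) can look cheaper than the primary key lookup for:
SELECT * FROM tasks WHERE id = 123 AND status = 'pending';
-- In reality the partial index may contain thousands of rows, while the PK
-- lookup on id is guaranteed to scan at most one row.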

rytaft commented 3 years ago

Unfortunately it seems like the fix with the cost penalty is not robust enough to handle some variations on this issue. For example, if the alternative plan has cardinality 10, we may not choose it.

This is going to require some more thought, so I'll reopen this issue for now.

mgartner commented 2 years ago

Just saw another variation on this. It's similar to the example from Radu above:

Saw another instance of this in the wild. The use case involved a "status" column and a partial index that restricts the status to "pending". The vast majority of rows have "done" status, with occasionally a few thousand rows showing as "pending". The automatic stats can happen to see zero "pending" rows or see a few thousand of them, depending on timing. When stats show zero rows, the partial index is estimated to be empty and becomes an enticing index for the optimizer. In this case, the reliable plan was to use the PK which guaranteed that we scan at most one row.

However, it's slightly different. There is an updated_at column and a FK-like column other_id. The query fetches recently updated rows with a specific other_id, e.g. WHERE other_id = 1234 AND updated_at > now() - '1 hour'. There's a 12-bucket hash sharded index on updated_at and a regular secondary index on other_id.

The histograms for the updated_at column do not include the most recently updated values, so the optimizer prefers to scan the index on updated_at. It estimates that it will only scan 6 rows, but in reality it scans over 100k.

The better plan would be to scan the index on other_id which is estimated to scan only about 150 rows, and in reality only scans 2 rows.

Unfortunately, cardinality cannot help us here because other_id is not unique.
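A rough sketch of the schema and query shape (names are made up, the real case has more columns, and the hash-sharded index syntax here is the current storage-parameter form):

CREATE TABLE events (
  id INT PRIMARY KEY,
  other_id INT NOT NULL,
  updated_at TIMESTAMP NOT NULL,
  INDEX events_other_id_idx (other_id)
);
CREATE INDEX events_updated_at_idx ON events (updated_at)
  USING HASH WITH (bucket_count = 12);

-- The histogram on updated_at ends before the most recent values, so the
-- optimizer estimates only a handful of rows for the range predicate and
-- picks events_updated_at_idx, when in reality it reads 100k+ rows:
SELECT * FROM events
WHERE other_id = 1234 AND updated_at > now() - INTERVAL '1 hour';
-- Scanning events_other_id_idx (estimated ~150 rows, actually 2) would be
-- the better plan, but cardinality alone can't prove that, since other_id
-- is not unique.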

I've run into problems like this running Postgres in the past. Our solution was to make automated stats collections much more aggressive. If a table gets very large, automatic stats collection is very unlikely to run if it is triggered by some % of rows being mutated.

Imagine a table that gets 100k inserts per day. It's been around for 1000 days so it now has 100m rows. With our default sql.stats.automatic_collection.fraction_stale_rows at 0.2, 25m rows need to be inserted to trigger automatic stats (25m stale rows / 125m total rows = 0.2), which won't be for 250 days. That's 25m rows and 250 days worth of values that are missing from histograms.

To make automatic stats much more aggressive, you can set sql.stats.automatic_collection.fraction_stale_rows to 0 and set sql.stats.automatic_collection.min_stale_rows to some number, say 10k. This ensures that no matter how big the table is, you're guaranteed to collect stats for the most recent rows every so often.
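For example, using the 10k threshold mentioned above:

SET CLUSTER SETTING sql.stats.automatic_collection.fraction_stale_rows = 0;
SET CLUSTER SETTING sql.stats.automatic_collection.min_stale_rows = 10000;
-- With these values, a stats refresh is triggered after roughly 10k mutated
-- rows regardless of the table's total size.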

Postgres has the ability to adjust these stats knobs at the table level. I don't believe we have that ability yet, but it would be useful for this; a user needs the ability to tune these knobs on a per-table basis depending on the workload.

Here's how to tune auto-stats collection in Postgres for a specific table:

ALTER TABLE t SET (autovacuum_analyze_scale_factor = 0);
ALTER TABLE t SET (autovacuum_analyze_threshold = 10000);

mgartner commented 2 years ago

Another idea from Radu is to ignore histograms when building statistics if they show a uniform distribution. Histograms are most helpful if values are skewed. When they are not skewed, using distinct counts, null counts, and row counts might be just as good or better, and it could mitigate the case I last mentioned.
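As a rough way to eyeball whether a histogram carries any skew information at all, you can inspect it directly (a sketch, assuming a table named t; the histogram_id literal below is a placeholder taken from the SHOW STATISTICS output):

SHOW STATISTICS FOR TABLE t;
-- note the histogram_id for the column of interest, then:
SHOW HISTOGRAM 123456789;
-- buckets with roughly equal row counts suggest a near-uniform distribution,
-- where distinct, null, and row counts alone would do about as well.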

mgartner commented 2 years ago

I think the most robust solution would be to monitor a sample of updates and incrementally update histograms to account for newly added values since the last stats collection. I found a paper on the topic: Fast Incremental Maintenance of Approximate Histograms.

The only approach used to date for histogram updates, which is followed in nearly all commercial systems, is to recompute histograms periodically (e.g., every night). This approach has two disadvantages. First, any significant updates to the data between two recomputations could cause poor estimations in the optimizer. Second, since the histograms are recomputed from scratch by discarding the old histograms, the recomputation phase for the entire database is computationally very intensive.

lin-crl commented 2 years ago

To extend @mgartner's idea, I wonder if we could fit the histograms to a data distribution (normal, or if we go the extra mile, logistic/Poisson, for example) and from there use the distribution's parameters (mean/std for a normal distribution) to estimate the row count? There are classical statistical methods we could borrow to fit a distribution.

mgartner commented 2 years ago

We already have an issue to track per-table automatic statistics collection: https://github.com/cockroachdb/cockroach/issues/40989

mgartner commented 2 years ago

We ran into this with another customer, very similar to the example I posted above.

One idea from the customer was to introduce a "max age" setting for statistics which would guarantee that stats collection would run once the previous stats hit this max age, ignoring the other settings sql.stats.automatic_collection.fraction_stale_rows and sql.stats.automatic_collection.min_stale_rows.

cc @vy-ton for visibility

bladefist commented 2 years ago

As the customer mentioned in the original post, max age would be good. Also having a command to force refresh the automatic statistics (without having to make new ones) would be great too. Other databases have this as well.

mgartner commented 2 years ago

Also having a command to force refresh the automatic statistics (without having to make new ones) would be great too. Other databases have this as well.

This is possible with ANALYZE <table>;. Or is the behavior different from what you are describing?

bladefist commented 2 years ago

@mgartner Didn't know about this. From the documentation I am not clear what this is doing. Is this updating statistics or creating new ones?

Also, I ran it and it failed.

batch timestamp 1628283892.315459871,0 must be after replica GC threshold 1628283892.326056946,0

RaduBerinde commented 2 years ago

@bladefist there is no difference. We always create new statistics; there is no separate mechanism to update existing statistics.

Do you have a very low TTL on that table? ANALYZE by default reads as of a timestamp 30 seconds in the past (I believe). The smallest supported TTL is on the order of minutes.

You can also run CREATE STATISTICS directly; the end result is the same.
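For reference, a sketch of both forms (assuming a table named t; the statistics name is arbitrary). CREATE STATISTICS also lets you choose the read timestamp, which can help on tables with a very low GC TTL:

ANALYZE t;
CREATE STATISTICS manual_stats FROM t;
-- On a low-TTL table, a more recent AS OF timestamp avoids reading below the
-- GC threshold:
CREATE STATISTICS manual_stats FROM t AS OF SYSTEM TIME '-1s';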

bladefist commented 2 years ago

900 seconds. I'll increase it. I'm guessing creating statistics is taking longer than the TTL of the table.

RaduBerinde commented 2 years ago

What version are you running? A while ago we made it so that the read timestamp increases if it takes a long time.

bladefist commented 2 years ago

@RaduBerinde 21.1.6

RaduBerinde commented 2 years ago

Can you file a separate issue for that error? I think we may only be using that mechanism when stats are triggered automatically, but we should do it for ANALYZE too.

bladefist commented 2 years ago

@RaduBerinde #68590 and #68591

rail commented 2 years ago

Manually synced with Jira

mgartner commented 2 years ago

I've created https://github.com/cockroachdb/cockroach/issues/83431 specifically for the ascending key problem. There are other problems described in this issue that won't necessarily be fixed by fixing the ascending key problem, like this one.

michae2 commented 2 years ago

https://github.com/cockroachdb/cockroach/issues/84461 is in a similar vein.

michae2 commented 1 year ago

Consider the hypothesis that unreliable stats usually produce estimated row counts of exactly zero. If this were true, simply treating zero-row estimates as a special case might improve things. For example, if we got a zero-row estimate using histograms, then we could recalculate the estimate with histograms disabled (as mentioned above).

So tonight I attempted to use telemetry data to prove or disprove this hypothesis. Using a rough criterion for query-executions-with-unreliable-stats, I grouped all such executions from the past month by log-base-2 of estimated rows:

estimated rows    | percent of query-executions-with-unreliable-stats | percent of distinct-queries-with-unreliable-stats
------------------+----------------------------------------------------+---------------------------------------------------
0                 | 36.8%                                              |  7.4%
1                 |  1.9%                                              |  3.9%
2 to 3            |  4.6%                                              |  1.5%
4 to 7            |  4.0%                                              |  1.9%
8 to 15           |  0.8%                                              |  6.4%
16 to 31          |  0.9%                                              |  1.9%
32 to 63          | 13.9%                                              |  2.1%
64 to 127         | 17.7%                                              |  3.9%
128 to 255        |  1.3%                                              |  4.0%
256 to 511        |  1.1%                                              |  5.4%
512 to 1023       |  2.4%                                              |  8.4%
1024 to 2047      |  2.2%                                              | 12.1%
2048 to 4095      |  1.5%                                              | 11.2%
4096 to 8191      |  1.8%                                              | 10.0%
8192 to 16383     |  1.4%                                              | 12.0%
16384 to 32767    |  1.1%                                              | 17.2%
32768 to 65535    |  0.9%                                              | 15.6%
65536 to 131071   |  2.7%                                              |  9.0%
131072 to 262143  |  0.5%                                              |  4.0%

(rest of results omitted)

An estimate of zero rows is a strong signal, but special casing this alone won't fix the general problem.

rytaft commented 1 year ago

Nice analysis! Out of curiosity, what was the definition you used for "unreliable" stats?

michae2 commented 1 year ago

Nice analysis! Out of curiosity, what was the definition you used for "unreliable" stats?

I was trying to copy https://github.com/cockroachdb/cockroach/blob/6b6a8fb68fcdb008e5be095d5aa82d9462eaf6a1/pkg/sql/opt/exec/explain/emit.go#L473-L476

so the conditions I used were:

...
AND STATS_AVAILABLE
AND TOTAL_SCAN_ROWS_ESTIMATE IS NOT NULL
AND ROWS_READ IS NOT NULL
AND ROWS_READ > 2048
AND ((ROWS_READ * 2 + 100) < TOTAL_SCAN_ROWS_ESTIMATE OR (TOTAL_SCAN_ROWS_ESTIMATE * 2 + 100) < ROWS_READ)
...

mgartner commented 1 year ago

I just saw another case of this which is very similar to the one described in my comment above. The gist of it is:

CREATE TABLE t (
  k INT PRIMARY KEY,
  a STRING,
  b STRING,
  c TIMESTAMP,
  UNIQUE INDEX (a, b),
  INDEX (a, c)
)

SELECT * FROM t WHERE a = 'value outside max value in histogram' AND b = 'foo';

There are many rows in the table where a = 'value outside max value in histogram', but the histograms don't reflect that because they are not up-to-date.

The optimizer considers two plans. A fast one:

index-join t
 ├── columns: k a b c
 └── scan t@t_a_b_idx
      ├── columns: k a b
      ├── constraint: /a/b: [/'value outside max value in histogram'/'foo' - /'value outside max value in histogram'/'foo']
      └── stats: [rows=1]

And a slow one:

select
 ├── columns: k a b c
 ├── index-join t
 │    ├── columns: k a b c
 │    └── scan t@t_a_c_idx
 │         ├── columns: k a c
 │         ├── constraint: /a/c: [/'value outside max value in histogram' - /'value outside max value in histogram']
 │         └── stats: [rows=0.147284]
 └── filters
      └── b = 'foo'

The index-joins dominate the cost for these plans, so the slow one is picked because performing an index join on 0.147284 rows is cheaper than performing an index join on 1 row.
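(For anyone trying to reproduce this shape: the expression trees above are the kind of output you get from something like EXPLAIN (OPT, VERBOSE), which shows per-expression stats and cost. A sketch, using the illustrative schema above:)

EXPLAIN (OPT, VERBOSE)
  SELECT * FROM t
  WHERE a = 'value outside max value in histogram' AND b = 'foo';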

mgartner commented 1 year ago

It seems like we run into more trouble when we underestimate the row count than when we overestimate it.

Shouldn't it be safer to assume that filtering by a value outside of the histogram range produces some rows rather than zero rows? What if we consider the ranges (-infinity - min) and (max - infinity) to be "buckets" that have the same distinct_range and num_range as the other buckets in the histogram? I believe that would cover a lot of these cases in this ticket.

mgartner commented 1 year ago

I've noticed that the problem is more likely to present when there are many columns or columns with a high avg_size. The cost of an index join is heavily swayed by the sum of the lookup column avg_sizes. Perhaps this is a clue that we need to lessen the impact of avg_size on costing?

mgartner commented 1 year ago

Shouldn't it be safer to assume that filtering by a value outside of the histogram range produces some rows rather than zero rows? What if we consider the ranges (-infinity - min) and (max - infinity) to be "buckets" that have the same distinct_range and num_range as the other buckets in the histogram? I believe that would cover a lot of these cases in this ticket.

Here's a POC that shows this idea: https://github.com/cockroachdb/cockroach/pull/97999

A distinct advantage of this over something like incremental stats collection is that it will always work. Incremental stats can be collected more frequently, but they'll still be stale some of the time, and slow query plans of this form will still be possible. This approach is also far simpler to implement.

So what are the downsides? In theory, over-estimating the number of matching rows outside of the histogram bounds could lead to slower query plans in some cases. I need to spend more time trying to concoct an example that proves such a regression is possible. I'm (maybe over-optimistically) doubtful that this will actually happen in practice. Like I mentioned above, it seems that under-estimating row counts is a far greater problem than over-estimating them.

lin-crl commented 1 year ago

Hi, how do we estimate row counts for auto-incrementing columns, like an INT that is part of the PK? For example, the PK is (col1 INT, col2 INT). I just opened a support case on behalf of a customer, and support may open an escalation later. Since col1 is auto-incrementing, a new value in col1 will always be outside col1's histogram bounds. I wonder if we should use average row counts for col1 instead of 0 in this case. Thank you!

michae2 commented 1 year ago

Hi, how do we estimate row counts for auto-incrementing columns, like an INT that is part of the PK? For example, the PK is (col1 INT, col2 INT). I just opened a support case on behalf of a customer, and support may open an escalation later. Since col1 is auto-incrementing, a new value in col1 will always be outside col1's histogram bounds. I wonder if we should use average row counts for col1 instead of 0 in this case. Thank you!

To help with auto-increment columns, and other columns with predictable histogram changes, we added statistics forecasting in v22.2 (as described in #79872).

(If the problem is on v22.1, one way to check whether forecasting will help is to capture a statement bundle, and then use cockroach debug statement-bundle recreate with a v22.2 binary, and EXPLAIN the query again to see if forecasting is used.)
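A quick way to sanity-check forecasting on v22.2+ (a sketch; assumes a table named t and an arbitrary example query):

-- forecasting is controlled by this cluster setting (on by default):
SHOW CLUSTER SETTING sql.stats.forecasts.enabled;
-- forecasts can be inspected alongside the collected statistics:
SHOW STATISTICS FOR TABLE t WITH FORECAST;
-- and the EXPLAIN output should indicate when a forecast was used for the estimate:
EXPLAIN SELECT * FROM t WHERE k > 100;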

lin-crl commented 1 year ago

This is great to hear! @michae2 we'll validate it

vbabenkoru commented 1 year ago

@michae2 Unfortunately, statistics forecasting doesn't work for this if one of the columns doesn't increase consistently.

lin-crl commented 1 year ago

@vbabenkoru Could you share a bit more on how the column is used? Perhaps on slack channel so we can discuss specifics on your use case?

vbabenkoru commented 1 year ago

I believe @mgartner's proposal would work in our case, since forecasting isn't always possible or reliable, and the default behavior of assuming 0 rows whenever a new value is encountered is the main issue here.

mgartner commented 1 year ago

@vbabenkoru You mean the proposal here, correct? I have a prototype of it that I'm hoping to find time to finish soon.

itsbilal commented 2 months ago

Ran into this on the DRT drt-chaos cluster, so adding O-testcluster. Internal conversation

ajstorm commented 2 months ago

Hit it on drt-large too, which resulted in this duplicate issue.