opt: limit hint row estimate can be wildly wrong if values are not evenly distributed #84461

Open michae2 opened 2 years ago

michae2 commented 2 years ago

As of 22.1, the optimizer's limit hint calculation for selects assumes that the selected rows are evenly distributed amongst the input rows: https://github.com/cockroachdb/cockroach/blob/d20fd55bab4c3ff08336831d42c457a36a1acdec/pkg/sql/opt/xform/physical_props.go#L164-L168

This is the best we can do with the information we have, but it can be wildly wrong if the selected rows are not evenly distributed. If the selected rows come at the end of the input, then we could read many more rows than the limit hint suggests. We've seen a few cases in the wild of scans reading 15x to 20x more rows than the hint predicted.
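For intuition, here is a minimal, self-contained sketch of the even-distribution assumption (illustrative only, not the actual code linked above): if a filter is expected to keep a fraction `selectivity` of its input and its parent only needs `limitHint` rows, the input is asked for roughly `limitHint / selectivity` rows.

```go
package main

import (
	"fmt"
	"math"
)

// inputLimitHint mirrors the even-distribution assumption: if a filter keeps
// fraction `selectivity` of its input and the parent only needs `limitHint`
// rows, then (assuming matches are spread uniformly) we expect to read about
// limitHint / selectivity input rows.
func inputLimitHint(limitHint, selectivity float64) float64 {
	if selectivity <= 0 {
		return math.Inf(1)
	}
	return math.Ceil(limitHint / selectivity)
}

func main() {
	// Numbers from the demonstration below: 100,000 rows, 10,000 of them with
	// b = true (selectivity 0.1), LIMIT 20.
	fmt.Println(inputLimitHint(20, 0.1)) // 200 rows expected...
	// ...but if every b = true row sorts to the end of t@t_a_idx, the scan
	// actually has to read at least 90,020 rows before it finds 20 matches.
}
```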

Here's a demonstration:

CREATE TABLE t (a int NOT NULL, b bool NOT NULL);
INSERT INTO t SELECT generate_series(0, 9999), false;
INSERT INTO t SELECT generate_series(10000, 19999), false;
INSERT INTO t SELECT generate_series(20000, 29999), false;
INSERT INTO t SELECT generate_series(30000, 39999), false;
INSERT INTO t SELECT generate_series(40000, 49999), false;
INSERT INTO t SELECT generate_series(50000, 59999), false;
INSERT INTO t SELECT generate_series(60000, 69999), false;
INSERT INTO t SELECT generate_series(70000, 79999), false;
INSERT INTO t SELECT generate_series(80000, 89999), false;
INSERT INTO t SELECT generate_series(90000, 99999), true; -- selected rows at end of input
CREATE INDEX ON t (a) STORING (b);
CREATE INDEX ON t (b) STORING (a);
ANALYZE t;

-- The optimizer expects to find 20 rows with b = true after reading 200 rows from t@t_a_idx,
-- assuming b = true is evenly distributed, but it actually must read at least 90020 rows.
EXPLAIN SELECT * FROM t WHERE b ORDER BY a LIMIT 20;
EXPLAIN (OPT, VERBOSE) SELECT * FROM t WHERE b ORDER BY a LIMIT 20;
EXPLAIN ANALYZE SELECT * FROM t WHERE b ORDER BY a LIMIT 20;

-- When we force the index on b, we read 10000 rows, which is roughly 5x to 9x faster.
EXPLAIN SELECT * FROM t@t_b_idx WHERE b ORDER BY a LIMIT 20;
EXPLAIN (OPT, VERBOSE) SELECT * FROM t@t_b_idx WHERE b ORDER BY a LIMIT 20;
EXPLAIN ANALYZE SELECT * FROM t@t_b_idx WHERE b ORDER BY a LIMIT 20;

Unfortunately we don't really have enough information to cost this correctly. If the selected rows are at the beginning of the input then the plan is indeed good. Here's the same demonstration but with selected rows at the beginning:

CREATE TABLE t2 (a int NOT NULL, b bool NOT NULL);
INSERT INTO t2 SELECT generate_series(0, 9999), true; -- selected rows at beginning of input
INSERT INTO t2 SELECT generate_series(10000, 19999), false;
INSERT INTO t2 SELECT generate_series(20000, 29999), false;
INSERT INTO t2 SELECT generate_series(30000, 39999), false;
INSERT INTO t2 SELECT generate_series(40000, 49999), false;
INSERT INTO t2 SELECT generate_series(50000, 59999), false;
INSERT INTO t2 SELECT generate_series(60000, 69999), false;
INSERT INTO t2 SELECT generate_series(70000, 79999), false;
INSERT INTO t2 SELECT generate_series(80000, 89999), false;
INSERT INTO t2 SELECT generate_series(90000, 99999), false;
CREATE INDEX ON t2 (a) STORING (b);
CREATE INDEX ON t2 (b) STORING (a);
ANALYZE t2;

EXPLAIN ANALYZE SELECT * FROM t2 WHERE b ORDER BY a LIMIT 20;

And, in fact, if the selected rows are all at the beginning, the limit hint can be wrong in the other direction (too pessimistic). This is a little harder to demonstrate; we need a bigger table and a smaller proportion of selected rows:

CREATE TABLE t3 (a int NOT NULL, b bool NOT NULL);
INSERT INTO t3 SELECT generate_series(0, 3999), true; -- selected rows at beginning of input
INSERT INTO t3 SELECT generate_series(4000, 19999), false;
INSERT INTO t3 SELECT generate_series(20000, 29999), false;
INSERT INTO t3 SELECT generate_series(30000, 39999), false;
INSERT INTO t3 SELECT generate_series(40000, 49999), false;
INSERT INTO t3 SELECT generate_series(50000, 59999), false;
INSERT INTO t3 SELECT generate_series(60000, 69999), false;
INSERT INTO t3 SELECT generate_series(70000, 79999), false;
INSERT INTO t3 SELECT generate_series(80000, 89999), false;
INSERT INTO t3 SELECT generate_series(90000, 99999), false;
INSERT INTO t3 SELECT generate_series(100000, 109999), false;
INSERT INTO t3 SELECT generate_series(110000, 119999), false;
INSERT INTO t3 SELECT generate_series(120000, 129999), false;
INSERT INTO t3 SELECT generate_series(130000, 139999), false;
INSERT INTO t3 SELECT generate_series(140000, 149999), false;
INSERT INTO t3 SELECT generate_series(150000, 159999), false;
INSERT INTO t3 SELECT generate_series(160000, 169999), false;
INSERT INTO t3 SELECT generate_series(170000, 179999), false;
INSERT INTO t3 SELECT generate_series(180000, 189999), false;
INSERT INTO t3 SELECT generate_series(190000, 199999), false;
INSERT INTO t3 SELECT generate_series(200000, 209999), false;
INSERT INTO t3 SELECT generate_series(210000, 219999), false;
INSERT INTO t3 SELECT generate_series(220000, 229999), false;
INSERT INTO t3 SELECT generate_series(230000, 239999), false;
INSERT INTO t3 SELECT generate_series(240000, 249999), false;
INSERT INTO t3 SELECT generate_series(250000, 259999), false;
INSERT INTO t3 SELECT generate_series(260000, 269999), false;
INSERT INTO t3 SELECT generate_series(270000, 279999), false;
INSERT INTO t3 SELECT generate_series(280000, 289999), false;
INSERT INTO t3 SELECT generate_series(290000, 299999), false;
INSERT INTO t3 SELECT generate_series(300000, 309999), false;
INSERT INTO t3 SELECT generate_series(310000, 319999), false;
INSERT INTO t3 SELECT generate_series(320000, 329999), false;
INSERT INTO t3 SELECT generate_series(330000, 339999), false;
INSERT INTO t3 SELECT generate_series(340000, 349999), false;
INSERT INTO t3 SELECT generate_series(350000, 359999), false;
INSERT INTO t3 SELECT generate_series(360000, 369999), false;
INSERT INTO t3 SELECT generate_series(370000, 379999), false;
INSERT INTO t3 SELECT generate_series(380000, 389999), false;
INSERT INTO t3 SELECT generate_series(390000, 399999), false;
INSERT INTO t3 SELECT generate_series(400000, 409999), false;
INSERT INTO t3 SELECT generate_series(410000, 419999), false;
INSERT INTO t3 SELECT generate_series(420000, 429999), false;
INSERT INTO t3 SELECT generate_series(430000, 439999), false;
INSERT INTO t3 SELECT generate_series(440000, 449999), false;
INSERT INTO t3 SELECT generate_series(450000, 459999), false;
INSERT INTO t3 SELECT generate_series(460000, 469999), false;
INSERT INTO t3 SELECT generate_series(470000, 479999), false;
INSERT INTO t3 SELECT generate_series(480000, 489999), false;
INSERT INTO t3 SELECT generate_series(490000, 499999), false;
INSERT INTO t3 SELECT generate_series(500000, 509999), false;
INSERT INTO t3 SELECT generate_series(510000, 519999), false;
INSERT INTO t3 SELECT generate_series(520000, 529999), false;
INSERT INTO t3 SELECT generate_series(530000, 539999), false;
INSERT INTO t3 SELECT generate_series(540000, 549999), false;
INSERT INTO t3 SELECT generate_series(550000, 559999), false;
INSERT INTO t3 SELECT generate_series(560000, 569999), false;
INSERT INTO t3 SELECT generate_series(570000, 579999), false;
INSERT INTO t3 SELECT generate_series(580000, 589999), false;
INSERT INTO t3 SELECT generate_series(590000, 599999), false;
INSERT INTO t3 SELECT generate_series(600000, 609999), false;
INSERT INTO t3 SELECT generate_series(610000, 619999), false;
INSERT INTO t3 SELECT generate_series(620000, 629999), false;
INSERT INTO t3 SELECT generate_series(630000, 639999), false;
INSERT INTO t3 SELECT generate_series(640000, 649999), false;
INSERT INTO t3 SELECT generate_series(650000, 659999), false;
INSERT INTO t3 SELECT generate_series(660000, 669999), false;
INSERT INTO t3 SELECT generate_series(670000, 679999), false;
INSERT INTO t3 SELECT generate_series(680000, 689999), false;
INSERT INTO t3 SELECT generate_series(690000, 699999), false;
INSERT INTO t3 SELECT generate_series(700000, 709999), false;
INSERT INTO t3 SELECT generate_series(710000, 719999), false;
INSERT INTO t3 SELECT generate_series(720000, 729999), false;
INSERT INTO t3 SELECT generate_series(730000, 739999), false;
INSERT INTO t3 SELECT generate_series(740000, 749999), false;
INSERT INTO t3 SELECT generate_series(750000, 759999), false;
INSERT INTO t3 SELECT generate_series(760000, 769999), false;
INSERT INTO t3 SELECT generate_series(770000, 779999), false;
INSERT INTO t3 SELECT generate_series(780000, 789999), false;
INSERT INTO t3 SELECT generate_series(790000, 799999), false;
INSERT INTO t3 SELECT generate_series(800000, 809999), false;
INSERT INTO t3 SELECT generate_series(810000, 819999), false;
INSERT INTO t3 SELECT generate_series(820000, 829999), false;
INSERT INTO t3 SELECT generate_series(830000, 839999), false;
INSERT INTO t3 SELECT generate_series(840000, 849999), false;
INSERT INTO t3 SELECT generate_series(850000, 859999), false;
INSERT INTO t3 SELECT generate_series(860000, 869999), false;
INSERT INTO t3 SELECT generate_series(870000, 879999), false;
INSERT INTO t3 SELECT generate_series(880000, 889999), false;
INSERT INTO t3 SELECT generate_series(890000, 899999), false;
INSERT INTO t3 SELECT generate_series(900000, 909999), false;
INSERT INTO t3 SELECT generate_series(910000, 919999), false;
INSERT INTO t3 SELECT generate_series(920000, 929999), false;
INSERT INTO t3 SELECT generate_series(930000, 939999), false;
INSERT INTO t3 SELECT generate_series(940000, 949999), false;
INSERT INTO t3 SELECT generate_series(950000, 959999), false;
INSERT INTO t3 SELECT generate_series(960000, 969999), false;
INSERT INTO t3 SELECT generate_series(970000, 979999), false;
INSERT INTO t3 SELECT generate_series(980000, 989999), false;
INSERT INTO t3 SELECT generate_series(990000, 999999), false;
CREATE INDEX ON t3 (a) STORING (b);
CREATE INDEX ON t3 (b) STORING (a);
ANALYZE t3;

-- The optimizer expects to read 5000 rows from t3_a_idx to find 20 matching rows, but only 4000 from t3_b_idx,
-- so it chooses t3_b_idx.
EXPLAIN ANALYZE SELECT * FROM t3 WHERE b ORDER BY a LIMIT 20;

-- Really it only needs to read 20 rows from t3_a_idx. (Or 1024 if the vectorized engine is batching.)
EXPLAIN ANALYZE SELECT * FROM t3@t3_a_idx WHERE b ORDER BY a LIMIT 20;
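For reference, a rough sketch of the arithmetic behind those estimates (illustrative only; the real coster also accounts for sorts, lookups, and so on):

```go
package main

import "fmt"

func main() {
	const (
		tableRows   = 1_000_000.0
		matchedRows = 4_000.0 // rows with b = true
		limit       = 20.0
	)
	selectivity := matchedRows / tableRows // 0.004

	// t3_a_idx: assuming b = true rows are evenly distributed, the optimizer
	// expects to scan limit / selectivity rows before collecting 20 matches.
	fmt.Println(limit / selectivity) // 5000

	// t3_b_idx: a scan constrained to b = true reads every matching row, then
	// sorts by a and keeps 20.
	fmt.Println(matchedRows) // 4000

	// 5000 > 4000, so the optimizer picks t3_b_idx, even though the matches
	// all sit at the front of t3_a_idx and that scan really only needs ~20 rows.
}
```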

So it's not clear that a one-size-fits-all approach of pessimizing limit hints would be the right thing to do.

Maybe multicolumn histograms could help some of these cases.

Jira issue: CRDB-17671

msirek commented 1 year ago

Another problem case of this is https://github.com/cockroachlabs/support/issues/2197. No rows qualify for the predicate, so the lookup join limit hint ends up being too small. Idea: if there is a filter on a lookup join, don't adjust the row count down so much in computeLookupJoinCost, because that filter would be cheaper to apply as a constrained scan using another index.

https://github.com/cockroachdb/cockroach/blob/d2594fc6354ec1d3b70a9d29cb24cde513901235/pkg/sql/opt/xform/coster.go#L1160-L1171
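A very rough sketch of that idea (hypothetical names and numbers; not the actual computeLookupJoinCost code): when the join has a residual ON filter, hedge the limit-hint-driven row count reduction instead of trusting the filter's selectivity estimate.

```go
package coster

// Hypothetical sketch only; not the real coster code.
// limitedRowCount is the row count after applying the limit hint, and
// fullRowCount is the estimate with no limit hint at all.
func lookupJoinRowCountForCosting(limitedRowCount, fullRowCount float64, hasOnFilter bool) float64 {
	if !hasOnFilter {
		return limitedRowCount
	}
	// With an ON filter, the selectivity behind the limit hint may just be a
	// guess, so hedge back toward the unlimited row count (a simple midpoint
	// here; the right amount of damping is an open question).
	return (limitedRowCount + fullRowCount) / 2
}
```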

mgartner commented 1 year ago

I've come across a case in the wild where the limit hint is 1000x off from the actual number of rows needed to scan. The problem appears worst when we apply the generic 1/3 selectivity to filters that we don't have more accurate information about. If the filter is actually more selective than 1/3, then we'll come up with a wildly low limit hint and end up with a poor plan.

Maybe the generic filter selectivity should be used for row counts, but not for limit hints. In other words, when we don't have a good idea about the selectivity of the filter, why not assume that all rows will be produced in order to satisfy the limit?
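One way to picture that (a hypothetical sketch, not existing optimizer code): only divide the limit hint by the selectivity when the estimate is backed by statistics, and otherwise assume the filter's input may have to be read in full.

```go
package limithint

import "math"

// Hypothetical sketch only; not existing optimizer code.
// selectLimitHint returns the limit hint to request from a filter's input.
// When the selectivity is just the generic 1/3 guess, assume every input row
// may be needed instead of optimistically dividing the hint by 1/3.
func selectLimitHint(parentLimitHint, inputRowCount, selectivity float64, selectivityIsDefaultGuess bool) float64 {
	if selectivityIsDefaultGuess {
		return inputRowCount
	}
	return math.Min(inputRowCount, math.Ceil(parentLimitHint/selectivity))
}
```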

msirek commented 1 year ago

Another case of this is https://github.com/cockroachlabs/support/issues/2447. Except in that case we weren't using the default filter selectivity. If we could mark each selectivity estimate with a confidence level, then "no confidence" estimates could use a selectivity of 1 for limit hint costing, and "low confidence" estimates could use something higher than the original selectivity estimate, but not 1.
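Sketched out (hypothetical types and numbers, nothing that exists today):

```go
package limithint

// SelectivityConfidence is a hypothetical tag attached to each selectivity
// estimate describing how much evidence backs it.
type SelectivityConfidence int

const (
	NoConfidence   SelectivityConfidence = iota // e.g. the generic 1/3 guess
	LowConfidence                               // e.g. derived from partial or stale stats
	HighConfidence                              // e.g. backed by a histogram
)

// selectivityForLimitHint returns the selectivity to use when computing limit
// hints: "no confidence" estimates fall back to 1 (fully pessimistic), "low
// confidence" estimates are inflated partway toward 1, and high-confidence
// estimates are used as-is.
func selectivityForLimitHint(sel float64, conf SelectivityConfidence) float64 {
	switch conf {
	case NoConfidence:
		return 1
	case LowConfidence:
		return (sel + 1) / 2
	default:
		return sel
	}
}
```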

mgartner commented 7 months ago

I have a theory that we are simply overly optimistic when costing with a limit hint. For example, in a scan, we take the limit hint as the row count. This assumes that the rows that will satisfy the limit in a parent expression are evenly distributed across the index. That is close to a best-case scenario, and in most cases wildly optimistic.

Should we instead consider the row count to be the average of the limit hint and the worst case of scanning the entire table? Alternatively, we could use the average of the limit that produces the limit hint, the limit hint, and the number of rows in the table.
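Plugging in the numbers from the first demonstration, a quick illustration of what those two blends would estimate (arithmetic only):

```go
package main

import "fmt"

func main() {
	const (
		limit     = 20.0
		limitHint = 200.0     // limit / 0.1 selectivity
		tableRows = 100_000.0 // worst case: read the whole index
	)
	// Average of the limit hint and the worst case.
	fmt.Println((limitHint + tableRows) / 2) // 50100
	// Average of the limit, the limit hint, and the table row count.
	fmt.Println((limit + limitHint + tableRows) / 3) // ≈33407
}
```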

This is similar to the idea of having a confidence interval of row counts or costs, but would not require any additional machinery.

mgartner commented 7 months ago

Disregard my last comment. Any change to make the limit hint -> row count estimation more or less conservative will benefit certain edge cases but be a detriment to others. I don't see a good solution here besides trying to refine costing to more accurately reflect real-world execution time (not sure if there is even room for improvement here) or speeding up the execution time of the bad plans so that they are less bad when chosen.

mgartner commented 6 months ago

We've seen a handful of cases where the limit hint is wildly wrong because the selectivity of a filter cannot be accurately determined and defaults to 1/3, a wild over-estimation. I've outlined some improvements we can make to our selectivity estimates in #119966.