opt: poor index recommendations for queries filtering non-indexed columns

When queries have some filters on non-indexed columns, the optimizer can produce poor index recommendations. We collect only 2-bucket histograms for non-indexed columns. With such a limited perspective of the distribution of values of non-indexed columns, the optimizer can wildly underestimate row counts as it explores possible index recommendations. This can cause the optimizer to make poor recommendations.

Here's an annotated reproduction of the problem:

CREATE TABLE ab (
  a INT PRIMARY KEY,
  b INT,
  INDEX (b)
);

CREATE TABLE cda (
  c INT PRIMARY KEY,
  d INT,
  a INT,
  INDEX (a)
);

-- Insert 1000 rows into ab with 100 distinct values of b, evenly distributed.
INSERT INTO ab SELECT i, i % 100 FROM generate_series(1, 1000) g(i);

-- Insert 10,000 rows into cda with 10,000 distinct values of d.
INSERT INTO cda SELECT i, i, i % 100 FROM generate_series(1, 10000) g(i);

-- Insert 1000 rows into cds with 1 distinct value of d, 33.
INSERT INTO cda SELECT i, 33, i % 100 FROM generate_series(10001, 11000) g(i);

ANALYZE ab;
ANALYZE cda;

EXPLAIN
SELECT * FROM ab
JOIN cda ON ab.a = cda.a
WHERE b = 10 AND d = 33;
--                                           info
-- ----------------------------------------------------------------------------------------
--   distribution: local
--   vectorized: true
--
--   • lookup join
--   │ estimated row count: 1
--   │ table: cda@cda_pkey
--   │ equality: (c) = (c)
--   │ equality cols are key
--   │ pred: d = 33
--   │
--   └── • lookup join
--       │ estimated row count: 1,100
--       │ table: cda@cda_a_idx
--       │ equality: (a) = (a)
--       │
--       └── • scan
--             estimated row count: 10 (1.0% of the table; stats collected 0 seconds ago)
--             table: ab@ab_b_idx
--             spans: [/10 - /10]
--
--   index recommendations: 1
--   1. type: index creation
--      SQL command: CREATE INDEX ON defaultdb.public.cda (d) STORING (a);
-- (23 rows)

-- Create the recommended index.
CREATE INDEX ON cda (d) STORING (a);

-- The new index is used. The optimizer estimates a row count of 1.
-- However, this plan is actually worse than the previous plan.
EXPLAIN
SELECT * FROM ab
JOIN cda ON ab.a = cda.a
WHERE b = 10 AND d = 33;
--                                         info
-- ------------------------------------------------------------------------------------
--   distribution: local
--   vectorized: true
--
--   • lookup join
--   │ estimated row count: 1
--   │ table: ab@ab_pkey
--   │ equality: (a) = (a)
--   │ equality cols are key
--   │ pred: b = 10
--   │
--   └── • scan
--         estimated row count: 1 (0.01% of the table; stats collected 0 seconds ago)
--         table: cda@cda_d_idx
--         spans: [/33 - /33]
-- (14 rows)

-- The row count is actually 1001. The estimate is far off because the index
-- recommendation was based on a 2-bucket histogram, which is the size of
-- histograms collected for non-indexed columns.
SELECT count(*) FROM cda WHERE d = 33;
--   count
-- ---------
--    1001
-- (1 row)

Jira issue: CRDB-39902

cockroachdb / cockroach

opt: poor index recommendations for queries filtering non-indexed columns #126418