cockroachdb / cockroach

CockroachDB - the open source, cloud-native distributed SQL database.
https://www.cockroachlabs.com

sql: loose index scan #24584

Open godofdream opened 6 years ago

godofdream commented 6 years ago

MySQL supports loose index scans, and in Postgres they can be simulated using a recursive CTE. See: https://dev.mysql.com/doc/refman/5.7/en/group-by-optimization.html

One use case example:

SELECT DISTINCT ON (a) a, timestamp, c FROM mytable ORDER BY a, timestamp DESC, c;

With an index over (a, timestamp DESC, c), where a and timestamp are not unique but c is, this query normally requires a full table scan, but a loose index scan would be beneficial.
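
Not CockroachDB code, but the idea can be sketched in Python, using a sorted list with bisect as a stand-in for the index and negated timestamps to simulate DESC ordering (all names and data are illustrative):

```python
import bisect

# Stand-in for an index on (a, timestamp DESC, c): rows kept sorted so the
# first entry for each value of a is the one with the latest timestamp
# (timestamp is negated to simulate DESC in an ascending structure).
index = sorted(
    (a, -ts, c) for a, ts, c in [
        (1, 10, "x"), (1, 20, "y"), (2, 5, "z"), (2, 7, "w"), (4, 1, "v"),
    ]
)

def distinct_on_a(index):
    """DISTINCT ON (a): return the first-sorting row per a (the one with the
    latest timestamp) using seeks instead of reading every row."""
    results = []
    pos = 0
    while pos < len(index):
        a, neg_ts, c = index[pos]
        results.append((a, -neg_ts, c))
        # Seek past every remaining entry for this value of a: jump to the
        # first key strictly greater than any (a, ...) key.
        pos = bisect.bisect_right(index, (a, float("inf"), ""), lo=pos)
    return results

print(distinct_on_a(index))  # one row per a, each with its latest timestamp
```

Each iteration reads one row and then seeks, so the cost is proportional to the number of distinct values of a rather than the number of rows.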

Importance: Nice-To-Have performance enhancement

Jira issue: CRDB-5753

jordanlewis commented 6 years ago

For future readers, a loose index scan skips over an index using multiple seeks. For example, take the following schema and table:

CREATE TABLE t (a int, b int, c string, PRIMARY KEY(a, b));
SELECT DISTINCT(a) FROM t;

That SELECT currently does a full table scan. If there are a ton of different b values per value of a, that full table scan will be less efficient than doing seeks to a+1 after every fresh value of a that's found.
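
A toy Python simulation of that seek pattern, with a sorted list standing in for the primary index (the data and names are illustrative, not CockroachDB internals):

```python
import bisect

# Stand-in for the primary index on (a, b): 2 distinct values of a,
# many values of b each, as in the scenario above.
index = sorted((a, b) for a in (0, 10) for b in range(1000))

def distinct_a(index):
    """SELECT DISTINCT(a): after each fresh value of a, seek past all of its
    b values instead of scanning them."""
    seen, seeks, pos = [], 0, 0
    while pos < len(index):
        a = index[pos][0]
        seen.append(a)
        # Jump to the first key strictly greater than any (a, b) pair.
        pos = bisect.bisect_right(index, (a, float("inf")), lo=pos)
        seeks += 1
    return seen, seeks

values, seeks = distinct_a(index)
print(values, seeks)  # 2 distinct values found in 2 seeks, not 2,000 reads
```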

petermattis commented 6 years ago

To see benefit, the "loose index scan" would need to be implemented down at the level of the KV Scan operation. @andreimatei this is an example of additional intelligence (beyond filtering, projection, and aggregation) that we may want to add to the Scan op.

jordanlewis commented 5 years ago

I don't agree with @petermattis that this would need to be implemented at the kv layer to see benefit.

Imagine a table that has three columns: a, b, c. There are 10,000,000 values of b for every value of a, but just 2 values of a.

In this case, today, a query like

SELECT * FROM t WHERE b = 2000;

would need to scan 20,000,000 rows, which works out to be a lot of roundtrips to the kv layer.

If we implemented a loose index scan without kv support, we'd need only 5 roundtrips:

  1. Search for [minKey of a, 2000]. This will return the first value in the table, or a match if a contains minKey. If there's a match, we return the row, increment a, and try again. If there's no match, we save the first value of a, and move on to step 2.
  2. Assuming there's no match, we search for [first value of a, 2000]. If there's a match, we return all rows that match a, 2000.
  3. Then, regardless of whether there's a match in step 2, we increment a and search again.

Assuming the table has a values 0 and 10, here's what the request flow would look like:

  1. search for [minKey, 2000]. get back [0, 0] and 1 batch of rows. Assume that the batch size was smaller than 2000.
  2. search for [0, 2000]. Get back a match, return it.
  3. search for [1, 2000]. Get back [10, 0] and 1 batch of rows.
  4. search for [10, 2000]. Get back a match, return it.
  5. search for [11, 2000]. Get back end of table. End.
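
The five-step flow above can be simulated in Python, with a sorted list of (a, b) keys standing in for the index and each bisect call standing in for one roundtrip (a sketch under those assumptions, not the actual KV protocol):

```python
import bisect

MIN_KEY = float("-inf")

# Index on (a, b): a has values 0 and 10, each with many b values.
index = sorted((a, b) for a in (0, 10) for b in range(5000))

def loose_scan_b_equals(index, target_b):
    """Find every (a, target_b) row with one seek per distinct value of a,
    mirroring the five-step flow described above."""
    matches, roundtrips = [], 0
    a = MIN_KEY
    while True:
        # One roundtrip: seek to the first key >= (a, target_b).
        pos = bisect.bisect_left(index, (a, target_b))
        roundtrips += 1
        if pos == len(index):
            break  # end of table
        found_a, found_b = index[pos]
        if found_a == a and found_b == target_b:
            matches.append((found_a, found_b))
            a = found_a + 1  # match: return the row, increment a, try again
        else:
            a = found_a      # landed on a fresh value of a; retry with it
    return matches, roundtrips

print(loose_scan_b_equals(index, 2000))  # 2 matches found in 5 roundtrips
```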

petermattis commented 5 years ago

@jordanlewis The approach you describe here is interesting. With per-column histograms, or perhaps even just distinct counts, I think the optimizer could determine when this type of index scan is beneficial. Pushing support into the KV layer would still be desirable at some point, as I think this type of scan could be used in more situations (i.e. when the win isn't as clear cut as your scenario).

awoods187 commented 5 years ago

Linking to https://github.com/cockroachdb/cockroach/issues/38082

andreimatei commented 5 years ago

Oracle calls this "index skip scans"

https://gerardnico.com/db/oracle/index_scans#index_skip_scans

rohany commented 5 years ago

implemented in #38216

andreimatei commented 4 years ago

This appears to have been closed prematurely. Fixing it needs #39668, which is stalled at the moment.

mgartner commented 3 years ago

A loose index scan can also be beneficial in a simpler case than the one described by @jordanlewis above—querying for distinct values.

CREATE TABLE t (a INT, INDEX (a));
SELECT DISTINCT(a) FROM t;

If there are, for example, 1000 unique values of a and there are 1,000,000 rows in the table, you can use a similar method as described above to avoid a full table scan. This should be beneficial whenever the ratio of distinct values to number of rows is low.

Example:

Assume a has values 0, 100, 200, 300, 400, 500...

  1. Search for the minimum key in [minKey, maxKey], batch size of 1. Get back 0.
  2. Search for the minimum key in (0, maxKey], batch size of 1. Get back 100.
  3. Search for the minimum key in (100, maxKey], batch size of 1. Get back 200.
  4. Search for the minimum key in (200, maxKey], batch size of 1. Get back 300.
  ...
  n. When you get nothing back, you've found all unique values of a.

Postgres doesn't have loose index scans, but you can mimic a loose index scan using a recursive CTE: https://malisper.me/the-missing-postgres-scan-the-loose-index-scan
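
SQLite supports recursive CTEs too, so the trick from the linked post can be demonstrated in-process with Python's sqlite3 module (table and column names here are illustrative, not from the post):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (a INT)")
conn.executemany("INSERT INTO t VALUES (?)",
                 [(a,) for a in (0, 100, 200, 300) for _ in range(250)])
conn.execute("CREATE INDEX t_a_idx ON t (a)")

# Each recursive step does one indexed MIN() lookup strictly above the
# previous value -- one "seek" per distinct value instead of a full scan.
# Recursion stops when the subquery returns NULL.
rows = conn.execute("""
    WITH RECURSIVE distinct_a(a) AS (
        SELECT MIN(a) FROM t
        UNION ALL
        SELECT (SELECT MIN(a) FROM t WHERE t.a > distinct_a.a)
        FROM distinct_a WHERE distinct_a.a IS NOT NULL
    )
    SELECT a FROM distinct_a WHERE a IS NOT NULL
""").fetchall()
print([r[0] for r in rows])  # [0, 100, 200, 300]
```

1,000 rows collapse to 4 distinct values with one indexed lookup per value, which is exactly the seek sequence described in the steps above.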

mgartner commented 2 years ago

Here's another type of query that would benefit from loose index scans. Consider the table and query:

CREATE TABLE t (a INT, b INT, c INT, INDEX a_b_c_idx (a, b, c));
SELECT * FROM t WHERE a = 1 AND c = 3;

Currently, we would scan all KVs in a_b_c_idx where a = 1, then filter on c = 3. With a loose index scan approach, we don't need to scan all KVs and apply a filter. We can skip over repeated values of b where c != 3. This could be more efficient in cases where a = 1 AND c = 3 is highly selective and the cardinality of b for rows that satisfy that filter is low.
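
A rough Python sketch of that skip pattern, again using a sorted list of (a, b, c) keys as a stand-in for a_b_c_idx (the data is illustrative):

```python
import bisect

# Stand-in for index a_b_c_idx on (a, b, c).
index = sorted((1, b, c) for b in range(1000) for c in (1, 2, 3))

def skip_scan(index, a, c):
    """WHERE a = ? AND c = ?: for each distinct b under a, seek directly to
    the candidate key (a, b, c) instead of scanning and filtering every KV."""
    out = []
    pos = bisect.bisect_left(index, (a, float("-inf"), float("-inf")))
    while pos < len(index) and index[pos][0] == a:
        b = index[pos][1]
        # One seek for the single candidate row (a, b, c).
        hit = bisect.bisect_left(index, (a, b, c))
        if hit < len(index) and index[hit] == (a, b, c):
            out.append(index[hit])
        # Skip the remaining c values for this b: jump past (a, b, ...).
        pos = bisect.bisect_right(index, (a, b, float("inf")), lo=pos)
    return out

print(len(skip_scan(index, 1, 3)))  # one matching row per distinct b
```

The work is proportional to the number of distinct b values under a = 1, not to the total KV count, which is why a low cardinality of b is the interesting case.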

github-actions[bot] commented 6 months ago

We have marked this issue as stale because it has been inactive for 18 months. If this issue is still relevant, removing the stale label or adding a comment will keep it active. Otherwise, we'll close it in 10 days to keep the issue queue tidy. Thank you for your contribution to CockroachDB!