cockroachdb / cockroach

CockroachDB — the cloud native, distributed SQL database designed for high availability, effortless scale, and control over data placement.
https://www.cockroachlabs.com
Other
30.07k stars 3.8k forks source link

sql: support INVERTED INDEX range scans #24960

Open danhhz opened 6 years ago

danhhz commented 6 years ago

From the forum: https://forum.cockroachlabs.com/t/multi-tenant-custom-fields-saas-app/1565/2

So, let’s say all tenants’ products are in one table and there’s foreign key tenant_id. Then we have a json field, custom_data. A tenant might have a custom field price. Then the tenant wants to search all his products where price > 100. An index on the foreign key, possibly compound with other “static” fields will speed up the query. But an index on json field will not be useful in this case, right?

Jira issue: CRDB-5739

danhhz commented 6 years ago

Assigning to @awoods187 for prioritization

awoods187 commented 4 years ago

@RaduBerinde is this related to the computed index ideas we've been discussing?

RaduBerinde commented 4 years ago

@RaduBerinde is this related to the computed index ideas we've been discussing?

I think so, we could have an index on a computed value that extracts the json field.

jordanlewis commented 4 years ago

A similar use case is for prefix matches on inverted index string columns:

create table a (id int primary key, s string[], inverted index(s));
insert into a values(1, array['big', 'indexed', 'sentence']);

Today, you can do exact matches:

select * from a where s @> array['blah']

With some way to do range scans, you could do something like (syntax doesn't work, but this is the intent):

select * from a where any(s) like 'prefix%'
lopezator commented 4 years ago

We also got bitten by this, a filter over an JSON column with an inverted index uses it, but if you add another field to the game, a PK for example, it actually ignores the inverted index.

Thanks for raising this @danhhz

DROP table foo;
CREATE TABLE foo (A INT PRIMARY KEY, B jsonb, C VARCHAR);
INSERT INTO foo (A, B, C) SELECT generate_series(1,100) AS A, '{"values": ["foo", "bar", "baz"]}' AS B, md5(random()::text) AS C;
CREATE INVERTED INDEX foo_inv ON foo(B);
CREATE INDEX foo_idx ON foo(C);
EXPLAIN SELECT * FROM foo WHERE B @> '{"values": ["baz"]}';

Looks good:

[
  {
    "tree": "",
    "field": "distributed",
    "description": "false"
  },
  {
    "tree": "",
    "field": "vectorized",
    "description": "false"
  },
  {
    "tree": "index-join",
    "field": "",
    "description": ""
  },
  {
    "tree": " │",
    "field": "table",
    "description": "foo@primary"
  },
  {
    "tree": " │",
    "field": "key columns",
    "description": "a"
  },
  {
    "tree": " └── scan",
    "field": "",
    "description": ""
  },
  {
    "tree": "",
    "field": "table",
    "description": "foo@foo_inv"
  },
  {
    "tree": "",
    "field": "spans",
    "description": "/\"values\"/Arr/\"baz\"-/\"values\"/Arr/\"baz\"/PrefixEnd"
  }
]

But:

EXPLAIN SELECT * FROM foo WHERE  B @> '{"values": ["baz"]}' AND C = 'someAutogenID';

Doesn't:

[
  {
    "tree": "",
    "field": "distributed",
    "description": "false"
  },
  {
    "tree": "",
    "field": "vectorized",
    "description": "false"
  },
  {
    "tree": "filter",
    "field": "",
    "description": ""
  },
  {
    "tree": " │",
    "field": "filter",
    "description": "b @> '{\"values\": [\"baz\"]}'"
  },
  {
    "tree": " └── index-join",
    "field": "",
    "description": ""
  },
  {
    "tree": "      │",
    "field": "table",
    "description": "foo@primary"
  },
  {
    "tree": "      │",
    "field": "key columns",
    "description": "a"
  },
  {
    "tree": "      └── scan",
    "field": "",
    "description": ""
  },
  {
    "tree": "",
    "field": "table",
    "description": "foo@foo_idx"
  },
  {
    "tree": "",
    "field": "spans",
    "description": "/\"78455d02293f0f16ab5e519c244a70dc\"-/\"78455d02293f0f16ab5e519c244a70dc\"/PrefixEnd"
  }
]
RaduBerinde commented 4 years ago

@lopezator - in the second case, it's much better to use the primary index since we scan at most one row (for a=3). Using the inverted index would be worse in most cases.

lopezator commented 4 years ago

@RaduBerinde you are right, bad example. I've update the example above to be more clear.

RaduBerinde commented 3 years ago

CC @mgartner @rytaft

mgartner commented 3 years ago

@lopezator Your updated example is scanning foo_idx, which is the better query plan assuming that the filters on C is more selective than the filter on B. Currently, stats for JSON columns are not as precise as stats for other data types, so it's possible that a scan on foo_idx would be preferred even if the filter on B was more selective.

github-actions[bot] commented 1 year ago

We have marked this issue as stale because it has been inactive for 18 months. If this issue is still relevant, removing the stale label or adding a comment will keep it active. Otherwise, we'll close it in 10 days to keep the issue queue tidy. Thank you for your contribution to CockroachDB!

Bessonov commented 1 year ago

a comment will keep it active