cockroachdb / cockroach

CockroachDB — the cloud native, distributed SQL database designed for high availability, effortless scale, and control over data placement.
https://www.cockroachlabs.com
Other
30.2k stars 3.82k forks source link

sql: inverted (trigram) indexes on the same column as the primary key are not allowed #84405

Open jordanlewis opened 2 years ago

jordanlewis commented 2 years ago

EDIT: #84412 fixed the brokenness described below by disallowing trigram indexes on primary key columns altogether.

Inverted indexes on the same column as the primary key column in a table are broken. This is only possible in the (unreleased) trigram index functionality, because all other inverted indexes are only available on types that are not forward indexable (arrays, json, and geo types do not support ordinary forward indexes).

Consider the following case:

demo@127.0.0.1:26257/defaultdb> create table a (a primary key) as values('1000'),('1001');

demo@127.0.0.1:26257/defaultdb> create inverted index on a (a gin_trgm_ops);
ERROR: internal error: validation of index a_a_idx failed: expected 10 rows, found 7
SQLSTATE: XX000
DETAIL: stack trace:
github.com/cockroachdb/cockroach/pkg/sql/backfill.go:1556: func1()
github.com/cockroachdb/cockroach/pkg/util/ctxgroup/ctxgroup.go:169: func1()
golang.org/x/sync/errgroup/external/org_golang_x_sync/errgroup/errgroup.go:74: func1()
GOROOT/src/runtime/asm_amd64.s:1581: goexit()

HINT: You have encountered an unexpected error.

Please check the public issue tracker to check whether this problem is
already tracked. If you cannot find it there, please report the error
with details by creating a new issue.

If you would rather not post publicly, please contact us directly
using the support form.

We appreciate your feedback.
demo@127.0.0.1:26257/defaultdb>

This brokenness is due to the fact that index descriptors must keep track of the "key suffix columns" to create a proper index key: these are the columns that are the indexed columns of an index, minus any columns that are also in the primary key.

In inverted indexes on primary keys, this means that the "key suffix column" list will always be empty: since we're indexing a column that's already in the PK, the code thinks there is no need to append the PK back again to the index key.

However, this is a flawed assumption in the case of inverted indexes, because inverted indexes don't store the full copy of the values that are being indexed, only a portion.

The fix is to change the "key suffix column" calculation logic to not subtract a PK column from the list if it's the inverted column in an inverted index.

Jira issue: CRDB-17637

jordanlewis commented 2 years ago

cc @mgartner as an FYI. I've added this to the schema board for tracking purposes (I think that would be correct), but I've assigned myself and will send a fix.

jordanlewis commented 2 years ago

I created a patch that permits correct index creation and validation, but it produces a variety problems at query time, including errors and incorrect outputs. I am wondering whether it's because the optimizer isn't able to deal with seeing the same column twice in a single row, especially if one of the appearances of the column is the specially handled opaque bytes of an inverted column:

demo@127.0.0.1:26257/defaultdb> create table a (b TEXT PRIMARY KEY, INVERTED INDEX(b gin_trgm_ops));
CREATE TABLE

Time: 9ms total (execution 9ms / network 0ms)

demo@127.0.0.1:26257/defaultdb> insert into a values('abcd');                                                                                                                                             INSERT 0 1

Time: 7ms total (execution 6ms / network 0ms)

demo@127.0.0.1:26257/defaultdb> explain(opt,verbose) select * from a where b like '%bcd%';                                                                                                                                                                              info
---------------------------------------------------------------------------------------------------------------------------------
  index-join a
   ├── columns: b:1
   ├── stats: [rows=333.3333, distinct(1)=333.333, null(1)=0, avgsize(1)=4]
   ├── cost: 957.773704
   ├── key: (1)
   ├── distribution: us-east1
   └── select
        ├── columns: b:1
        ├── stats: [rows=37.03704, distinct(1)=37.037, null(1)=0, avgsize(1)=4]
        ├── cost: 132.938889
        ├── key: (1)
        ├── distribution: us-east1
        ├── scan a@a_b_idx
        │    ├── columns: b:1
        │    ├── inverted constraint: /4/1
        │    │    └── spans: ["\x12bcd\x00\x01", "\x12bcd\x00\x01"]
        │    ├── stats: [rows=111.1111, distinct(1)=111.111, null(1)=0, avgsize(1)=4, distinct(4)=100, null(4)=0, avgsize(4)=4]
        │    ├── cost: 131.797778
        │    ├── key: (1)
        │    └── distribution: us-east1
        └── filters
             └── b:1 LIKE '%bcd%' [outer=(1), constraints=(/1: (/NULL - ])]
(22 rows)

Time: 1ms total (execution 1ms / network 0ms)

demo@127.0.0.1:26257/defaultdb> select * from a where b like '%bcd%';                                                                                                                                       b
-----
(0 rows)

Time: 2ms total (execution 2ms / network 0ms)

-- After some time (stats changes?):

demo@127.0.0.1:26257/defaultdb> select * from a where b like '%bcd%';
ERROR: unsupported comparison: string to bytes
SQLSTATE: 42804
demo@127.0.0.1:26257/defaultdb> explain(opt,verbose) select * from a where b like '%bcd%';
ERROR: unsupported comparison: string to bytes
SQLSTATE: 42804

@mgartner, do you have thoughts on this behavior? Is the optimizer equipped to deal with seeing a column id twice, once as an inverted column and once as a forward column?

I'm thinking that the easiest and best path forward for now would probably be to completely disable the ability to have an inverted index on a primary key column, which seems like a somewhat marginal use case in the first place. And, this general issue won't be a problem in any other scenarios because inverted indexes cannot have storing columns, which is the only other way that an inverted index could have a second (forward) copy of the column that it's inverted indexing.

mgartner commented 5 months ago

There's a reasonable and creative (I'm pretty proud of it, at least 😉) workaround for this: create a virtual computed column that is a duplicate of the PK column you want to inverted-index. The optimizer will be able to recognize the equality of the two columns and it should plan constrained scans and joins on the inverted index even when filtering by the PK column.

Here's an example:

CREATE TABLE t (
  s STRING PRIMARY KEY,
  indexed_s STRING AS (s) VIRTUAL,
  INVERTED INDEX (indexed_s gin_trgm_ops)
);

-- A query filtering by "s" can utilize the inverted index on "indexed_s".
EXPLAIN SELECT * FROM t WHERE s LIKE '%foobar%';
--                       info
-- -------------------------------------------------
--   distribution: local
--   vectorized: true
--
--   • render
--   │
--   └── • inverted filter
--       │ inverted column: indexed_s_inverted_key
--       │ num spans: 4
--       │
--       └── • filter
--           │ filter: s LIKE '%foobar%'
--           │
--           └── • scan
--                 missing stats
--                 table: t@t_indexed_s_idx
--                 spans: 4 spans
-- (16 rows)
mgartner commented 5 months ago

Maybe this workaround hints at an acceptable (even best?) solution: automatically create an inaccessible, virtual computed column for users trying to inverted-index a PK column. Would that be easier than changing the behavior of "key suffix columns"?