sql: Similarity queries (%) opt to not use GIN indexes based on query parameters

Describe the problem

A simple SELECT * FROM table WHERE column % 'word' query will ignore the trigram index for certain phrases.

To Reproduce

I have a newly created single node CockroachDB instance and have indexed the USDA Food Dataset food names after manually tokenizing and stemming the strings.

root@localhost:26257/defaultdb> SHOW CREATE TABLE words;
  table_name |                              create_statement
-------------+------------------------------------------------------------------------------
  words      | CREATE TABLE public.words (
             |     food_id UUID NOT NULL,
             |     word STRING NOT NULL,
             |     count INT8 NOT NULL,
             |     weight INT8 NOT NULL,
             |     rowid INT8 NOT VISIBLE NOT NULL DEFAULT unique_rowid(),
             |     CONSTRAINT words_pkey PRIMARY KEY (rowid ASC),
             |     INVERTED INDEX words_food_id_word_idx (food_id ASC, word gin_trgm_ops),
             |     INDEX words_food_id_idx (food_id ASC),
             |     INVERTED INDEX words_word_idx (word gin_trgm_ops)
             | )
(1 row)

When experimenting with various queries, I noticed that certain words will not use the trigram indexes:

root@localhost:26257/defaultdb> EXPLAIN  SELECT food_id FROM words WHERE word % 'egg';
                                                                 info
---------------------------------------------------------------------------------------------------------------------------------------
  distribution: local
  vectorized: true

  • filter
  │ estimated row count: 214,860
  │ filter: word % 'egg'
  │
  └── • index join
      │ estimated row count: 0
      │ table: words@words_pkey
      │
      └── • inverted filter
          │ estimated row count: 0
          │ inverted column: word_inverted_key
          │ num spans: 4
          │
          └── • scan
                estimated row count: 0 (<0.01% of the table; stats collected 19 minutes ago; using stats forecast for 18 minutes ago)
                table: words@words_word_idx
                spans: 4 spans
(20 rows)

Time: 3ms total (execution 3ms / network 0ms)

root@localhost:26257/defaultdb> EXPLAIN  SELECT food_id FROM words WHERE word % 'cake';
                                                               info
-----------------------------------------------------------------------------------------------------------------------------------
  distribution: local
  vectorized: true

  • filter
  │ estimated row count: 214,860
  │ filter: word % 'cake'
  │
  └── • scan
        estimated row count: 644,580 (100% of the table; stats collected 19 minutes ago; using stats forecast for 18 minutes ago)
        table: words@words_pkey
        spans: FULL SCAN
(11 rows)

Time: 3ms total (execution 3ms / network 0ms)

The above queries don't result in dramatically different result counts but the execution times are dramatically different.

root@localhost:26257/defaultdb> SELECT COUNT(*) FROM words WHERE word % 'egg';
  count
---------
   1096
(1 row)

Time: 129ms total (execution 129ms / network 0ms)

root@localhost:26257/defaultdb> SELECT COUNT(*) FROM words WHERE word % 'cake';
  count
---------
   1425
(1 row)

Time: 876ms total (execution 876ms / network 0ms)

Expected behavior I would expect that any % query would result in a query plan that uses the available trigram index.

Additional data / screenshots Just to give you an idea of the trigram distribution and table size:

root@localhost:26257/defaultdb> SELECT COUNT(*) FROM words;
  count
----------
  644580
(1 row)

Time: 140ms total (execution 140ms / network 0ms)

root@localhost:26257/defaultdb> SELECT unnest(show_trgm(word)) as trigram, COUNT(*) FROM words GROUP BY trigram ORDER BY COUNT(*) DESC LIMIT 25;
  trigram | count
----------+--------
    c     | 95313
    s     | 80770
    b     | 52114
    p     | 47186
    f     | 39878
    m     | 39123
   ch     | 34820
  er      | 27228
    w     | 26876
    g     | 25592
    t     | 25329
    o     | 24770
    r     | 22522
    a     | 21489
    d     | 20076
   co     | 19451
    h     | 19332
  ri      | 19137
  an      | 18802
  in      | 17735
    l     | 17033
   sa     | 15627
   cr     | 14996
  che     | 14521
  rea     | 14453

Environment:

CockroachDB version v22.2.0
Server OS: MacOS
Client app cockroach sql

Additional context

Its a bit concerning to me that a simple parameter change, that is likely to be user controlled in the case of trigrams, could dramatically alter the query plan and execution time of queries.

Jira issue: CRDB-22544

A few other things to note:

It seems that "ca" is what triggers the behavior and ch will as well.

root@localhost:26257/defaultdb> EXPLAIN  SELECT food_id FROM words WHERE word % 'ca';
                                                               info
-----------------------------------------------------------------------------------------------------------------------------------
  distribution: local
  vectorized: true

  • filter
  │ estimated row count: 214,860
  │ filter: word % 'ca'
  │
  └── • scan
        estimated row count: 644,580 (100% of the table; stats collected 32 minutes ago; using stats forecast for 31 minutes ago)
        table: words@words_pkey
        spans: FULL SCAN
(11 rows)

Time: 5ms total (execution 4ms / network 0ms)

root@localhost:26257/defaultdb> EXPLAIN  SELECT food_id FROM words WHERE word % 'ch';
                                                               info
-----------------------------------------------------------------------------------------------------------------------------------
  distribution: local
  vectorized: true

  • filter
  │ estimated row count: 214,860
  │ filter: word % 'ch'
  │
  └── • scan
        estimated row count: 644,580 (100% of the table; stats collected 32 minutes ago; using stats forecast for 31 minutes ago)
        table: words@words_pkey
        spans: FULL SCAN
(11 rows)

Time: 3ms total (execution 3ms / network 0ms)

Increasing the length of queries using these problematic prefixes doesn't seem to do much either:

root@localhost:26257/defaultdb> EXPLAIN  SELECT food_id FROM words WHERE word % 'cheese';
                                                               info
-----------------------------------------------------------------------------------------------------------------------------------
  distribution: local
  vectorized: true

  • filter
  │ estimated row count: 214,860
  │ filter: word % 'cheese'
  │
  └── • scan
        estimated row count: 644,580 (100% of the table; stats collected 33 minutes ago; using stats forecast for 32 minutes ago)
        table: words@words_pkey
        spans: FULL SCAN
(11 rows)

Time: 4ms total (execution 4ms / network 0ms)

root@localhost:26257/defaultdb> EXPLAIN  SELECT food_id FROM words WHERE word % 'cakes';
                                                               info
-----------------------------------------------------------------------------------------------------------------------------------
  distribution: local
  vectorized: true

  • filter
  │ estimated row count: 214,860
  │ filter: word % 'cakes'
  │
  └── • scan
        estimated row count: 644,580 (100% of the table; stats collected 33 minutes ago; using stats forecast for 32 minutes ago)
        table: words@words_pkey
        spans: FULL SCAN
(11 rows)

Time: 4ms total (execution 3ms / network 0ms)

These suboptimal plans will also "infect" otherwise good plans when using an OR clause.

root@localhost:26257/defaultdb> EXPLAIN  SELECT food_id FROM words WHERE word % 'eggs';
                                                                 info
---------------------------------------------------------------------------------------------------------------------------------------
  distribution: local
  vectorized: true

  • filter
  │ estimated row count: 214,860
  │ filter: word % 'eggs'
  │
  └── • index join
      │ estimated row count: 0
      │ table: words@words_pkey
      │
      └── • inverted filter
          │ estimated row count: 0
          │ inverted column: word_inverted_key
          │ num spans: 5
          │
          └── • scan
                estimated row count: 0 (<0.01% of the table; stats collected 34 minutes ago; using stats forecast for 33 minutes ago)
                table: words@words_word_idx
                spans: 5 spans
(20 rows)

Time: 3ms total (execution 3ms / network 1ms)

root@localhost:26257/defaultdb> EXPLAIN  SELECT food_id FROM words WHERE word % 'eggs' OR word % 'cakes';
                                                               info
-----------------------------------------------------------------------------------------------------------------------------------
  distribution: local
  vectorized: true

  • filter
  │ estimated row count: 214,860
  │ filter: (word % 'eggs') OR (word % 'cakes')
  │
  └── • scan
        estimated row count: 644,580 (100% of the table; stats collected 34 minutes ago; using stats forecast for 33 minutes ago)
        table: words@words_pkey
        spans: FULL SCAN
(11 rows)

Time: 4ms total (execution 3ms / network 0ms)

It's important to note that the optimizer cannot (and will never, unless there's a bug) plan a query with a that uses a trigram inverted index if the search term contains fewer than 3 letters. So the example queries that you're seeing that don't work (ca and ch) are expected not to work. Are there examples of 2-character queries that do use the inverted index?

As far as the "or" case, there are more trigrams in that query, and the more trigrams the more scans and the more selective the trigrams must be to "win" over the full scan.

You can use EXPLAIN(opt,memo) for enough detail to be able to track down what's going on here in a little bit more detail.

I take back what I said about two-character % queries not being able to use the index. They're able to use it because padding is added.

I think any issue here is related to stats. I also think that unfortunately the stats collection is working fine, because hinting the inverted index for the cake query actually results in worse performance.

There are some execution performance low hanging fruit here though so hopefully we can speed this stuff up a bit regardless of the stats issues.

Huh, I could have sworn that forcing the usage of the index resulted in a notable speed up but after rebuilding the table it's significantly slower.

root@localhost:26257/defaultdb> EXPLAIN ANALYZE SELECT * FROM foods@foods_name_idx WHERE name % 'cake' ORDER BY similarity(name, 'cake') LIMIT 10;
                                                  info
--------------------------------------------------------------------------------------------------------
  planning time: 2ms
  execution time: 12.7s
  distribution: full
  vectorized: true
  rows read from KV: 565,017 (50 MiB, 9 gRPC calls)
  cumulative time spent in KV: 10.9s
  maximum memory usage: 56 MiB
  network usage: 0 B (0 messages)

  • top-k
  │ nodes: n1
  │ actual row count: 10
  │ estimated max memory allocated: 10 KiB
  │ estimated max sql temp disk usage: 0 B
  │ estimated row count: 10
  │ order: +column9
  │ k: 10
  │
  └── • render
      │
      └── • filter
          │ nodes: n1
          │ actual row count: 91
          │ estimated row count: 131,179
          │ filter: name % 'cake'
          │
          └── • index join
              │ nodes: n1
              │ actual row count: 238,051
              │ KV time: 10.8s
              │ KV contention time: 0µs
              │ KV rows read: 238,051
              │ KV bytes read: 34 MiB
              │ KV gRPC calls: 7
              │ estimated max memory allocated: 41 MiB
              │ estimated max sql temp disk usage: 0 B
              │ estimated row count: 296,144
              │ table: foods@foods_pkey
              │
              └── • inverted filter
                  │ nodes: n1
                  │ actual row count: 238,051
                  │ estimated max memory allocated: 15 MiB
                  │ estimated max sql temp disk usage: 0 B
                  │ estimated row count: 296,144
                  │ inverted column: name_inverted_key
                  │ num spans: 5
                  │
                  └── • scan
                        nodes: n1
                        actual row count: 326,966
                        KV time: 104ms
                        KV contention time: 0µs
                        KV rows read: 326,966
                        KV bytes read: 16 MiB
                        KV gRPC calls: 2
                        estimated max memory allocated: 10 MiB
                        estimated row count: 296,144 (75% of the table; stats collected 2 minutes ago)
                        table: foods@foods_name_idx
                        spans: 5 spans
(60 rows)

Time: 12.656s total (execution 12.656s / network 0.000s)

root@localhost:26257/defaultdb> EXPLAIN ANALYZE SELECT * FROM foods WHERE name % 'cake' ORDER BY similarity(name, 'cake') LIMIT 10;
                                              info
-------------------------------------------------------------------------------------------------
  planning time: 2ms
  execution time: 2.5s
  distribution: full
  vectorized: true
  rows read from KV: 393,537 (59 MiB, 6 gRPC calls)
  cumulative time spent in KV: 130ms
  maximum memory usage: 10 MiB
  network usage: 0 B (0 messages)

  • top-k
  │ nodes: n1
  │ actual row count: 10
  │ estimated max memory allocated: 10 KiB
  │ estimated max sql temp disk usage: 0 B
  │ estimated row count: 10
  │ order: +column9
  │ k: 10
  │
  └── • render
      │
      └── • filter
          │ nodes: n1
          │ actual row count: 91
          │ estimated row count: 131,179
          │ filter: name % 'cake'
          │
          └── • scan
                nodes: n1
                actual row count: 393,537
                KV time: 130ms
                KV contention time: 0µs
                KV rows read: 393,537
                KV bytes read: 59 MiB
                KV gRPC calls: 6
                estimated max memory allocated: 10 MiB
                estimated row count: 393,537 (100% of the table; stats collected 2 minutes ago)
                table: foods@foods_pkey
                spans: FULL SCAN
(38 rows)

Time: 2.500s total (execution 2.500s / network 0.000s)

Here's the updated EXPLAIN output:

root@localhost:26257/defaultdb> EXPLAIN(opt,memo) SELECT * FROM foods WHERE name % 'cake' ORDER BY similarity(name, 'cake') LIMIT 10;
                                                                                info
---------------------------------------------------------------------------------------------------------------------------------------------------------------------
  memo (optimized, ~10KB, required=[presentation: info:10])
   ├── G1: (explain G2 [presentation: id:1,name:2,analyzed:3,weight:4] [ordering: +9])
   │    └── [presentation: info:10]
   │         ├── best: (explain G2="[presentation: id:1,name:2,analyzed:3,weight:4] [ordering: +9]" [presentation: id:1,name:2,analyzed:3,weight:4] [ordering: +9])
   │         └── cost: 655458.61
   ├── G2: (limit G3 G4 ordering=+9) (top-k G3 &{10 +9 })
   │    ├── [presentation: id:1,name:2,analyzed:3,weight:4] [ordering: +9]
   │    │    ├── best: (top-k G3 &{10 +9 })
   │    │    └── cost: 655458.59
   │    └── []
   │         ├── best: (top-k G3 &{10 +9 })
   │         └── cost: 655458.59
   ├── G3: (project G5 G6 id name analyzed weight)
   │    ├── [ordering: +9] [limit hint: 10.00]
   │    │    ├── best: (sort G3)
   │    │    └── cost: 700593.11
   │    └── []
   │         ├── best: (project G5 G6 id name analyzed weight)
   │         └── cost: 644119.16
   ├── G4: (const 10)
   ├── G5: (select G7 G8) (select G9 G8)
   │    └── []
   │         ├── best: (select G7 G8)
   │         └── cost: 641495.56
   ├── G6: (projections G10)
   ├── G7: (scan foods,cols=(1-4))
   │    └── []
   │         ├── best: (scan foods,cols=(1-4))
   │         └── cost: 637560.16
   ├── G8: (filters G11)
   ├── G9: (index-join G12 foods,cols=(1-4))
   │    └── []
   │         ├── best: (index-join G12 foods,cols=(1-4))
   │         └── cost: 2357336.33
   ├── G10: (function G13 similarity)
   ├── G11: (mod G14 G15)
   ├── G12: (inverted-filter G16 name_inverted_key)
   │    └── []
   │         ├── best: (inverted-filter G16 name_inverted_key)
   │         └── cost: 408708.76
   ├── G13: (scalar-list G14 G15)
   ├── G14: (variable name)
   ├── G15: (const 'cake')
   └── G16: (scan foods@foods_name_idx,cols=(1,7),constrained inverted)
        └── []
             ├── best: (scan foods@foods_name_idx,cols=(1,7),constrained inverted)
             └── cost: 405747.30
  top-k
   ├── k: 10
   └── project
        ├── select
        │    ├── scan foods
        │    └── filters
        │         └── name % 'cake'
        └── projections
             └── similarity(name, 'cake')
(56 rows)

Time: 6ms total (execution 5ms / network 1ms)

I put together a little repo to toy around with various querying methods and it will rebuild the exact database that I've been using while poking around this issue: https://github.com/chrisseto/cockroach-trigrams.

@chrisseto try playing around with a build on this PR: https://github.com/cockroachdb/cockroach/pull/93757

Also note that running with EXPLAIN ANALYZE is much slower than running without EXPLAIN ANALYZE.

https://github.com/cockroachdb/cockroach/pull/93757 shaves about 0.5 seconds off some of my queries! Wow, I did not expect that to be the source of any slowness.

Oddly, the queries against already "analyzed" (Tokenized, stemmed, joined with spaces) only saw a 0.05 second improvement. Overall an amazing change, the results are a bit curious in some cases.

The problem with % vs ILIKE is that it needs to search the boundary trigrams - the ones that have spaces and then a letter, or a letter and than spaces. And those are very high cardinality typically for this dataset because they essentially reduce to a 1-gram. It's a bit of an annoying problem...

I investigated with Postgres, and I'm noticing that they have similar weirdness - depending on the index, a % query may or may not be accelerated with the index due to how expensive searching each of the boundary trigrams is.

Interesting! Thanks for digging into this further. I wonder if we could overload the similarity function to accept some tuning options. It would hurt compatibility a little bit but might result in a better UX? Though I suppose the actual solution here is TSVectors isn't it?

What should we do with this issue? It seems like this is less of a bug and more a very niche use case that may or may not be worth pouring more time into.

I agree, I don't think it's a bug, so I'll close it. I think the answer is tsvectors if you're looking for normal word search as opposed to fuzzy search. But fuzzy search still shouldn't be super slow - then again, I think we're working "as expected" here in terms of performance at least for now.

Hopefully tsvectors are coming... we'll see!

SGTM! Thanks for the improvements and all the debugging.

cockroachdb / cockroach

sql: Similarity queries (%) opt to not use GIN indexes based on query parameters #93830