A few other things to note:
It seems that "ca" is what triggers the behavior, and "ch" does as well.
root@localhost:26257/defaultdb> EXPLAIN SELECT food_id FROM words WHERE word % 'ca';
info
-----------------------------------------------------------------------------------------------------------------------------------
distribution: local
vectorized: true
• filter
│ estimated row count: 214,860
│ filter: word % 'ca'
│
└── • scan
estimated row count: 644,580 (100% of the table; stats collected 32 minutes ago; using stats forecast for 31 minutes ago)
table: words@words_pkey
spans: FULL SCAN
(11 rows)
Time: 5ms total (execution 4ms / network 0ms)
root@localhost:26257/defaultdb> EXPLAIN SELECT food_id FROM words WHERE word % 'ch';
info
-----------------------------------------------------------------------------------------------------------------------------------
distribution: local
vectorized: true
• filter
│ estimated row count: 214,860
│ filter: word % 'ch'
│
└── • scan
estimated row count: 644,580 (100% of the table; stats collected 32 minutes ago; using stats forecast for 31 minutes ago)
table: words@words_pkey
spans: FULL SCAN
(11 rows)
Time: 3ms total (execution 3ms / network 0ms)
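(An aside, not from the original thread: if the show_trgm builtin is available in this build, it helps show why two-character terms behave this way - they only produce padded "boundary" trigrams, which tend to match a very large share of rows.)
SELECT show_trgm('ca');
-- per pg_trgm semantics, roughly: {"  c"," ca","ca "} - all boundary trigrams
SELECT show_trgm('cheese');
-- longer terms add interior trigrams such as 'hee' and 'ees', which are usually far more selective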
Increasing the length of queries using these problematic prefixes doesn't seem to do much either:
root@localhost:26257/defaultdb> EXPLAIN SELECT food_id FROM words WHERE word % 'cheese';
info
-----------------------------------------------------------------------------------------------------------------------------------
distribution: local
vectorized: true
• filter
│ estimated row count: 214,860
│ filter: word % 'cheese'
│
└── • scan
estimated row count: 644,580 (100% of the table; stats collected 33 minutes ago; using stats forecast for 32 minutes ago)
table: words@words_pkey
spans: FULL SCAN
(11 rows)
Time: 4ms total (execution 4ms / network 0ms)
root@localhost:26257/defaultdb> EXPLAIN SELECT food_id FROM words WHERE word % 'cakes';
info
-----------------------------------------------------------------------------------------------------------------------------------
distribution: local
vectorized: true
• filter
│ estimated row count: 214,860
│ filter: word % 'cakes'
│
└── • scan
estimated row count: 644,580 (100% of the table; stats collected 33 minutes ago; using stats forecast for 32 minutes ago)
table: words@words_pkey
spans: FULL SCAN
(11 rows)
Time: 4ms total (execution 3ms / network 0ms)
These suboptimal plans will also "infect" otherwise good plans when using an OR clause.
root@localhost:26257/defaultdb> EXPLAIN SELECT food_id FROM words WHERE word % 'eggs';
info
---------------------------------------------------------------------------------------------------------------------------------------
distribution: local
vectorized: true
• filter
│ estimated row count: 214,860
│ filter: word % 'eggs'
│
└── • index join
│ estimated row count: 0
│ table: words@words_pkey
│
└── • inverted filter
│ estimated row count: 0
│ inverted column: word_inverted_key
│ num spans: 5
│
└── • scan
estimated row count: 0 (<0.01% of the table; stats collected 34 minutes ago; using stats forecast for 33 minutes ago)
table: words@words_word_idx
spans: 5 spans
(20 rows)
Time: 3ms total (execution 3ms / network 1ms)
root@localhost:26257/defaultdb> EXPLAIN SELECT food_id FROM words WHERE word % 'eggs' OR word % 'cakes';
info
-----------------------------------------------------------------------------------------------------------------------------------
distribution: local
vectorized: true
• filter
│ estimated row count: 214,860
│ filter: (word % 'eggs') OR (word % 'cakes')
│
└── • scan
estimated row count: 644,580 (100% of the table; stats collected 34 minutes ago; using stats forecast for 33 minutes ago)
table: words@words_pkey
spans: FULL SCAN
(11 rows)
Time: 4ms total (execution 3ms / network 0ms)
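(A hedged aside, not part of the original report: one quick way to sanity-check the row-count estimates driving these plans is to compare them against actual counts; the table and column names are the ones used above.)
SELECT count(*) FROM words WHERE word % 'cakes';
SELECT count(*) FROM words WHERE word % 'eggs';
-- If the actual counts are far below the estimated 214,860 rows, the full scan
-- is being chosen on the basis of an overly pessimistic selectivity estimate.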
It's important to note that the optimizer cannot (and will never, unless there's a bug) plan a query that uses a trigram inverted index if the search term contains fewer than 3 letters. So the example queries that you're seeing that don't work (ca and ch) are expected not to work. Are there examples of 2-character queries that do use the inverted index?
As for the OR case, there are more trigrams in that query, and the more trigrams there are, the more scans are needed and the more selective the trigrams must be to "win" over the full scan.
You can use EXPLAIN(opt,memo) to get enough detail to track down what's going on here.
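(For reference, a minimal invocation against the tables above would look like this; the same syntax is used later in the thread.)
EXPLAIN(opt,memo) SELECT food_id FROM words WHERE word % 'cakes';
-- The memo lists each candidate expression group with its estimated cost, so the
-- cost of the full-scan plan can be compared against the inverted-index path.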
I take back what I said about two-character % queries not being able to use the index. They're able to use it because padding is added.
I think any issue here is related to stats. I also think that, unfortunately, the stats collection is working fine, because hinting the inverted index for the cake query actually results in worse performance.
There is some low-hanging fruit in execution performance here, though, so hopefully we can speed this up a bit regardless of the stats issues.
Huh, I could have sworn that forcing the use of the index resulted in a notable speedup, but after rebuilding the table it's significantly slower.
root@localhost:26257/defaultdb> EXPLAIN ANALYZE SELECT * FROM foods@foods_name_idx WHERE name % 'cake' ORDER BY similarity(name, 'cake') LIMIT 10;
info
--------------------------------------------------------------------------------------------------------
planning time: 2ms
execution time: 12.7s
distribution: full
vectorized: true
rows read from KV: 565,017 (50 MiB, 9 gRPC calls)
cumulative time spent in KV: 10.9s
maximum memory usage: 56 MiB
network usage: 0 B (0 messages)
• top-k
│ nodes: n1
│ actual row count: 10
│ estimated max memory allocated: 10 KiB
│ estimated max sql temp disk usage: 0 B
│ estimated row count: 10
│ order: +column9
│ k: 10
│
└── • render
│
└── • filter
│ nodes: n1
│ actual row count: 91
│ estimated row count: 131,179
│ filter: name % 'cake'
│
└── • index join
│ nodes: n1
│ actual row count: 238,051
│ KV time: 10.8s
│ KV contention time: 0µs
│ KV rows read: 238,051
│ KV bytes read: 34 MiB
│ KV gRPC calls: 7
│ estimated max memory allocated: 41 MiB
│ estimated max sql temp disk usage: 0 B
│ estimated row count: 296,144
│ table: foods@foods_pkey
│
└── • inverted filter
│ nodes: n1
│ actual row count: 238,051
│ estimated max memory allocated: 15 MiB
│ estimated max sql temp disk usage: 0 B
│ estimated row count: 296,144
│ inverted column: name_inverted_key
│ num spans: 5
│
└── • scan
nodes: n1
actual row count: 326,966
KV time: 104ms
KV contention time: 0µs
KV rows read: 326,966
KV bytes read: 16 MiB
KV gRPC calls: 2
estimated max memory allocated: 10 MiB
estimated row count: 296,144 (75% of the table; stats collected 2 minutes ago)
table: foods@foods_name_idx
spans: 5 spans
(60 rows)
Time: 12.656s total (execution 12.656s / network 0.000s)
root@localhost:26257/defaultdb> EXPLAIN ANALYZE SELECT * FROM foods WHERE name % 'cake' ORDER BY similarity(name, 'cake') LIMIT 10;
info
-------------------------------------------------------------------------------------------------
planning time: 2ms
execution time: 2.5s
distribution: full
vectorized: true
rows read from KV: 393,537 (59 MiB, 6 gRPC calls)
cumulative time spent in KV: 130ms
maximum memory usage: 10 MiB
network usage: 0 B (0 messages)
• top-k
│ nodes: n1
│ actual row count: 10
│ estimated max memory allocated: 10 KiB
│ estimated max sql temp disk usage: 0 B
│ estimated row count: 10
│ order: +column9
│ k: 10
│
└── • render
│
└── • filter
│ nodes: n1
│ actual row count: 91
│ estimated row count: 131,179
│ filter: name % 'cake'
│
└── • scan
nodes: n1
actual row count: 393,537
KV time: 130ms
KV contention time: 0µs
KV rows read: 393,537
KV bytes read: 59 MiB
KV gRPC calls: 6
estimated max memory allocated: 10 MiB
estimated row count: 393,537 (100% of the table; stats collected 2 minutes ago)
table: foods@foods_pkey
spans: FULL SCAN
(38 rows)
Time: 2.500s total (execution 2.500s / network 0.000s)
Here's the updated EXPLAIN output:
root@localhost:26257/defaultdb> EXPLAIN(opt,memo) SELECT * FROM foods WHERE name % 'cake' ORDER BY similarity(name, 'cake') LIMIT 10;
info
---------------------------------------------------------------------------------------------------------------------------------------------------------------------
memo (optimized, ~10KB, required=[presentation: info:10])
├── G1: (explain G2 [presentation: id:1,name:2,analyzed:3,weight:4] [ordering: +9])
│ └── [presentation: info:10]
│ ├── best: (explain G2="[presentation: id:1,name:2,analyzed:3,weight:4] [ordering: +9]" [presentation: id:1,name:2,analyzed:3,weight:4] [ordering: +9])
│ └── cost: 655458.61
├── G2: (limit G3 G4 ordering=+9) (top-k G3 &{10 +9 })
│ ├── [presentation: id:1,name:2,analyzed:3,weight:4] [ordering: +9]
│ │ ├── best: (top-k G3 &{10 +9 })
│ │ └── cost: 655458.59
│ └── []
│ ├── best: (top-k G3 &{10 +9 })
│ └── cost: 655458.59
├── G3: (project G5 G6 id name analyzed weight)
│ ├── [ordering: +9] [limit hint: 10.00]
│ │ ├── best: (sort G3)
│ │ └── cost: 700593.11
│ └── []
│ ├── best: (project G5 G6 id name analyzed weight)
│ └── cost: 644119.16
├── G4: (const 10)
├── G5: (select G7 G8) (select G9 G8)
│ └── []
│ ├── best: (select G7 G8)
│ └── cost: 641495.56
├── G6: (projections G10)
├── G7: (scan foods,cols=(1-4))
│ └── []
│ ├── best: (scan foods,cols=(1-4))
│ └── cost: 637560.16
├── G8: (filters G11)
├── G9: (index-join G12 foods,cols=(1-4))
│ └── []
│ ├── best: (index-join G12 foods,cols=(1-4))
│ └── cost: 2357336.33
├── G10: (function G13 similarity)
├── G11: (mod G14 G15)
├── G12: (inverted-filter G16 name_inverted_key)
│ └── []
│ ├── best: (inverted-filter G16 name_inverted_key)
│ └── cost: 408708.76
├── G13: (scalar-list G14 G15)
├── G14: (variable name)
├── G15: (const 'cake')
└── G16: (scan foods@foods_name_idx,cols=(1,7),constrained inverted)
└── []
├── best: (scan foods@foods_name_idx,cols=(1,7),constrained inverted)
└── cost: 405747.30
top-k
├── k: 10
└── project
├── select
│ ├── scan foods
│ └── filters
│ └── name % 'cake'
└── projections
└── similarity(name, 'cake')
(56 rows)
Time: 6ms total (execution 5ms / network 1ms)
I put together a little repo for toying around with various querying methods; it will rebuild the exact database that I've been using while poking around this issue: https://github.com/chrisseto/cockroach-trigrams.
@chrisseto try playing around with a build on this PR: https://github.com/cockroachdb/cockroach/pull/93757
Also note that running with EXPLAIN ANALYZE is much slower than running without it.
https://github.com/cockroachdb/cockroach/pull/93757 shaves about 0.5 seconds off some of my queries! Wow, I did not expect that to be the source of any slowness. Oddly, the queries against the already-"analyzed" (tokenized, stemmed, joined with spaces) column only saw a 0.05 second improvement. Overall an amazing change, though the results are a bit curious in some cases.
The problem with % vs ILIKE is that % needs to search the boundary trigrams - the ones that have spaces and then a letter, or a letter and then spaces. And those are typically very high cardinality for this dataset, because they essentially reduce to a 1-gram. It's a bit of an annoying problem...
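(To make the boundary-trigram point concrete - my illustration, assuming show_trgm follows pg_trgm semantics:)
SELECT show_trgm('cake');
-- approximately: {"  c"," ca","ake","cak","ke "}
-- "  c", " ca", and "ke " are the boundary trigrams: they constrain only the first
-- one or two characters (or the last two), so each one matches a large fraction of rows.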
I investigated with Postgres, and I'm noticing that it has similar weirdness - depending on the index, a % query may or may not be accelerated by the index, due to how expensive searching each of the boundary trigrams is.
Interesting! Thanks for digging into this further. I wonder if we could overload the similarity function to accept some tuning options. It would hurt compatibility a little bit but might result in a better UX? Though I suppose the actual solution here is TSVectors, isn't it?
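(A hedged aside: rather than overloading similarity(), the % operator's threshold can already be tuned per session via the pg_trgm.similarity_threshold setting, which Postgres's pg_trgm supports and which I believe recent CockroachDB versions accept as well.)
SET pg_trgm.similarity_threshold = 0.5;  -- pg_trgm's default is 0.3
SELECT * FROM foods WHERE name % 'cake' ORDER BY similarity(name, 'cake') LIMIT 10;
-- A higher threshold makes % more selective, which changes both the result set and,
-- potentially, the plan; it is not a per-call option on similarity() itself.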
What should we do with this issue? It seems like this is less of a bug and more of a very niche use case that may or may not be worth pouring more time into.
I agree, I don't think it's a bug, so I'll close it. I think the answer is tsvectors if you're looking for normal word search as opposed to fuzzy search. But fuzzy search still shouldn't be super slow - then again, I think we're working "as expected" here in terms of performance at least for now.
Hopefully tsvectors are coming... we'll see!
SGTM! Thanks for the improvements and all the debugging.
Describe the problem
A simple SELECT * FROM table WHERE column % 'word' query will ignore the trigram index for certain phrases.

To Reproduce
I have a newly created single-node CockroachDB instance and have indexed the USDA Food Dataset food names after manually tokenizing and stemming the strings.
When experimenting with various queries, I noticed that certain words will not use the trigram indexes:
The above queries don't result in dramatically different result counts but the execution times are dramatically different.
Expected behavior
I would expect that any % query would result in a query plan that uses the available trigram index.

Additional data / screenshots
Just to give you an idea of the trigram distribution and table size:
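(A sketch of how such a distribution can be computed, assuming the show_trgm builtin and unnest are available; table and column names are the ones used above.)
SELECT trgm, count(*) AS freq
FROM (SELECT unnest(show_trgm(word)) AS trgm FROM words) AS t
GROUP BY trgm
ORDER BY freq DESC
LIMIT 20;
-- The padded boundary trigrams (e.g. '  c', ' ca') tend to rank high here,
-- since each one essentially reduces to a 1- or 2-gram.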
Environment:
cockroach sql
Additional context
It's a bit concerning to me that a simple parameter change, one that is likely to be user-controlled in the case of trigrams, could dramatically alter the query plan and execution time of queries.
Jira issue: CRDB-22544