Open ucasfl opened 2 weeks ago
I did not check in detail (I am on a business trip), but did you try disabling the analyzer before querying the vector similarity index? Otherwise, the system will not be able to use the index (at least as of now).
EXPLAIN indexes = 1
WITH [0., 2.] AS reference_vec
SELECT
    id,
    vec,
    L2Distance(vec, reference_vec)
FROM tab_f32
ORDER BY L2Distance(vec, reference_vec) ASC
LIMIT 3
SETTINGS allow_experimental_analyzer = 0
Query id: 1ea9bb62-4dad-468f-9a73-24fb6135ca58
┌─explain──────────────────────────────────────────────────┐
1. │ Expression (Projection) │
2. │ Limit (preliminary LIMIT (without OFFSET)) │
3. │ Sorting (Sorting for ORDER BY) │
4. │ Expression (Before ORDER BY) │
5. │ ReadFromMergeTree (ch_wxg_weolap.tab_f32) │
6. │ Indexes: │
7. │ PrimaryKey │
8. │ Condition: true │
9. │ Parts: 1/1 │
10. │ Granules: 4/4 │
11. │ Skip │
12. │ Name: idx │
13. │ Description: vector_similarity GRANULARITY 2 │
14. │ Parts: 1/1 │
15. │ Granules: 2/4 │
└──────────────────────────────────────────────────────────┘
15 rows in set. Elapsed: 0.002 sec.
The test case worked as expected.
But the last case still does not work:
EXPLAIN indexes = 1
WITH (
    SELECT me5_embedding
    FROM dwd_me5_embedding
    LIMIT 1
) AS query_vector
SELECT cosineDistance(me5_embedding, query_vector) AS distance
FROM dwd_me5_embedding
ORDER BY distance ASC
LIMIT 5
SETTINGS allow_experimental_analyzer = 0
Query id: fca2c1e4-8ec3-40a0-a051-334a85096a97
┌─explain──────────────────────────────────────────────────────────────┐
1. │ Expression (Projection) │
2. │ Limit (preliminary LIMIT (without OFFSET)) │
3. │ Sorting (Sorting for ORDER BY) │
4. │ Expression (Before ORDER BY) │
5. │ ReadFromMergeTree (dwd_me5_embedding) │
6. │ Indexes: │
7. │ MinMax │
8. │ Condition: true │
9. │ Parts: 16/16 │
10. │ Granules: 7141/7141 │
11. │ Partition │
12. │ Condition: true │
13. │ Parts: 16/16 │
14. │ Granules: 7141/7141 │
└──────────────────────────────────────────────────────────────────────┘
14 rows in set. Elapsed: 0.017 sec.
The distance measure (e.g. cosineDistance, L2Distance) specified during index creation must be the same as the one used during querying. Is that the case in the second example?
Yes, cosineDistance is used in both index creation and querying.
Create table and insert data:
CREATE TABLE ch_wxg_weolap.tab_f16
(
    `id` Int32,
    `vec` Array(Float32),
    INDEX idx vec TYPE vector_similarity('hnsw', 'cosineDistance', 'f16', 0, 0, 0) GRANULARITY 2
)
ENGINE = MergeTree
ORDER BY id
SETTINGS index_granularity = 3
Works:
EXPLAIN indexes = 1
WITH [0., 2.] AS reference_vec
SELECT
    id,
    vec,
    cosineDistance(vec, reference_vec) AS distance
FROM tab_f16
ORDER BY distance ASC
LIMIT 3
SETTINGS allow_experimental_analyzer = 0
Query id: e768bf2b-b5f5-4d6c-9df7-2403cd393cae
┌─explain──────────────────────────────────────────────────┐
1. │ Expression (Projection) │
2. │ Limit (preliminary LIMIT (without OFFSET)) │
3. │ Sorting (Sorting for ORDER BY) │
4. │ Expression (Before ORDER BY) │
5. │ ReadFromMergeTree (ch_wxg_weolap.tab_f16) │
6. │ Indexes: │
7. │ PrimaryKey │
8. │ Condition: true │
9. │ Parts: 1/1 │
10. │ Granules: 4/4 │
11. │ Skip │
12. │ Name: idx │
13. │ Description: vector_similarity GRANULARITY 2 │
14. │ Parts: 1/1 │
15. │ Granules: 2/4 │
└──────────────────────────────────────────────────────────┘
15 rows in set. Elapsed: 0.002 sec.
Does not work:
EXPLAIN indexes = 1
WITH (
    SELECT vec
    FROM tab_f16
    LIMIT 1
) AS reference_vec
SELECT
    id,
    vec,
    cosineDistance(vec, reference_vec) AS distance
FROM tab_f16
ORDER BY distance ASC
LIMIT 3
SETTINGS allow_experimental_analyzer = 0
Query id: 2413bded-86bd-4ed4-ac71-6432b8cb93d5
┌─explain───────────────────────────────────────────┐
1. │ Expression (Projection) │
2. │ Limit (preliminary LIMIT (without OFFSET)) │
3. │ Sorting (Sorting for ORDER BY) │
4. │ Expression (Before ORDER BY) │
5. │ ReadFromMergeTree (ch_wxg_weolap.tab_f16) │
6. │ Indexes: │
7. │ PrimaryKey │
8. │ Condition: true │
9. │ Parts: 1/1 │
10. │ Granules: 4/4 │
└───────────────────────────────────────────────────┘
10 rows in set. Elapsed: 0.002 sec.
Still does not work:
EXPLAIN indexes = 1
WITH (
    SELECT [0., 2.]
) AS reference_vec
SELECT
    id,
    vec,
    cosineDistance(vec, reference_vec) AS distance
FROM tab_f16
ORDER BY distance ASC
LIMIT 3
SETTINGS allow_experimental_analyzer = 0
Query id: 912801a9-c898-473f-ac5e-0ef5bc9d11b2
┌─explain───────────────────────────────────────────┐
1. │ Expression (Projection) │
2. │ Limit (preliminary LIMIT (without OFFSET)) │
3. │ Sorting (Sorting for ORDER BY) │
4. │ Expression (Before ORDER BY) │
5. │ ReadFromMergeTree (ch_wxg_weolap.tab_f16) │
6. │ Indexes: │
7. │ PrimaryKey │
8. │ Condition: true │
9. │ Parts: 1/1 │
10. │ Granules: 4/4 │
└───────────────────────────────────────────────────┘
10 rows in set. Elapsed: 0.002 sec.
@rschu1ze
@ucasfl Nice find, thanks. The logic that matches queries against the supported ANN query patterns is buggy. That logic needs to be rewritten anyway to support the analyzer. For now, I created a test case so that this does not get forgotten.
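Until the matching logic is rewritten, one possible workaround follows from the examples above: the index was used when the reference vector was an array literal and skipped when it came from a scalar subquery. The sketch below splits the search into two statements on that basis. It is not a confirmed fix, the `[0., 2.]` literal is a placeholder for whatever the first query returns, and the plan should be re-checked with `EXPLAIN indexes = 1`.

```sql
-- Step 1 (run separately): fetch the vector to search for.
SELECT vec
FROM tab_f16
LIMIT 1;

-- Step 2: paste the array returned by step 1 as a literal.
-- [0., 2.] is a placeholder value, not data taken from this thread.
EXPLAIN indexes = 1
WITH [0., 2.] AS reference_vec
SELECT
    id,
    vec,
    cosineDistance(vec, reference_vec) AS distance
FROM tab_f16
ORDER BY distance ASC
LIMIT 3
SETTINGS allow_experimental_analyzer = 0;
```

If the `Skip` section with `Name: idx` appears in the plan, as in the working example above, the index is being used for the literal form.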
The issue should not be closed?
First, the test case in #68678.
The results are not the same as the reference in #68678.
Second, another case uses our real dataset (~25 million × 768 dimensions):
When I query it with SQL like this, I get these results:
It looks like the vector similarity index is not used, and why is the distance negative? Is there anything wrong with my usage? cc @rschu1ze