ClickHouse / ClickHouse

ClickHouse® is a real-time analytics DBMS
https://clickhouse.com
Apache License 2.0
36.77k stars 6.8k forks source link

Vector similarity index does not work as expected. #69085

Open ucasfl opened 2 weeks ago

ucasfl commented 2 weeks ago

First, the test case in #68678.

CREATE TABLE tab_f32
(
    `id` Int32,
    `vec` Array(Float32),
    INDEX idx vec TYPE vector_similarity('hnsw', 'L2Distance', 'f32', 0, 0, 0) GRANULARITY 2
)
ENGINE = MergeTree
ORDER BY id
SETTINGS index_granularity = 3

Query id: c7e3f76b-be3e-4413-8428-c4bceb55a7c9

Ok.

0 rows in set. Elapsed: 0.002 sec. 

INSERT INTO tab_f32 VALUES (0, [4.6, 2.3]), (1, [2.0, 3.2]), (2, [4.2, 3.4]), (3, [5.3, 2.9]), (4, [2.4, 5.2]), (5, [5.3, 2.3]), (6, [1.0, 9.3]), (7, [5.5, 4.7]), (8, [6.4, 3.5]), (9, [5.3, 2.5]), (10, [6.4, 3.4]), (11, [6.4, 3.2]);

INSERT INTO tab_f32 FORMAT Values

Query id: e54f42d1-c0eb-4cfc-b1c8-e1a6c8ab0f45

Ok.

12 rows in set. Elapsed: 0.002 sec.

WITH [0., 2.] AS reference_vec
SELECT
    id,
    vec,
    L2Distance(vec, reference_vec)
FROM tab_f32
ORDER BY L2Distance(vec, reference_vec) ASC
LIMIT 3

Query id: 4df52400-817d-4aff-a100-cc3b9d43a5d6

   ┌─id─┬─vec───────┬─L2Distance(vec, reference_vec)─┐
1. │  1 │ [2,3.2]   │             2.3323807824711897 │
2. │  4 │ [2.4,5.2] │             3.9999999046325727 │
3. │  2 │ [4.2,3.4] │              4.427188573446585 │
   └────┴───────────┴────────────────────────────────┘

3 rows in set. Elapsed: 0.002 sec.

EXPLAIN indexes = 1
WITH [0., 2.] AS reference_vec
SELECT
    id,
    vec,
    L2Distance(vec, reference_vec)
FROM tab_f32
ORDER BY L2Distance(vec, reference_vec) ASC
LIMIT 3

Query id: 44cfdc13-25f6-46ed-8c05-93dc0fe968ca

    ┌─explain─────────────────────────────────────────────────────────────────────────────────────────┐
 1. │ Expression (Project names)                                                                      │
 2. │   Limit (preliminary LIMIT (without OFFSET))                                                    │
 3. │     Sorting (Sorting for ORDER BY)                                                              │
 4. │       Expression ((Before ORDER BY + (Projection + Change column names to column identifiers))) │
 5. │         ReadFromMergeTree (ch_wxg_weolap.tab_f32)                                               │
 6. │         Indexes:                                                                                │
 7. │           PrimaryKey                                                                            │
 8. │             Condition: true                                                                     │
 9. │             Parts: 1/1                                                                          │
10. │             Granules: 4/4                                                                       │
    └─────────────────────────────────────────────────────────────────────────────────────────────────┘

10 rows in set. Elapsed: 0.001 sec.

The results is not same as the reference in #68678.

Second, another case use our real dataset( ~ 25 millions X 768 dim):

CREATE TABLE dwd_me5_embedding
(
    `day_` Date COMMENT '分区字段',
    `xxx` String,
    `xxx` String,
    `xxx` String,
    `xxx` String,
    `me5_embedding` Array(Float32),
)
ENGINE = MergeTree
PARTITION BY day_
PRIMARY KEY tuple()
ORDER BY tuple()
SETTINGS index_granularity = 8192;

alter table dwd_me5_embedding add INDEX idx me5_embedding TYPE vector_similarity('hnsw', 'cosineDistance', 'f16', 16, 128, 64);

When I use such a SQL to query it, got results:

WITH (
        SELECT me5_embedding
        FROM dwd_me5_embedding
        LIMIT 1
    ) AS query_vector
SELECT cosineDistance(me5_embedding, query_vector) AS distance
FROM dwd_me5_embedding
ORDER BY distance ASC
LIMIT 5

Query id: a3a2ff4a-043a-4fef-bdf2-3e5206a9eab4

   ┌──────distance─┐
1. │ -1.1920929e-7 │
2. │ -1.1920929e-7 │
3. │ -1.1920929e-7 │
4. │ -1.1920929e-7 │
5. │ -1.1920929e-7 │
   └───────────────┘

5 rows in set. Elapsed: 6.373 sec. Processed 24.11 million rows, 74.26 GB (3.78 million rows/s., 11.65 GB/s.)
Peak memory usage: 1.02 GiB.

EXPLAIN indexes = 1
WITH (
        SELECT me5_embedding
        FROM dwd_me5_embedding
        LIMIT 1
    ) AS query_vector
SELECT cosineDistance(me5_embedding, query_vector) AS distance
FROM dwd_me5_embedding
ORDER BY distance ASC
LIMIT 5

Query id: 17cafccd-d359-4958-b246-e0bea9a1e233

    ┌─explain─────────────────────────────────────────────────────────────────────────────────────────┐
 1. │ Expression (Project names)                                                                      │
 2. │   Limit (preliminary LIMIT (without OFFSET))                                                    │
 3. │     Sorting (Sorting for ORDER BY)                                                              │
 4. │       Expression ((Before ORDER BY + (Projection + Change column names to column identifiers))) │
 5. │         ReadFromMergeTree (ch_wxg_weolap.dwd_me5_embedding)                            │
 6. │         Indexes:                                                                                │
 7. │           MinMax                                                                                │
 8. │             Condition: true                                                                     │
 9. │             Parts: 16/16                                                                        │
10. │             Granules: 7141/7141                                                                 │
11. │           Partition                                                                             │
12. │             Condition: true                                                                     │
13. │             Parts: 16/16                                                                        │
14. │             Granules: 7141/7141                                                                 │
    └─────────────────────────────────────────────────────────────────────────────────────────────────┘

14 rows in set. Elapsed: 0.005 sec. 

Looks like it does not use the vector-similarity index and why the distance is negtive? Is anything wrong of my usage? cc @rschu1ze

rschu1ze commented 2 weeks ago

I did not check in detail (I am on a business trip) but did you try to disable analyzer before querying the vector similarity index? Otherwise, the system will not be able to use the index (at least as of now).

ucasfl commented 2 weeks ago

I did not check in detail (I am on a business trip) but did you try to disable analyzer before querying the vector similarity index? Otherwise, the system will not be able to use the index (at least as of now).

EXPLAIN indexes = 1
WITH [0., 2.] AS reference_vec
SELECT
    id,
    vec,
    L2Distance(vec, reference_vec)
FROM tab_f32
ORDER BY L2Distance(vec, reference_vec) ASC
LIMIT 3
SETTINGS allow_experimental_analyzer = 0

Query id: 1ea9bb62-4dad-468f-9a73-24fb6135ca58

    ┌─explain──────────────────────────────────────────────────┐
 1. │ Expression (Projection)                                  │
 2. │   Limit (preliminary LIMIT (without OFFSET))             │
 3. │     Sorting (Sorting for ORDER BY)                       │
 4. │       Expression (Before ORDER BY)                       │
 5. │         ReadFromMergeTree (ch_wxg_weolap.tab_f32)        │
 6. │         Indexes:                                         │
 7. │           PrimaryKey                                     │
 8. │             Condition: true                              │
 9. │             Parts: 1/1                                   │
10. │             Granules: 4/4                                │
11. │           Skip                                           │
12. │             Name: idx                                    │
13. │             Description: vector_similarity GRANULARITY 2 │
14. │             Parts: 1/1                                   │
15. │             Granules: 2/4                                │
    └──────────────────────────────────────────────────────────┘

15 rows in set. Elapsed: 0.002 sec. 

The test case worked as expected.

But the last case still does not work:

EXPLAIN indexes = 1
WITH (
        SELECT me5_embedding
        FROM dwd_me5_embedding
        LIMIT 1
    ) AS query_vector
SELECT cosineDistance(me5_embedding, query_vector) AS distance
FROM dwd_me5_embedding
ORDER BY distance ASC
LIMIT 5
SETTINGS allow_experimental_analyzer = 0

Query id: fca2c1e4-8ec3-40a0-a051-334a85096a97

    ┌─explain──────────────────────────────────────────────────────────────┐
 1. │ Expression (Projection)                                              │
 2. │   Limit (preliminary LIMIT (without OFFSET))                         │
 3. │     Sorting (Sorting for ORDER BY)                                   │
 4. │       Expression (Before ORDER BY)                                   │
 5. │         ReadFromMergeTree (dwd_me5_embedding) │
 6. │         Indexes:                                                     │
 7. │           MinMax                                                     │
 8. │             Condition: true                                          │
 9. │             Parts: 16/16                                             │
10. │             Granules: 7141/7141                                      │
11. │           Partition                                                  │
12. │             Condition: true                                          │
13. │             Parts: 16/16                                             │
14. │             Granules: 7141/7141                                      │
    └──────────────────────────────────────────────────────────────────────┘

14 rows in set. Elapsed: 0.017 sec. 
rschu1ze commented 2 weeks ago

The distance measure (e.g. cosineDistance, L2Distance) specified during index creation must be the same as the one used during querying. Is that the case in the second example?

ucasfl commented 2 weeks ago

The distance measure (e.g. cosineDistance, L2Distance) specified during index creation must be the same as the one used during querying. Is that the case in the second example?

Yes, cosineDistance used in both index creation and querying.

ucasfl commented 2 weeks ago

Create table and insert data:

CREATE TABLE ch_wxg_weolap.tab_f16
(
    `id` Int32,
    `vec` Array(Float32),
    INDEX idx vec TYPE vector_similarity('hnsw', 'cosineDistance', 'f16', 0, 0, 0) GRANULARITY 2
)
ENGINE = MergeTree
ORDER BY id
SETTINGS index_granularity = 3

Works:

EXPLAIN indexes = 1
WITH [0., 2.] AS reference_vec
SELECT
    id,
    vec,
    cosineDistance(vec, reference_vec) AS distance
FROM tab_f16
ORDER BY distance ASC
LIMIT 3
SETTINGS allow_experimental_analyzer = 0

Query id: e768bf2b-b5f5-4d6c-9df7-2403cd393cae

    ┌─explain──────────────────────────────────────────────────┐
 1. │ Expression (Projection)                                  │
 2. │   Limit (preliminary LIMIT (without OFFSET))             │
 3. │     Sorting (Sorting for ORDER BY)                       │
 4. │       Expression (Before ORDER BY)                       │
 5. │         ReadFromMergeTree (ch_wxg_weolap.tab_f16)        │
 6. │         Indexes:                                         │
 7. │           PrimaryKey                                     │
 8. │             Condition: true                              │
 9. │             Parts: 1/1                                   │
10. │             Granules: 4/4                                │
11. │           Skip                                           │
12. │             Name: idx                                    │
13. │             Description: vector_similarity GRANULARITY 2 │
14. │             Parts: 1/1                                   │
15. │             Granules: 2/4                                │
    └──────────────────────────────────────────────────────────┘

15 rows in set. Elapsed: 0.002 sec.

Does not work:

EXPLAIN indexes = 1
WITH (
        SELECT vec
        FROM tab_f16
        LIMIT 1
    ) AS reference_vec
SELECT
    id,
    vec,
    cosineDistance(vec, reference_vec) AS distance
FROM tab_f16
ORDER BY distance ASC
LIMIT 3
SETTINGS allow_experimental_analyzer = 0

Query id: 2413bded-86bd-4ed4-ac71-6432b8cb93d5

    ┌─explain───────────────────────────────────────────┐
 1. │ Expression (Projection)                           │
 2. │   Limit (preliminary LIMIT (without OFFSET))      │
 3. │     Sorting (Sorting for ORDER BY)                │
 4. │       Expression (Before ORDER BY)                │
 5. │         ReadFromMergeTree (ch_wxg_weolap.tab_f16) │
 6. │         Indexes:                                  │
 7. │           PrimaryKey                              │
 8. │             Condition: true                       │
 9. │             Parts: 1/1                            │
10. │             Granules: 4/4                         │
    └───────────────────────────────────────────────────┘

10 rows in set. Elapsed: 0.002 sec.

Still does not work:

EXPLAIN indexes = 1
WITH (
        SELECT [0., 2.]
    ) AS reference_vec
SELECT
    id,
    vec,
    cosineDistance(vec, reference_vec) AS distance
FROM tab_f16
ORDER BY distance ASC
LIMIT 3
SETTINGS allow_experimental_analyzer = 0

Query id: 912801a9-c898-473f-ac5e-0ef5bc9d11b2

    ┌─explain───────────────────────────────────────────┐
 1. │ Expression (Projection)                           │
 2. │   Limit (preliminary LIMIT (without OFFSET))      │
 3. │     Sorting (Sorting for ORDER BY)                │
 4. │       Expression (Before ORDER BY)                │
 5. │         ReadFromMergeTree (ch_wxg_weolap.tab_f16) │
 6. │         Indexes:                                  │
 7. │           PrimaryKey                              │
 8. │             Condition: true                       │
 9. │             Parts: 1/1                            │
10. │             Granules: 4/4                         │
    └───────────────────────────────────────────────────┘

10 rows in set. Elapsed: 0.002 sec. 

@rschu1ze

rschu1ze commented 2 weeks ago

@ucasfl Nice find, thanks. The logic that matches queries with supported ANN queries is buggy. The thing is that the logic needs to be rewritten anyways to support the analyzer. For now, I created a test case to not forget about this.

ucasfl commented 2 weeks ago

The issue should not be closed?