duckdb / duckdb_vss

MIT License
98 stars 10 forks source link

extension does not seem to do hybrid queries between HNSW index and typical duckdb scalar where clauses. #35

Open malcolm-smith-mck opened 4 weeks ago

malcolm-smith-mck commented 4 weeks ago

For this example (like the basic readme example but I added a varchar column so that I could test if the extension could filter by a scalar column and also process results with array distance and/or using HNSW indexes. I noted that the usearch library has support for push down filters but the vss extensions does not seem to support this operation.

These tests failed, the extension does not seem to support hybrid scalar and vector filters. This feature is very important to my key use cases which require role based security access to GenAI RAG vector stores - e.g. As a user in business unit 1 I can see only RAG chunks that are part of business unit 1 processed documents.

`DROP TABLE IF EXISTS my_vector_table; CREATE TABLE my_vector_table (vec FLOAT[3],cluster_name varchar);

INSERT INTO my_vector_table (vec) SELECT array_value(a,b,c) FROM range(0,100) ra(a), range(0,100) rb(b), range(0,100) rc(c); -- 1M records update my_vector_table set cluster_name = (random()*10+100)::int; -- assigns fake cluster names 101, 102,103 with ~100K record each

CREATE INDEX my_hnsw_index ON my_vector_table USING HNSW (vec); CREATE INDEX my_cluster_name_idx on my_vector_table(cluster_name);

SELECT * FROM my_vector_table where cluster_name = '101' ORDER BY array_distance(vec, [1,2,3]::FLOAT[3]) limit 3; -- does not use HNSW index only uses scalar index -- PROBLEM 1

select from (SELECT FROM my_vector_table where cluster_name = '101' ORDER BY array_distance(vec, [1,2,3]::FLOAT[3]) limit 1000) as tt limit 3; -- this uses the HNSW index but requires the inner limit to be set high enough to happen to include some rows with cluster_name = '101' -- PROBLEM 2 `

PROBLEM 1 - the query fails to any results, does use the HNSW index but seems to apply the plan incorrectly ┌─────────────────────────────┐ │┌───────────────────────────┐│ ││ Physical Plan ││ │└───────────────────────────┘│ └─────────────────────────────┘ ┌───────────────────────────┐ │ PROJECTION │ │ ──────────────────── │ │ #0 │ │ #1 │ │ │ │ ~3 Rows │ └─────────────┬─────────────┘ ┌─────────────┴─────────────┐ │ PROJECTION │ │ ──────────────────── │ │ vec │ │ cluster_name │ │ array_distance(vec, [1.0, │ │ 2.0, 3.0]) │ │ │ │ ~200000 Rows │ └─────────────┬─────────────┘ ┌─────────────┴─────────────┐ │ FILTER │ │ ──────────────────── │ │ ((#0 = '101') AND (#0 IS │ │ NOT NULL)) │ │ │ │ ~1000000 Rows │ └─────────────┬─────────────┘ ┌─────────────┴─────────────┐ │ HNSW_INDEX_SCAN │ │ ──────────────────── │ │ my_vector_table (HNSW │ │ INDEX SCAN : │ │ my_hnsw_index) │ │ │ │ Projections: │ │ cluster_name │ │ vec │ │ │ │ ~1000000 Rows │ └───────────────────────────┘

PROBLEM 2 - no combination of temporary tables, CTE's or other SQL tactics will cause the engine to first create a list of cluster_name = '101' outputs which are then provided to the array_distance limit process - this sql is a hack to try to get some number of results from the HNSW index and then filtering them back to a smaller matching list. This does return a result but is not reliable because results are not really filtering on both restrictions. Explain plan also does not look quite right to me, in terms of sequence of operations.

` explain select from (SELECT FROM my_vector_table where cluster_name = '101' ORDER BY array_distance(vec, [1,2,3]::FLOAT[3]) limit 1000) as tt limit 3;

┌─────────────────────────────┐ │┌───────────────────────────┐│ ││ Physical Plan ││ │└───────────────────────────┘│ └─────────────────────────────┘ ┌───────────────────────────┐ │ PROJECTION │ │ ──────────────────── │ │ #0 │ │ #1 │ │ │ │ ~1000 Rows │ └─────────────┬─────────────┘ ┌─────────────┴─────────────┐ │ STREAMING_LIMIT │ └─────────────┬─────────────┘ ┌─────────────┴─────────────┐ │ PROJECTION │ │ ──────────────────── │ │ vec │ │ cluster_name │ │ array_distance(vec, [1.0, │ │ 2.0, 3.0]) │ │ │ │ ~200000 Rows │ └─────────────┬─────────────┘ ┌─────────────┴─────────────┐ │ FILTER │ │ ──────────────────── │ │ ((#0 = '101') AND (#0 IS │ │ NOT NULL)) │ │ │ │ ~1000000 Rows │ └─────────────┬─────────────┘ ┌─────────────┴─────────────┐ │ HNSW_INDEX_SCAN │ │ ──────────────────── │ │ my_vector_table (HNSW │ │ INDEX SCAN : │ │ my_hnsw_index) │ │ │ │ Projections: │ │ cluster_name │ │ vec │ │ │ │ ~1000000 Rows │ └───────────────────────────┘ `