asg017 / sqlite-vec

A vector search SQLite extension that runs anywhere!
Apache License 2.0
4.26k stars 135 forks source link

SELECT with JOIN incompatible with standard LIMIT clause #96

Open quantatic opened 2 months ago

quantatic commented 2 months ago

I am attempting to use the following script, but run into issues with the LIMIT clause.

.echo on
.load sqlite-vec/dist/vec0.so

CREATE VIRTUAL TABLE vectors using vec0(
    id              INTEGER PRIMARY KEY NOT NULL,
    vector          FLOAT[4]
);

INSERT INTO vectors (id, vector) VALUES (1, '[1, 2, 3, 4]');
INSERT INTO vectors (id, vector) VALUES (2, '[-1, -2, -3, -4]');
INSERT INTO vectors (id, vector) VALUES (3, '[1, 5, 2, 7]');

CREATE TABLE vectors_metadata(
    id              INTEGER PRIMARY KEY NOT NULL,
    vector_id       INTEGER UNIQUE NOT NULL,
    other_metadata  INT
);

INSERT INTO vectors_metadata (id, vector_id, other_metadata) VALUES (1, 1, 55);
INSERT INTO vectors_metadata (id, vector_id, other_metadata) VALUES (2, 2, 22);
INSERT INTO vectors_metadata (id, vector_id, other_metadata) VALUES (3, 3, -5);

SELECT vectors_metadata.other_metadata
    FROM vectors_metadata
    INNER JOIN vectors ON vectors_metadata.vector_id = vectors.id
    WHERE vectors.vector MATCH '[1, 2, 5, 4]'
    LIMIT 2;

Upon running this with

sqlite3 :memory: < script.sqlite

I receive the following error:

Parse error near line 23: A LIMIT or 'k = ?' constraint is required on vec0 knn queries.

I see another similar issue was opened in #41, which suggests upgrading to a sqlite version >=3.42. I am running a version higher than this, so I suspect this is a separate issue.

$ sqlite3 --version
3.46.1 2024-08-13 09:16:08 c9c2ab54ba1f5f46360f1b4f35d849cd3f080e6fc2b6c60e91b16c63f69aalt1 (64-bit)

I have tested this with the following tagged releases, (and the head of main at the time of posting this issue), and receive the same error with all commits.

asg017 commented 2 months ago

Thanks for the detailed report. A workaround is to use k = 2 in this specific query, like so:

SELECT vectors_metadata.other_metadata
FROM vectors_metadata
INNER JOIN vectors ON vectors_metadata.vector_id = vectors.id
WHERE vectors.vector MATCH '[1, 2, 5, 4]'
  AND k = 2;

Or perform the KNN query in a separate CTE step:

with knn_matches as (
    select id
    from vectors
    where vector match '[1, 2, 5, 4]'
      and k = 2
)
SELECT vectors_metadata.other_metadata
FROM knn_matches
INNER JOIN vectors_metadata ON vectors_metadata.vector_id = knn_matches.id;

In the second query, the k = 2 can be substituted with limit 2 in at least SQLite 3.46, maybe as early as 3.42.

I'm not 100% sure why the single query join doesn't work with the LIMIT clause. I fear this may be an edge case where the SQLite optimizer doesn't properly pass over the LIMIT constraint to the virtual table because of the JOIN.

Since the CTE works as expected, I'm not sure how big of a bug this is. In general I'm trying to only use the k = 999 syntax over the LIMIT 999 syntax because there are weird subtle errors with LIMIT on virtual tables, like this odd bug. If I track down the root cause I can file a bug with the SQLite team, but it might be hard to track down .Personally I like the CTE approach anyway since it clearly separate the KNN search from the metadata JOINs, but that's just personal preference

quantatic commented 2 months ago

Much appreciated for the information. In my particular case, the k = <n> approach doesn't quite do what I want when I add further filters, for instance:

SELECT vectors_metadata.other_metadata
FROM vectors_metadata
INNER JOIN vectors ON vectors_metadata.vector_id = vectors.id
WHERE vectors.vector MATCH '[1, 2, 5, 4]'
  AND k = 2
  AND vectors_metadata.other_metadata > 0;

In this case, the k=2 filter appears to be applied first (getting the two closest matches), and only then are further filters applied. This means that when k=2 is used, for instance, the query may return <2 rows even when two or more valid rows exist, if the two closest vectors (or joined rows) fail to pass the additional conditions. For a more concrete example, with the same database setup as above, I run the following query:

SELECT vectors_metadata.other_metadata
FROM vectors_metadata
INNER JOIN vectors ON vectors_metadata.vector_id = vectors.id
WHERE vectors.vector MATCH '[1, 5, 2, 7]'
    AND k = 2
    AND vectors_metadata.other_metadata > 0;

I expect this query to return the values [55, 22], however the only row returned is [55].

Is there a way to get the behavior I'm looking for in this case with the k = <n> syntax? This may answer the question posed here as well.

Cheers :smile:

davidmezzetti commented 2 months ago

Yes, I was posing a similar question, thank you for being more direct here.

In comparing to a solution like vectorlite, I'm trying to see how sqlite-vec can perform this example: https://github.com/1yefuwang1/vectorlite/blob/main/examples/metadata_filter.py

The workaround here is setting a very big K/LIMIT but I'm curious on the options.

davidmezzetti commented 2 months ago

It seems this example SQL can support this use case.

SELECT id, distance FROM vectors v
WHERE vector MATCH '[1, 2, 5, 4]' AND id in (SELECT id FROM vectors_metadata o WHERE other_metadata > 0)
LIMIT 2
bellyjuice commented 2 months ago

It seems this example SQL can support this use case.

SELECT id, distance FROM vectors v
WHERE vector MATCH '[1, 2, 5, 4]' AND id in (SELECT id FROM vectors_metadata o WHERE other_metadata > 0)
LIMIT 2

It will work but has 2 problem: 1, what if the ids filtered by the second select are too many? 2, there is no way to get the values in other columns in a single query, we have to query again in another table then combine the results.

The FTS4 is also using virtual table and it doesn't have this issue, the fts4 virtual table can join other tables.