asg017 / sqlite-vec

A vector search SQLite extension that runs anywhere!

Suggestion for variable length vector columns #16

Open polterguy opened 2 months ago

polterguy commented 2 months ago

I'm a heavy user of sqlite-vss, as you're probably aware, having provided several PRs for it. One thing that always bothered me with sqlite-vss is that the column length is static, declared as part of the column definition. This creates problems because combining different vector dimensions requires multiple tables, one for each vector dimension you intend to use.

I'm not sure how easy it would be to solve, but I assume you could prefix each vector in your shadow table with its length, and if the user compares a vector of 500 dimensions against a source of 1000, for instance, or vice versa, you could simply truncate the longer vector. For instance, OpenAI's latest "text-embedding-small-002" (I think it's called) still compares relatively accurately if you simply truncate its vectors.
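To illustrate the idea, a truncated comparison could look something like this (a hypothetical sketch using sqlite-vec's vec_slice and vec_distance_cosine; :stored_1000 and :query_500 are placeholder parameters, not anything sqlite-vec defines):

select vec_distance_cosine(
  vec_slice(:stored_1000, 0, 500), -- truncate the 1000-dim vector to 500 dims
  :query_500
);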

As to the memory problem: if you can significantly reduce the memory footprint, that would be a huge thing! I've reduced it a lot myself in sqlite-vss, but it's still a problem ... :/

asg017 commented 2 months ago

Hey @polterguy, great to hear from you! I definitely remember and appreciate your contributions.

Regarding memory: yes, sqlite-vec is much more efficient than sqlite-vss. The vector index isn't stored in memory; it's accessed in "chunks", so the overall memory usage is much smaller. This means queries are a bit slower, since they involve disk access, but that should be a fine tradeoff for most applications.
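For reference, the chunk size is configurable when you declare a vec0 table; a minimal sketch, assuming the chunk_size option:

create virtual table vec_demo using vec0(
  chunk_size=1024,      -- vectors stored per chunk; smaller chunks lower peak memory
  embedding float[1536] -- but may mean more chunk reads per query
);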

Re adaptive-length vectors: a column has to store the same length vector for every row, at least for vec0 virtual tables. Though you can implement a "re-ranking" step manually with matryoshka embedding models like text-embedding-small-002, like so:

create virtual table vec_items using vec0(
  title_embedding float[1536],
  title_embedding_coarse float[512]
);

insert into vec_items(rowid, title_embedding, title_embedding_coarse)
  values (:rowid, :embedding, vec_normalize(vec_slice(:embedding, 0, 512)));
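
(The vec_normalize call is there because the first 512 dimensions of a unit-length embedding are no longer unit-length once sliced, so the truncated prefix is re-normalized before it's stored and compared.)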

And to query:

with coarse_results as (
  select 
    rowid,
    title_embedding,
    distance
  from vec_items
  where title_embedding_coarse match vec_normalize(vec_slice(:query, 0, 512))
    and k = 100 -- over-sample: pull 100 coarse candidates for re-ranking
  order by distance
)
select
  rowid,
  vec_distance_l2(:query, title_embedding) as distance -- re-rank on the full vectors
from coarse_results
order by distance
limit 20;

Here we use a shorter version of title_embedding, the first 512 dimensions, storing both the short and full vectors in separate columns. When querying, we search on the shorter embeddings first, over-sampling to get the closest 100 vectors. Then we re-rank those candidates against the full vectors, which should give good results. You can adjust the oversampling k and the final limit to tune quality vs. speed.

Another option is vec_quantize_binary(), which converts float vectors into bit vectors that are 32x smaller, and then uses hamming distance on those. The quality may be a bit worse, but it should be much faster.
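
A minimal sketch of that approach (assuming a bit[1536] column; hamming is the default distance for bit vectors in vec0, and the table/parameter names here are just illustrative):

create virtual table vec_items_bits using vec0(
  title_embedding_bits bit[1536]
);

insert into vec_items_bits(rowid, title_embedding_bits)
  values (:rowid, vec_quantize_binary(:embedding));

-- KNN over the bit vectors; distance here is hamming distance
select rowid, distance
from vec_items_bits
where title_embedding_bits match vec_quantize_binary(:query)
  and k = 20;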

I'll be adding guides for both of these options to the docs!