asg017 / sqlite-vss

A SQLite extension for efficient vector search, based on Faiss!
MIT License
1.59k stars 59 forks source link

in the vss virtual table - are the embeddings stored as binary? or are they converted back to the initial vectors? #17

Open teowave opened 1 year ago

teowave commented 1 year ago

in the vss virtual table - are the embeddings stored as binary? or are they converted back to the initial vectors?

or is it the conversion done at the moment of search? by the search function?

asg017 commented 1 year ago

The embeddings are stored inside a Faiss index in a shadow table inside the database.

It's not directly supported (and this will break in the future), but if you wanted to extract the serialized faiss index from indexed vss0 columns, you could do it like so:

.load ./dist/debug/vector0
.load ./dist/debug/vss0

.header on
.mode box

create virtual table example using vss0(
  a(2),
  b(3)
);

insert into example(rowid, a, b)
  select
    key as rowid,
    value ->> 0 as a,
    value ->> 1 as b
  from json_each('
    [
      [[0.0, 0.0], [0.0, 0.0, 0.0]],
      [[1.0, 1.0], [1.0, 1.0, 1.0]],
      [[2.0, 2.0], [2.0, 2.0, 2.0]],
      [[3.0, 3.0], [3.0, 3.0, 3.0]]
    ]
  ');

select 
  rowid, 
  writefile(format("%d.faissindex", rowid), idx)
from example_index;

You'll end up with 0.faissindex and 1.faissindex as files, where the first are the 2-dimensional vectors for the a column in the faiss index format, and the second is the 3-dimensional vectors for the b column in the faiss format.

The faiss indexes are created the moment data is insert into a vss0 virtual table. It's serialized and stored when the transaction in committed. It's stored in a shadow table, which is the example_index table that's referenced above.

So I guess as a simmary:

The faiss project is good to reference to understand how sqlite-vss works under-the-hood

https://github.com/facebookresearch/faiss

https://github.com/facebookresearch/faiss/wiki/Index-IO,-cloning-and-hyper-parameter-tuning

teowave commented 1 year ago

Thank you for the explanation, Alex - very clear and extensive, I got it now. I have read about faiss as well, it looks like if I use a flat index with L2 or cosine search there should be no loss of precision, which was my worry. I will to some tests comparing with my current search solution using dataframes and cosine similarity search, I will see if there is any loss of precision. Will post the results here. All the best

asg017 commented 1 year ago

Looking forward to it! To be clear, by default sqlite-vss uses a "Flat,IDMap2" Faiss index, as described in the index factory docs. Which will result in an exhaustive yet 100% correct search. You can pass custom factory strings in, which may result in faster but more inaccurate results

create virtual table vss_ivf_articles using vss0(
  headline_embedding(384) factory="IVF4096,Flat,IDMap2",
  description_embedding(384) factory="IVF4096,Flat,IDMap2"
);

Though depending on your use-case, a little bit of error may be totally fine.