Open teowave opened 1 year ago
The embeddings are stored inside a Faiss index in a shadow table inside the database.
It's not directly supported (and this will break in the future), but if you wanted to extract the serialized faiss index from indexed vss0 columns, you could do it like so:
.load ./dist/debug/vector0
.load ./dist/debug/vss0
.header on
.mode box
create virtual table example using vss0(
a(2),
b(3)
);
insert into example(rowid, a, b)
select
key as rowid,
value ->> 0 as a,
value ->> 1 as b
from json_each('
[
[[0.0, 0.0], [0.0, 0.0, 0.0]],
[[1.0, 1.0], [1.0, 1.0, 1.0]],
[[2.0, 2.0], [2.0, 2.0, 2.0]],
[[3.0, 3.0], [3.0, 3.0, 3.0]]
]
');
select
rowid,
writefile(format("%d.faissindex", rowid), idx)
from example_index;
You'll end up with 0.faissindex
and 1.faissindex
as files, where the first are the 2-dimensional vectors for the a
column in the faiss index format, and the second is the 3-dimensional vectors for the b
column in the faiss format.
The faiss indexes are created the moment data is insert into a vss0
virtual table. It's serialized and stored when the transaction in committed. It's stored in a shadow table, which is the example_index
table that's referenced above.
So I guess as a simmary:
The faiss project is good to reference to understand how sqlite-vss works under-the-hood
https://github.com/facebookresearch/faiss
https://github.com/facebookresearch/faiss/wiki/Index-IO,-cloning-and-hyper-parameter-tuning
Thank you for the explanation, Alex - very clear and extensive, I got it now. I have read about faiss as well, it looks like if I use a flat index with L2 or cosine search there should be no loss of precision, which was my worry. I will to some tests comparing with my current search solution using dataframes and cosine similarity search, I will see if there is any loss of precision. Will post the results here. All the best
Looking forward to it! To be clear, by default sqlite-vss
uses a "Flat,IDMap2"
Faiss index, as described in the index factory docs. Which will result in an exhaustive yet 100% correct search. You can pass custom factory strings in, which may result in faster but more inaccurate results
create virtual table vss_ivf_articles using vss0(
headline_embedding(384) factory="IVF4096,Flat,IDMap2",
description_embedding(384) factory="IVF4096,Flat,IDMap2"
);
Though depending on your use-case, a little bit of error may be totally fine.
in the vss virtual table - are the embeddings stored as binary? or are they converted back to the initial vectors?
or is it the conversion done at the moment of search? by the search function?