asg017 / sqlite-vss

A SQLite extension for efficient vector search, based on Faiss!
MIT License

IndexBinaryFlat support? #124

Open mqudsi opened 4 months ago

mqudsi commented 4 months ago

Thanks for this library. I'm experimenting with it to see whether it could replace the myriad user-defined SQL functions we currently use to perform k-NN search on binary features, and I have a question about using binary hashes in place of floating-point features/embeddings.

As far as I can tell, FAISS supports IndexBinaryFlat via the factory string BFlat, along with various B-prefixed versions of the other index strings, but binary indexes derive from a completely separate base class (IndexBinary rather than Index) and are not built by the regular index factory. Indeed, trying to use the following:

CREATE VIRTUAL TABLE IF NOT EXISTS "vss_files" using vss0 (
    embedding(144) factory="BFlat,IDMap2",
);

throws an exception:

Error building index factory for embedding: Error in std::unique_ptr<faiss::Index> faiss::{anonymous}::index_factory_sub(int, std::string, faiss::MetricType) at /home/runner/work/sqlite-vss/sqlite-vss/vendor/faiss/faiss/index_factory.cpp:877: could not parse index string BFlat

(As I understand it, IDMap2 has been implemented for IndexBinaryFlat since 2019.)

The only workaround I can think of is to treat the binary hash as a densely packed bitwise representation of a one-hot-encoded embedding and expand each bit into a 1.0 or 0.0 float, so that an n-byte binary vector turns into an n*8*2-byte fp16 embedding. We could then either insert that directly, at a huge storage and compute premium, or compress its features (ProductQuantizer?) into a smaller embedding, which reduces storage at the cost of more compute and lower accuracy.
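For what it's worth, expanding each bit into a 1.0 or 0.0 float should preserve the ranking: when every component is 0.0 or 1.0, the squared L2 distance between two expanded vectors equals the Hamming distance between the packed hashes. Below is a minimal sketch in plain Python. The helper names are hypothetical (they are not part of sqlite-vss or Faiss), and the most-significant-bit-first ordering and the 2-byte fp16 width are assumptions made only for illustration.

```python
from typing import List


def unpack_bits(packed: bytes) -> List[float]:
    """Expand each byte into 8 floats (1.0 or 0.0), most significant bit first."""
    return [
        float((byte >> shift) & 1)
        for byte in packed
        for shift in range(7, -1, -1)
    ]


def hamming(a: bytes, b: bytes) -> int:
    """Count the differing bits between two equal-length packed hashes."""
    if len(a) != len(b):
        raise ValueError("hashes must have the same length")
    return sum(bin(x ^ y).count("1") for x, y in zip(a, b))


def squared_l2(u: List[float], v: List[float]) -> float:
    """Squared Euclidean distance between two equal-length float vectors."""
    return sum((x - y) ** 2 for x, y in zip(u, v))


# Two example 3-byte (24-bit) hashes.
a = bytes.fromhex("ff00aa")
b = bytes.fromhex("0f00ab")

# Each differing bit contributes (1.0 - 0.0)^2 = 1 to the squared L2 distance,
# so both measures agree.
assert squared_l2(unpack_bits(a), unpack_bits(b)) == hamming(a, b)

# Storage cost for a 144-bit hash, assuming 2 bytes per fp16 component.
packed_hash = bytes(18)                          # 18 bytes * 8 = 144 bits
fp16_bytes = len(unpack_bits(packed_hash)) * 2   # n * 8 * 2 with n = 18
print(len(packed_hash), fp16_bytes)
```

So a flat L2 index over the expanded vectors would return the same neighbours as a Hamming search, and the only cost is the 16x storage blow-up (more if the index stores wider floats).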

Ideally, we would be able to use bfactory= instead of factory= to create a binary index. Alternatively, could factory= introspect its payload for BFlat and create a binary index instead of a regular one?