Thanks for this library. I'm playing around with it to see whether it can replace the myriad user-defined SQL functions we currently use to perform kNN search on binary features, and I have a question about using binary hashes in place of floating-point features/embeddings.
As far as I can tell, FAISS supports IndexBinaryFlat via the factory string BFlat, along with various B-prefixed versions of the other index strings, but binary indexes derive from a completely separate base class and are not handled by the regular index factory. Indeed, trying to use the following:
CREATE VIRTUAL TABLE IF NOT EXISTS "vss_files" using vss0 (
embedding(144) factory="BFlat,IDMap2",
);
throws an exception:
Error building index factory for embedding: Error in std::unique_ptr<faiss::Index> faiss::{anonymous}::index_factory_sub(int, std::string, faiss::MetricType) at /home/runner/work/sqlite-vss/sqlite-vss/vendor/faiss/faiss/index_factory.cpp:877: could not parse index string BFlat
(IDMap2 has, as I understand it, been implemented for IndexBinaryFlat since 2019.)
The only workaround I can think of is to treat the binary hash as a densely packed bit vector and expand it so that each bit becomes its own float dimension, inserting 1.0 or 0.0 for each bit (so an n-byte binary vector turns into an n*8*2-byte fp16 embedding). I could then either insert that directly, at a huge storage and compute premium, or compress its features (ProductQuantizer?) into a smaller embedding, which reduces storage but increases compute and costs performance/accuracy.
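To make the workaround concrete, here is a minimal sketch in plain Python. The helper names (unpack_bits, hamming, squared_l2, expanded_size) are my own for illustration and are not part of sqlite-vss or FAISS, and the two 144-bit hashes are arbitrary values. It shows that once each bit is expanded to a 1.0/0.0 float, the squared L2 distance between two expanded vectors equals the Hamming distance between the original hashes, and what the expansion costs in bytes.

```python
def unpack_bits(hash_bytes):
    """Expand each bit (most significant bit first) into a 1.0 or 0.0 float."""
    return [
        float((byte >> shift) & 1)
        for byte in hash_bytes
        for shift in range(7, -1, -1)
    ]


def hamming(a, b):
    """Hamming distance between two equal-length byte strings."""
    return sum(bin(x ^ y).count("1") for x, y in zip(a, b))


def squared_l2(u, v):
    """Squared Euclidean distance between two equal-length float vectors."""
    return sum((x - y) ** 2 for x, y in zip(u, v))


def expanded_size(n_bytes, bytes_per_float):
    """Bytes needed once every bit of an n-byte hash becomes one float."""
    return n_bytes * 8 * bytes_per_float


# Two arbitrary 144-bit (18-byte) hashes, matching embedding(144) above.
a = bytes(range(18))
b = bytes(reversed(range(18)))

fa, fb = unpack_bits(a), unpack_bits(b)

# Each differing bit contributes (1.0 - 0.0)^2 = 1 to the squared L2 distance,
# so the two measures agree exactly.
assert len(fa) == 144
assert squared_l2(fa, fb) == hamming(a, b)

print("packed bytes:  ", len(a))
print("fp16 expansion:", expanded_size(len(a), 2))
print("fp32 expansion:", expanded_size(len(a), 4))
```

Because the square root is monotonic, ranking by L2 distance on the expanded vectors should give the same neighbour order as ranking by Hamming distance on the packed hashes (ties aside), so the results would be right. The cost is the size: 16x the packed hash at fp16, or 32x if the index stores float32.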
Ideally, we would be able to use bfactory= instead of factory= to create a binary index, or factory= would inspect its payload for BFlat and create a binary index instead of a regular one. Would either of these be possible?