asg017 / sqlite-vss

A SQLite extension for efficient vector search, based on Faiss!

Different vector storage for write-heavy workflows #30

Open asg017 opened 1 year ago

asg017 commented 1 year ago

With the current "faiss in a shadow table" storage strategy, there's one drawback: every time data in a vss0 table is inserted or updated and a COMMIT happens, the entire Faiss index has to be written to the DB. So if your index has 1 million vectors and you insert 1 more, then 1,000,001 vectors have to be written to the DB.

This is a limitation of Faiss's IO tools: I believe only whole-index writes are supported (unless you mmap, which isn't possible with in-DB storage).
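
A minimal sketch of the limitation, assuming Faiss's C++ serialization API (write_index() with a VectorIOWriter): there is no append-style write, so persisting the index after a single insert still serializes every vector it contains.

#include <faiss/IndexFlat.h>
#include <faiss/impl/io.h>
#include <faiss/index_io.h>
#include <vector>

int main() {
  faiss::IndexFlatL2 index(768);     // 768-dimensional vectors
  std::vector<float> vec(768, 0.5f);
  index.add(1, vec.data());          // add a single vector

  // write_index() serializes the *entire* index; there is no API to
  // append just the newly added vectors.
  faiss::VectorIOWriter writer;
  faiss::write_index(&index, &writer);
  // writer.data now holds the full serialized index -- this is what
  // gets copied into the shadow table on every COMMIT.
}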

Proposal: A non-Faiss index storage option

We should create our own bespoke index storage option that allows incremental, streaming writes without rewriting the entire index. This would live outside of Faiss, but could still use some of Faiss's C++ API for querying.

For example, we could have:

create virtual table vss_demo using vss0(
  index_storage="incremental",
  chunk_embeddings(768)
);

insert into vss_demo(rowid, chunk_embeddings)
  select rowid, chunk_embeddings from demo;

-- inserting 1 row should be fast and not re-write the entire index
insert into vss_demo(rowid, chunk_embeddings)
  values (1234, :embedding);

The index_storage="incremental" option is the key: it signals that, instead of the default faiss-shadow-table index storage strategy, the vss_demo table should store vectors in this new "incremental" storage format that isn't Faiss-based.

We can still use Faiss for the actual KNN and range searches, through knn_L2sqr_by_idx() and range_search_L2sqr(), but we'll need to design our own SQLite-based vector storage solution.
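
For illustration, here's a hedged sketch of the per-chunk brute-force step, using the flat-array knn_L2sqr() overload from recent Faiss versions (faiss/utils/distances.h) rather than the _by_idx variant; the exact signature differs between Faiss releases, and mapping chunk-local indexes back to rowids is assumed to happen in the caller.

#include <faiss/utils/distances.h>
#include <cstdint>
#include <vector>

// Find the k nearest neighbors of a single query vector within one
// chunk of n raw float vectors of dimension d, returning L2-squared
// distances and chunk-local indexes.
void search_chunk(const float* query, const float* chunk,
                  size_t d, size_t n, size_t k,
                  std::vector<float>& distances,
                  std::vector<int64_t>& labels) {
  distances.resize(k);
  labels.resize(k);
  // nx = 1 query against ny = n database vectors; labels are positions
  // within the chunk, which we'd translate to rowids via the `rowids` blob.
  faiss::knn_L2sqr(query, chunk, d, /*nx=*/1, /*ny=*/n, k,
                   distances.data(), labels.data());
}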

Design notes

This is a WIP and will likely change.

create table vss_demo_config(
  chunk_size integer
);
create table vss_demo_streaming_chunks(
  -- blob of size `chunk_size * sizeof(int64)`, storing the rowid values of the corresponding vectors in `vectors`
  rowids blob,
  -- bitmap of size `chunk_size / 8` that tracks if the i-th vector is valid/not-deleted
  valid blob,
  -- blob of size `chunk_size * dimensions * sizeof(float)`, containing the raw float vectors
  vectors blob
);

At creation time, an all-zero chunk of size chunk_size is inserted into vss_demo_streaming_chunks. At insert time, each vector and its rowid are written into the vectors and rowids columns using SQLite's incremental BLOB I/O API, and the i-th bit in the valid bitmap is set. At query time, each chunk's vectors and rowids columns are read into memory, knn_L2sqr_by_idx() is called to find the K nearest vectors in that chunk, the chunk's buffers are freed, and the next chunk is processed. At the end, we re-sort the combined per-chunk results to get the true K nearest vectors.
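
To make the insert path concrete, here's a hedged sketch using SQLite's incremental BLOB I/O API (sqlite3_blob_open()/sqlite3_blob_write()), which patches bytes in place instead of rewriting the whole blob. The table and column names follow the draft schema above; chunk_rowid and slot (the chunk-local slot index) are hypothetical bookkeeping values not covered here.

#include <sqlite3.h>
#include <cstdint>

static int write_vector_into_chunk(
    sqlite3* db,
    sqlite3_int64 chunk_rowid,  // hypothetical: rowid of the open chunk
    int slot,                   // hypothetical: i-th slot in that chunk
    sqlite3_int64 rowid,
    const float* vec,
    int dimensions) {
  sqlite3_blob* blob = nullptr;
  int vec_bytes = dimensions * (int)sizeof(float);

  // 1. Write the raw floats at offset slot * dimensions * sizeof(float).
  int rc = sqlite3_blob_open(db, "main", "vss_demo_streaming_chunks",
                             "vectors", chunk_rowid, /*write=*/1, &blob);
  if (rc != SQLITE_OK) return rc;
  rc = sqlite3_blob_write(blob, vec, vec_bytes, slot * vec_bytes);
  sqlite3_blob_close(blob);
  if (rc != SQLITE_OK) return rc;

  // 2. Record the rowid at offset slot * sizeof(int64).
  rc = sqlite3_blob_open(db, "main", "vss_demo_streaming_chunks",
                         "rowids", chunk_rowid, 1, &blob);
  if (rc != SQLITE_OK) return rc;
  rc = sqlite3_blob_write(blob, &rowid, (int)sizeof(sqlite3_int64),
                          slot * (int)sizeof(sqlite3_int64));
  sqlite3_blob_close(blob);
  if (rc != SQLITE_OK) return rc;

  // 3. Set the slot's bit in the `valid` bitmap: read the byte, OR in
  // the bit, write it back.
  rc = sqlite3_blob_open(db, "main", "vss_demo_streaming_chunks",
                         "valid", chunk_rowid, 1, &blob);
  if (rc != SQLITE_OK) return rc;
  uint8_t byte = 0;
  rc = sqlite3_blob_read(blob, &byte, 1, slot / 8);
  if (rc == SQLITE_OK) {
    byte |= (uint8_t)(1 << (slot % 8));
    rc = sqlite3_blob_write(blob, &byte, 1, slot / 8);
  }
  sqlite3_blob_close(blob);
  return rc;
}

Reads at query time would use sqlite3_blob_read() over the same handles, pulling a whole chunk's vectors blob into memory at once.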

There could also be a new operation="optimize" option that re-arranges these chunks for delete-heavy workflows, to save space.

insert into vss_demo(operation) values ('optimize');

Disadvantages