asg017 / sqlite-vss

A SQLite extension for efficient vector search, based on Faiss!
MIT License
1.59k stars · 59 forks

Handle larger Indices (current limit 1GB) #1

Open asg017 opened 1 year ago

asg017 commented 1 year ago

The vector indices that support the vss0 virtual table are limited to 1GB. This is because they are stored as a BLOB in a single row in a shadow table, which has a limit of ~1GB.

Instead, we should store large FAISS indices across several rows, so they can (in theory) grow with infinite space. This will likely be complicated and require the SQLite BLOB I/O API and a custom faiss IOWriter.
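The multi-row idea can be sketched without Faiss at all: split the serialized index bytes across several rows keyed by a sequence number, then reassemble them in order on load. This is a minimal illustration in Python's stdlib `sqlite3` (the table name, chunk size, and helper names are made up for the example; a real fix would use the SQLite BLOB I/O API and a custom Faiss IOWriter as described above):

```python
import sqlite3

CHUNK = 1024 * 1024  # 1 MiB per row; a real implementation might use larger chunks

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE index_chunks (idx_id INTEGER, seq INTEGER, data BLOB)")

def write_index(db, idx_id, blob):
    # Split the serialized index across rows to dodge the ~1GB single-BLOB cap.
    for offset in range(0, len(blob), CHUNK):
        db.execute("INSERT INTO index_chunks VALUES (?, ?, ?)",
                   (idx_id, offset // CHUNK, blob[offset:offset + CHUNK]))

def read_index(db, idx_id):
    # Reassemble the chunks in sequence order.
    rows = db.execute("SELECT data FROM index_chunks WHERE idx_id = ? ORDER BY seq",
                      (idx_id,))
    return b"".join(r[0] for r in rows)

payload = bytes(range(256)) * 20_000  # ~5 MB stand-in for a serialized Faiss index
write_index(db, 1, payload)
assert read_index(db, 1) == payload
```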

chetanladdha commented 1 year ago

@asg017: Do you have any plans for this fix? Any timeline?

asg017 commented 1 year ago

No timeline! I haven't had time to work on core sqlite-vss features recently.

But if anyone reading this would like to sponsor this work, let me know!

siscia commented 1 year ago

Do we have a rough estimate of how many vectors can be indexed with 1GB?

asg017 commented 1 year ago

It depends on the dimensionality of your vectors, and whether you use any additional Faiss factory strings with factory="".

If you're using the default settings, the size of your index in bytes is roughly:

dimensions * 4 * number_vectors

(where 4 is sizeof(float))

With some additional overhead (a BTree that maps rowid's to each vector, some Faiss-specific storage stuff, etc.).
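That estimate is easy to check as code. A tiny sketch (the function name is made up for illustration):

```python
def estimated_index_bytes(dimensions, num_vectors, float_size=4):
    """Rough size of a default (Flat) index: raw vector storage only,
    excluding the rowid-mapping BTree and Faiss bookkeeping overhead."""
    return dimensions * float_size * num_vectors

# With ~1GB of raw vector storage, 384-dim vectors allow roughly:
budget = 1_000_000_000
print(budget // (384 * 4))  # → 651041 vectors
```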

In the "headlines" example in this blog post, there are 209,527 vectors with 384 dimensions (using sentence transformers), and it takes up roughly 323.5MB of space.

You can also use the factory= option to change how Faiss stores and queries your vectors. There are a ton of options, but "dimension reduction" techniques could help you lower the amount of storage your index takes, to stay under the 1GB limit.

I don't have too many examples of this, but it was discussed at length in this issue. The general gist is to use the PCA directive in a custom factory string:

create virtual table vss_pca using vss0(a(1536) factory="PCA384,Flat,IDMap2");

Then train your index:

insert into vss_pca(operation, a)
  select 'training', value from vectors;

Then insert your vector data:

insert into vss_pca(rowid, a)
  select key, value from vectors;

This example reduces 1536-dimension vectors to 384 dimensions. The storage savings get better with larger datasets; in my quick test with 100,000 vectors, the reduced index was 25% the size of the original full-length index.

This approach does reduce the accuracy of KNN-style searches, however, so use caution.
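The storage arithmetic behind the PCA directive can be seen in a plain NumPy sketch (sqlite-vss/Faiss do the training and projection internally via `PCA384`; this just demonstrates the dimension-reduction idea on synthetic data):

```python
import numpy as np

rng = np.random.default_rng(0)
vectors = rng.standard_normal((500, 1536)).astype(np.float32)

# PCA via SVD on mean-centered data: keep the top 384 principal components.
centered = vectors - vectors.mean(axis=0)
_, _, vt = np.linalg.svd(centered, full_matrices=False)
components = vt[:384]              # (384, 1536) projection matrix
reduced = centered @ components.T  # (500, 384) reduced vectors

# Vector storage drops to exactly 384/1536 = 25% of the original.
print(reduced.nbytes / vectors.nbytes)  # → 0.25
```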

baughmann commented 6 months ago

> No timeline! I haven't had time to work on core sqlite-vss features recently.
>
> But if anyone reading this would like to sponsor this work, let me know!

I did just that

asg017 commented 6 months ago

Hey @baughmann , thanks for the donation! I'm going to give this a shot over the holiday.

To be clear, though, this will only bump the 1GB limit that sqlite-vss has for indices to (hopefully) an arbitrary size. This will be done by splitting the Faiss index into multiple chunks stored across multiple rows in shadow tables, instead of the current "store one big BLOB in a single row" approach.

However, the following shortcomings will still exist:

I've been thinking a lot about these shortcomings. There is #30 where I lay out a possible workaround, but that's separate from this issue. I'll also probably chew a bit on this over my break, but will definitely try to lift this 1GB limit in the following weeks!

cduk commented 3 months ago

Just wondering if there were any developments on this front. I guess my use case is a simple one: I have around 3-4 million embeddings to index, which is >1GB; however, as the data is static, I don't have the update concerns you outline in #30.

slavakurilyak commented 3 months ago

+1 for solving this problem, as it would unlock vector use cases and possibly enable integration with @tursodatabase, which scales SQLite in production workloads.

baughmann commented 2 months ago

For everyone waiting on this, it may be better to just create an implementation using the Repository pattern, where the repository instance maintains a FAISS index alongside the SQLite database. On startup, the repository loads a BLOB column of vectors into the index. You can use IndexIDMap to track which documents were returned by your similarity search, by linking that ID to an auto-incrementing integer column on the SQLite table.
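A minimal sketch of that repository pattern, with a brute-force NumPy search standing in for Faiss so the example has no extra dependencies (a real implementation would swap in `faiss.IndexIDMap(faiss.IndexFlatL2(dim))`; the class and schema here are hypothetical):

```python
import sqlite3
import numpy as np

class VectorRepository:
    """Keeps an in-memory vector index alongside a SQLite table."""

    def __init__(self, db, dim):
        self.db, self.dim = db, dim
        db.execute("CREATE TABLE IF NOT EXISTS docs "
                   "(id INTEGER PRIMARY KEY AUTOINCREMENT, body TEXT, vec BLOB)")
        self.ids = []
        self.matrix = np.empty((0, dim), dtype=np.float32)
        # On startup, pull every stored BLOB back into the in-memory index.
        for doc_id, blob in db.execute("SELECT id, vec FROM docs"):
            self._index(doc_id, np.frombuffer(blob, dtype=np.float32))

    def _index(self, doc_id, vec):
        self.ids.append(doc_id)
        self.matrix = np.vstack([self.matrix, vec[np.newaxis, :]])

    def add(self, body, vec):
        vec = np.asarray(vec, dtype=np.float32)
        cur = self.db.execute("INSERT INTO docs (body, vec) VALUES (?, ?)",
                              (body, vec.tobytes()))
        self._index(cur.lastrowid, vec)  # SQLite id doubles as the index id

    def search(self, query, k=1):
        # Brute-force L2 distance; nearest ids first.
        dists = np.linalg.norm(self.matrix - np.asarray(query, np.float32), axis=1)
        return [self.ids[i] for i in np.argsort(dists)[:k]]

repo = VectorRepository(sqlite3.connect(":memory:"), dim=3)
repo.add("red", [1.0, 0.0, 0.0])
repo.add("blue", [0.0, 0.0, 1.0])
print(repo.search([0.9, 0.1, 0.0]))  # → [1]
```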

I implemented one of these myself. It will probably end up being faster and more scalable than anything that can be done in a native SQLite plugin.

cchance27 commented 2 months ago

Any chance you could share an example implementation of that?

baughmann commented 2 months ago

I was trying to think of a way to make it into a library, but because people's metadata structures are so different, it would be difficult to make it generic enough.

However, I believe I have a notebook that just uses Pandas/PyTorch/Numpy/SQLite that I could minimize and throw up in a gist or something. I can try to do the same for the FAISS/Numpy/SQLite implementation.

I'm out of the country for a few weeks but will try to get around to it this week if I can.

That said, it's actually not difficult at all. If you can't wait on me, look up some FAISS/PyTorch similarity search tutorials. The key difference is that you need to serialize/deserialize the index members and save them to a BLOB column in SQLite.

The downside of the FAISS implementation is that metadata filtering needs to be done after the similarity search, since you can't filter a FAISS index by anything other than its vector content. The complication with the PyTorch/Pandas/Numpy implementation is that you need to copy the entire active data frame if you want to filter by metadata before the similarity search.

Neither implementation is perfect, but either will probably get you pretty far with some basic optimizations.
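The serialize/deserialize step mentioned above can be shown as a round-trip through a single BLOB column. This uses a raw NumPy matrix as a stand-in for the index payload; with Faiss installed you would use `faiss.serialize_index(index)` and `faiss.deserialize_index(blob)` instead:

```python
import sqlite3
import numpy as np

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE idx (id INTEGER PRIMARY KEY, blob BLOB)")

# Persist the "index" (here, just the vector matrix) as one BLOB.
vectors = np.arange(12, dtype=np.float32).reshape(3, 4)
db.execute("INSERT INTO idx (id, blob) VALUES (1, ?)", (vectors.tobytes(),))

# Load it back on startup and reconstruct the matrix.
(blob,) = db.execute("SELECT blob FROM idx WHERE id = 1").fetchone()
restored = np.frombuffer(blob, dtype=np.float32).reshape(3, 4)
assert np.array_equal(restored, vectors)
```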
