asg017 / sqlite-vss

A SQLite extension for efficient vector search, based on Faiss!

Vector transformation functionality (to reduce size of index) #62

Closed: polterguy closed this issue 1 year ago

polterguy commented 1 year ago

I've looked a little at Faiss, and I realise one of its really cool features is the ability to "pack" vectors, drastically reducing the size of the index. For instance, OpenAI returns 1,536-dimensional embeddings. For a lot of domains, packing these down to 96 dimensions would probably give more than enough accuracy. Faiss supports this, and SQLite-VSS's storage layer supports it too, through its "index factory" construct.

However, once the index is queried later down the road, there is no mechanism to take the incoming query vector and reduce it to the size the index requires. I suspect something needs to be applied inside vssIndexFilter, or one of the init functions, that can take a 1,536-dimensional query vector and reduce it to, for instance, a 96-dimensional vector before querying the index.

I would love to see this feature in SQLite-VSS, and I'm ready to build it myself too, though I'd need some guidance on exactly how to do it, since I suspect you know more about both SQLite and Faiss than I do ...
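To make the idea concrete, this is roughly the kind of declaration I'm imagining, reusing the existing index factory option (hypothetical table and column names, untested):

create virtual table vss_demo using vss0(
  embedding(1536) factory="PCA96,Flat,IDMap2"
);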

asg017 commented 1 year ago

I believe Faiss will handle the vector transformations/reductions for you at query time.

Here's an example with 1 million randomly generated 4-dimensional vectors reduced to two dimensions. The vss_default table uses the default flat index, and vss_pca uses a PCA index factory string:

.load dist/debug/vector0
.load dist/debug/vss0
.mode box
.header on
.bail on

-- 1 million 4-dimensional random vectors from 0 to 1
create table vectors as
select 
  value as key,
  json_array(
    abs(random() % 1e6) / 1e6,
    abs(random() % 1e6) / 1e6,
    abs(random() % 1e6) / 1e6,
    abs(random() % 1e6) / 1e6
  ) as value
from generate_series(1, 1e6);

-- uses default flat index
create virtual table vss_default using vss0(a(4));

-- PCA reduction to 2 dimensions
create virtual table vss_pca using vss0(a(4) factory="PCA2,Flat,IDMap2");

-- insert vectors into default table
insert into vss_default(rowid, a)
  select key, value from vectors;

-- train the PCA index using all the vectors, then insert data
insert into vss_pca(operation, a)
  select 'training', value from vectors;

insert into vss_pca(rowid, a)
  select key, value from vectors;

-- KNN search on default index
select rowid, distance, vector_debug(a)
from vss_default
where vss_search(a, vss_search_params('[2, 2, 2, 2]', 5));

-- KNN search on PCA-backed index
select rowid, distance, vector_debug(a)
from vss_pca
where vss_search(a, vss_search_params('[2, 2, 2, 2]', 5));

-- size of default index
select sum(length(idx)) as default_size from vss_default_index;
-- size of PCA index
select sum(length(idx)) as pca_size from vss_pca_index;

Results:

┌────────┬──────────────────┬──────────────────────────────────────────────────┐
│ rowid  │     distance     │                 vector_debug(a)                  │
├────────┼──────────────────┼──────────────────────────────────────────────────┤
│ 506292 │ 4.08295059204102 │ size: 4 [0.994942, 0.974313, 0.993881, 0.995759] │
│ 754518 │ 4.10719203948975 │ size: 4 [0.979396, 0.974073, 0.996634, 0.996860] │
│ 348883 │ 4.1143856048584  │ size: 4 [0.988719, 0.990135, 0.968188, 0.996390] │
│ 70772  │ 4.14767742156982 │ size: 4 [0.997696, 0.946667, 0.994309, 0.988991] │
│ 532113 │ 4.16213321685791 │ size: 4 [0.952292, 0.990048, 0.999301, 0.978707] │
└────────┴──────────────────┴──────────────────────────────────────────────────┘
┌────────┬──────────────────┬──────────────────────────────────────────────────┐
│ rowid  │     distance     │                 vector_debug(a)                  │
├────────┼──────────────────┼──────────────────────────────────────────────────┤
│ 506292 │ 3.6440327167511  │ size: 4 [1.386952, 0.568959, 0.783340, 0.813696] │
│ 70772  │ 3.6797776222229  │ size: 4 [1.377389, 0.557415, 0.792283, 0.808467] │
│ 754518 │ 3.68229603767395 │ size: 4 [1.378701, 0.563132, 0.783279, 0.808978] │
│ 348883 │ 3.68287658691406 │ size: 4 [1.384496, 0.581698, 0.756082, 0.811567] │
│ 432244 │ 3.71703171730042 │ size: 4 [1.371345, 0.558159, 0.782803, 0.804760] │
└────────┴──────────────────┴──────────────────────────────────────────────────┘
┌──────────────┐
│ default_size │
├──────────────┤
│ 24000090     │
└──────────────┘
┌──────────┐
│ pca_size │
├──────────┤
│ 16000334 │
└──────────┘

You can see the index for the PCA approach is 16MB, compared to 24MB for the default. The distances between the two also differ slightly: rowid=70772 ranks "closer" in the PCA approach, because it's not exact like the default flat index. The reconstructed vectors from vector_debug(a) are slightly different as well.
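For what it's worth, those numbers line up with a back-of-the-envelope estimate (my arithmetic, not verified against the Faiss serialization format): the flat side stores 4 float32 values (16 bytes) plus an 8-byte rowid per vector, so 1e6 × 24 bytes ≈ 24MB, while the PCA side stores only the 2 reduced float32 dimensions (8 bytes) plus the 8-byte rowid, so 1e6 × 16 bytes ≈ 16MB. The remaining few hundred bytes would be the serialized PCA matrix and index headers.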

So I don't think you'll need to do the transformations/dimensionality reduction yourself - but let me know if you have trouble trying it out!

polterguy commented 1 year ago

If you're right, that would be SUPER cool. What factory should I use to reduce OpenAI's 1,536 dimensions down to something "slightly more useful", preferably saving as much space as possible? Do you know? I suspect 10% of the size would be more than enough: 1,536 dimensions are sized for "the world", while we typically have no more than 500 to maybe 5,000 entries in total and need much less resolution.

If it ends up some 10 to 20 percent less accurate, that's secondary ...

polterguy commented 1 year ago

Just to verify: the PCAx value is a factor that is multiplied by the column's declared dimensions to give the original dimension, right? Implying that if you want to reduce 1,536 to 384, your index becomes:

embedding(384) factory="PCA4,Flat,IDMap2"

Right ...?

asg017 commented 1 year ago

The Faiss factory docs are here: https://github.com/facebookresearch/faiss/wiki/The-index-factory

I'm pretty sure you can just do:

embedding(1536) factory="PCA384,Flat,IDMap2"

to reduce 1,536 dimensions down to 384, but I haven't tried it.
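Spelled out as a full virtual table declaration (hypothetical table name):

create virtual table vss_embeddings using vss0(
  embedding(1536) factory="PCA384,Flat,IDMap2"
);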

polterguy commented 1 year ago

This didn't work; the only thing I could get working was a multiplication factor, for instance 4 on a 384-dimensional column, resulting in 1,536 when multiplied ... :/

asg017 commented 1 year ago

Here's another script, with 1536-dimensional vectors that get reduced on Faiss's side:

.load dist/debug/vector0
.load dist/debug/vss0
.mode box
.header on
.bail on

-- 100,000 1536-dimensional vectors
create table vectors as
with ids as (
  select value as key
  from generate_series(1, 1e5)
)
select 
  key,
  ((select json_group_array(value / key) from generate_series(1, 1536))) as value
from ids
group by 1;

-- uses default flat index
create virtual table vss_default using vss0(a(1536));

-- PCA reduction to 384 dimensions
create virtual table vss_pca using vss0(a(1536) factory="PCA384,Flat,IDMap2");

-- insert vectors into default table
insert into vss_default(rowid, a)
  select key, value from vectors;

-- step 1: train the PCA index using all the vectors
insert into vss_pca(operation, a)
  select 'training', value from vectors;

-- step 2: insert real vector data after training
insert into vss_pca(rowid, a)
  select key, value from vectors;

-- KNN search on default index
select rowid, distance
from vss_default
where vss_search(
  a, 
  vss_search_params(
    (select value from vectors where rowid = 100),
    5
  )
);

-- KNN search on PCA-backed index
select rowid, distance
from vss_pca
where vss_search(
  a, 
  vss_search_params(
    (select value from vectors where rowid = 100),
    5
  )
);

-- size of default index
select sum(length(idx)) as default_size from vss_default_index;
-- size of PCA index
select sum(length(idx)) as pca_size from vss_pca_index;

Results:

┌───────┬──────────┐
│ rowid │ distance │
├───────┼──────────┤
│ 100   │ 0.0      │
│ 99    │ 120.0    │
│ 101   │ 120.0    │
│ 98    │ 240.0    │
│ 102   │ 240.0    │
└───────┴──────────┘
┌───────┬──────────────────────┐
│ rowid │       distance       │
├───────┼──────────────────────┤
│ 100   │ 2.16991802304278e-09 │
│ 99    │ 118.995414733887     │
│ 101   │ 119.077178955078     │
│ 102   │ 238.929534912109     │
│ 98    │ 238.978912353516     │
└───────┴──────────────────────┘
┌──────────────┐
│ default_size │
├──────────────┤
│ 615200090    │
└──────────────┘
┌───────────┐
│ pca_size  │
├───────────┤
│ 166210502 │
└───────────┘

One note, from the Faiss index factory docs:

_"index = index_factory(128, "PCA80,Flat"): produces an index for 128D vectors that reduces them to 80D by PCA then does exhaustive search."_

So factory="PCA384,Flat,IDMap2" on a column declared with 1536 dimensions should work as expected.
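Translated to sqlite-vss terms, that docs example would correspond to something like this (hypothetical table name; I've appended IDMap2 the way the examples above do):

create virtual table vss_demo using vss0(a(128) factory="PCA80,Flat,IDMap2");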

polterguy commented 1 year ago

I am sorry if I am wrong, but I could never get anything like this to work myself:

create virtual table vss_pca using vss0(a(1536) factory="PCA384,Flat,IDMap2");

What I had to do was create a 384-dimensional index and then use a factor of 4 in my PCA index. But I'll revert my latest pull request, and we'll figure it out later.

polterguy commented 1 year ago

Here's the error I'm getting as I try this. I've tried it both on your original code and on my modification, to make sure there's nothing in my code causing it:

failed: PCA matrix cannot output 384 dimensions from 1536

This happens during training, where I do something like the following:

insert into vss_ml_training_snippets (operation, embedding_vss)
    select 'training', embedding_vss from ml_training_snippets where embedding_vss is not null;

Here, the ml_training_snippets table's embedding_vss column contains the raw vector BLOBs returned by OpenAI's embeddings API.

When I create the VSS table, I do it as follows:

create virtual table vss_ml_training_snippets using vss0(
  embedding_vss(1536) factory="PCA384,Flat,IDMap2"
);

polterguy commented 1 year ago

I debugged this more, and here is the code that throws in Faiss. It seems "odd": d_out becomes 384 with my vectors, and d_in becomes 1,536. Multiplying these gives 589,824, while the PCAMat.size() invocation returns 1,536 multiplied by the number of training vectors. I've got 78 in my debug run, so that becomes 119,808. (I modified the Faiss code and added printf statements to sanity-check things.)

Clearly 119,808 is less than 589,824, so the assert fails.

My theory:

  1. Faiss changed this logic at some point in a backwards-incompatible way: you're building against "old" Faiss code with the old logic, while I've got the new logic because I recursively cloned your repo, including submodules, and picked up the latest Faiss.
  2. There's a bug in Faiss.

I tried removing the FAISS_THROW_IF_NOT_FMT line in Faiss and recompiling, and it actually works - but clearly something is wrong, since the next line of Faiss code resizes my vector to 589,824.

The way you could test this is to clone your own repo and run through the entire setup process yourself, at which point you'll download the latest Faiss code; then run your own script from further up in our discussion, and you'll see the error for yourself.

I suspect the way it's working now is as I claimed further up, which is as follows:

  1. You've got 1,536-dimensional input vectors.
  2. You want to reduce these by a "factor" of 4, so you apply PCA4.
  3. 1,536 divided by 4 is 384.
  4. Your table's declared number of dimensions should therefore be 384.

The result being that your virtual table declaration would need to change to the following:

embedding(384) factory="PCA4,Flat,IDMap2"

I will post an issue at Faiss to check my assumption, linking to this issue, so they can sanity-check my logic and/or verify there's no bug in their own code.

However, if I am right and the dimension on the shadow table should be 384, then the "sanity check" between the table's declared dimensions and the input vector needs to be removed, since the table definition will, in the above example, always be only 25% of the size of the input vector.

As a side note, I suggest you change how you retrieve the Faiss code so that it fetches a stable tag/release instead of the latest commit.

polterguy commented 1 year ago

I've asked the Faiss developers.

polterguy commented 1 year ago

OK, so I am a little bit embarrassed now. It turns out the check in Faiss verifies a minimum size of your training data: for all practical purposes, you need at least as many training vectors as the output dimension of your Faiss index. That's why it didn't work when training with 78 vectors; I needed at least 384.
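That also squares with the numbers from my earlier debugging: 78 training vectors give PCAMat.size() = 78 × 1,536 = 119,808, which falls short of the required 384 × 1,536 = 589,824, while 384 or more vectors satisfy the check.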

Sorry - I'm still learning :D

polterguy commented 1 year ago

For future reference, and for others making the same mistake as me: if you reduce 1,536 to 384, then this works:

insert into vss_ml_training_snippets (operation, embedding_vss)
    select 'training', embedding_vss from ml_training_snippets where embedding_vss is not null limit 385;

But this does not work:

insert into vss_ml_training_snippets (operation, embedding_vss)
    select 'training', embedding_vss from ml_training_snippets where embedding_vss is not null limit 383;

The reason is that Faiss's training logic requires at least as many training vectors as the output dimension of your index ...
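A defensive variant of the training insert (an untested sketch, reusing my table names) would only feed vectors to the training operation once enough of them exist:

-- hypothetical guard: skip training until there are at least as many
-- vectors as the reduced dimension (384 here)
insert into vss_ml_training_snippets (operation, embedding_vss)
  select 'training', embedding_vss
  from ml_training_snippets
  where embedding_vss is not null
    and (select count(*) from ml_training_snippets
         where embedding_vss is not null) >= 384;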