criteo / autofaiss

Automatically create Faiss knn indices with the most optimal similarity search parameters.
https://criteo.github.io/autofaiss/
Apache License 2.0

augmenting embeddings with k labels to help segment searches #160

Open 796F opened 1 year ago

796F commented 1 year ago

I have an application with 30-40M CLIP embeddings that I am searching over.
These images were taken from K different sources (Facebook, Instagram, etc.).

I'm hoping to train an embedding so that I can search all sources, or just 1 or 2 specific sources.

Is there a recommended way to configure faiss / autofaiss for this use case? If not, is there a recommended method (a link or paper would be great)? Would it work if I encoded these K labels into a small vector and prepended it to the embeddings? For example: Facebook = [0 0 0 ... 0] + [768-dim CLIP], Instagram = [1 0 0 ... 0] + [768-dim CLIP], etc.
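
For concreteness, here is a minimal numpy sketch of that label-prepending idea (the dimensions, source count, and random data are only placeholders, not anything from autofaiss):

```python
import numpy as np

K = 3      # number of sources (Facebook, Instagram, ...)
d = 768    # CLIP embedding dimension
n = 1000   # toy corpus size

clip_embeddings = np.random.rand(n, d).astype("float32")
source_ids = np.random.randint(0, K, size=n)   # which source each image came from

# Prepend a one-hot source label to every CLIP embedding -> shape (n, K + d)
one_hot = np.zeros((n, K), dtype="float32")
one_hot[np.arange(n), source_ids] = 1.0
augmented = np.concatenate([one_hot, clip_embeddings], axis=1)

# At query time, put the target source(s) in the label part of the query vector.
# Note this only biases the distance towards that source; it is not a hard filter.
query_clip = np.random.rand(1, d).astype("float32")
query_label = np.zeros((1, K), dtype="float32")
query_label[0, 1] = 1.0        # e.g. search "source 1" only
query = np.concatenate([query_label, query_clip], axis=1)
```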

victor-paltz commented 1 year ago

Hello! You have several options; it depends on your RAM constraints and on whether you want something simple but slow, fast but complex, something modular, etc.

Here are some options:

  1. Build one index per source, call the KNN of the sources you need, and merge the top-K results of each source to get your final top K (see the option-1 sketch after this list). Autofaiss can build your KNN indices in parallel.
  2. Build only one index and use the faiss>=1.7.3 feature to search on a specific subset of elements (I would not recommend it, as it will be slow). You can also query 10*K embeddings and do post-filtering (see the option-2 sketch after this list).
  3. Build N(N+1)/2 different KNN indices, one for each source and each pair of sources. It will be the fastest and best-performing solution, but it will probably use too much RAM.
  4. Build N indices using the same index_key (you need to use IVF-like indices). You can use autofaiss to build an initial index with a sample of your embeddings, and use autofaiss functions to fill N copies of that index with your N sources. At that point you will have one KNN index per source, but you will also be able to merge your indices together! This is very modular and you don't have to retrain any index.
  5. (Your solution) Concatenate a one-hot source embedding to the CLIP embeddings and use only one KNN index with a custom query embedding. It is smart, but you end up arbitrarily putting similar elements into different buckets, which reduces the compression quality of your index.
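
To make option 1 concrete, here is a rough sketch with plain faiss (flat indices, random data, and the source names are only illustrative; in practice the per-source indices would be the ones built by autofaiss):

```python
import faiss
import numpy as np

d, k = 768, 10
rng = np.random.default_rng(0)

# One KNN index per source (IndexFlatIP only for illustration;
# these would normally be the indices built by autofaiss).
sources = {
    "facebook": rng.random((5000, d), dtype=np.float32),
    "instagram": rng.random((8000, d), dtype=np.float32),
}
indices = {}
for name, emb in sources.items():
    index = faiss.IndexFlatIP(d)
    index.add(emb)
    indices[name] = index

def search_sources(query, source_names, k=10):
    """Query the selected per-source indices and merge their top-k results."""
    all_scores, all_hits = [], []
    for name in source_names:
        scores, ids = indices[name].search(query, k)
        all_scores.append(scores[0])
        all_hits.extend((name, int(i)) for i in ids[0])
    merged = np.concatenate(all_scores)
    order = np.argsort(-merged)[:k]  # inner product: higher is better
    return [(all_hits[i], float(merged[i])) for i in order]

query = rng.random((1, d), dtype=np.float32)
top10 = search_sources(query, ["facebook", "instagram"], k=10)
```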
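
And a rough sketch of the post-filtering variant of option 2, over a single index holding all sources (the side array mapping ids to sources is an assumption of this sketch, not something faiss or autofaiss provides):

```python
import faiss
import numpy as np

d, k = 768, 10
rng = np.random.default_rng(0)

# One index over all sources, plus a side array telling which source each id belongs to.
all_embeddings = rng.random((20000, d), dtype=np.float32)
id_to_source = rng.integers(0, 3, size=len(all_embeddings))  # 0=facebook, 1=instagram, ...

index = faiss.IndexFlatIP(d)
index.add(all_embeddings)

def search_with_post_filter(query, allowed_sources, k=10, oversample=10):
    """Query oversample*k neighbours, then keep only the ids from the allowed sources."""
    scores, ids = index.search(query, oversample * k)
    kept = [
        (int(i), float(s))
        for s, i in zip(scores[0], ids[0])
        if i != -1 and id_to_source[i] in allowed_sources
    ]
    return kept[:k]  # results come back best-first, so the first k kept hits are the answer

query = rng.random((1, d), dtype=np.float32)
hits = search_with_post_filter(query, allowed_sources={1}, k=10)
```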

I would definitely go for:

  • Option 1 if it is OK to do multiple calls for one query; it is simple and you don't do any custom modifications on your embeddings
  • Option 5 if you want something simple (probably use a one-hot with more weight than just 1 for better performance)
  • Option 4 to have incremental and modular abilities

I hope it helps!

796F commented 1 year ago

Thanks for the reply, Victor; this is super helpful and I really appreciate you laying this out for me. I ended up using Weaviate/Pinecone for their metadata filters; the filtering got much more complex and there wasn't a clean way to hack it anymore.
