facebookresearch / faiss

A library for efficient similarity search and clustering of dense vectors.
https://faiss.ai
MIT License
31.32k stars 3.63k forks source link

Extend IDSelector for Batch Subset Search #3046

Open movchan74 opened 1 year ago

movchan74 commented 1 year ago

Summary

We're utilizing FAISS for searches within specific subsets of our data. While the current IDSelector supports specifying a subset of IDs for search, it doesn't accommodate independent subsets for each query within a batch. We have our own implementation (found here) that allows this, but we'd love to leverage official implementations for maintainability and potential community contributions.

Platform

OS: Linux Faiss version: 1.7.4 Installed from: pip Running on: CPU Interface: Python

Details

Currently, IDSelector allows us to specify one subset of IDs for all queries in the batch. Our use-case involves having multiple independent queries with different subsets. Our custom implementation (found here) supports unique subsets for each query vector in the batch, but we're leaning towards leveraging official FAISS tools for broader support and ease of use.

Specifically, here's how we currently use FAISS's subset search:

ids = np.array([633, 9632, 3705, 9614, 42, 1665]).astype('int64')
sel = faiss.IDSelectorBatch(ids)
params = faiss.SearchParametersIVF(sel=sel, nprobe=index.nprobe)
index.search(features[:1], 10, params=params)

The limitation here is that we can specify only one selector per batch. And since we have different subsets for each vector, we only can run it with batch size 1 which is a few times slower than batch query. In contrast, our implementation allows independent subsets for each query within a batch.

Proposal

Would it be possible to extend IDSelector or introduce a new selector that allows for specifying independent subsets for each query in a batch? This would greatly benefit scenarios where each query might need to search within a unique subset.

Additionally, if there's already a way to achieve this, it would be great to know. Otherwise, we're happy to contribute to this feature with guidance.

Reproduction instructions

  1. Utilize the current IDSelectorBatch as shown in the code above.
  2. Note that one can only specify a single subset for all queries in the batch.
mdouze commented 1 year ago

TBH I ran into the same problem for the baseline of the filtered search track here

https://github.com/harsha-simhadri/big-ann-benchmarks/tree/main/neurips23/filter/faiss

For this use case, it was not a huge problem because it is not too suboptimal to parallelize on the caller's side.

Thanks for pointing me to your implementation, I'll think about how to introduce it with the IDSelectors...

movchan74 commented 1 year ago

Hello @mdouze,

Thank you for the swift response and the reference to the filtered search track. I wasn't initially aware that FAISS supports parallel search queries. Having tried it, I'm pleasantly surprised by the performance improvements!

For the sake of sharing and possibly helping others, here's a quick rundown of what I observed:

Performance Metrics:

The performance boost is quite significant with parallel requests, achieving almost a 6x speedup compared to sequential queries.

Here's a snippet of how I did it based on the code you provided.

from multiprocessing.pool import ThreadPool

pool = ThreadPool(8)

def search(f, ids, k=10):
    sel = faiss.IDSelectorBatch(ids)
    params = faiss.SearchParametersIVF(sel=sel, nprobe=index.nprobe)
    return index.search(f, k, params=params)

start_time = time.time()
res = pool.starmap(search, [(features[i:i+1], ids[i]) for i in range(8)])
print(time.time() - start_time)

While this parallel approach solves our immediate problem, I still believe that introducing batch support for different subsets would be a valuable addition to FAISS. It would streamline the process and potentially benefit others with similar use cases.

Again, I appreciate your attention to this matter and look forward to any further developments on this front!

Given that the parallel approach is currently addressing our immediate needs, should I go ahead and close the issue or would you like to keep it open for future reference and potential implementation of the batch support?

mdouze commented 1 year ago

The IDSelectorBatch is efficient for small number of elements because it has a bloom filter and a hash table. Let me mark this as an enhancement. It is not clear how to pass in one selector per query though.

mniedoba commented 10 months ago

Just wanted to add a +1 for this feature request. I'm doing class-conditional searching and being able to do ID or Range Selectors in a batched fashion would be a nice upgrade over the multiprocessing approach suggested here.