Open movchan74 opened 1 year ago
TBH I ran into the same problem for the baseline of the filtered search track here
https://github.com/harsha-simhadri/big-ann-benchmarks/tree/main/neurips23/filter/faiss
For this use case, it was not a huge problem because it is not too suboptimal to parallelize on the caller's side.
Thanks for pointing me to your implementation, I'll think about how to introduce it with the IDSelectors...
Hello @mdouze,
Thank you for the swift response and the reference to the filtered search track. I wasn't initially aware that FAISS supports parallel search queries. Having tried it, I'm pleasantly surprised by the performance improvements!
For the sake of sharing and possibly helping others, here's a quick rundown of what I observed:
Performance Metrics:
The performance boost is quite significant with parallel requests, achieving almost a 6x speedup compared to sequential queries.
Here's a snippet of how I did it based on the code you provided.
import time
from multiprocessing.pool import ThreadPool
import faiss

pool = ThreadPool(8)

def search(f, query_ids, k=10):
    # Restrict this query to its own allowed subset of IDs.
    sel = faiss.IDSelectorBatch(query_ids)
    params = faiss.SearchParametersIVF(sel=sel, nprobe=index.nprobe)
    return index.search(f, k, params=params)

# `index`, `features`, and `ids` come from our existing setup.
start_time = time.time()
res = pool.starmap(search, [(features[i:i+1], ids[i]) for i in range(8)])
print(time.time() - start_time)
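For reference, the roughly 6x figure is measured against a plain sequential loop over the same helper (a quick sketch, reusing the names defined above):
# Sequential baseline: same `search` helper, one query after the other.
start_time = time.time()
res_seq = [search(features[i:i+1], ids[i]) for i in range(8)]
print(time.time() - start_time)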
While this parallel approach solves our immediate problem, I still believe that introducing batch support for different subsets would be a valuable addition to FAISS. It would streamline the process and potentially benefit others with similar use cases.
Again, I appreciate your attention to this matter and look forward to any further developments on this front!
Given that the parallel approach is currently addressing our immediate needs, should I go ahead and close the issue or would you like to keep it open for future reference and potential implementation of the batch support?
The IDSelectorBatch is efficient for a small number of elements because it has a Bloom filter and a hash table. Let me mark this as an enhancement. It is not clear how to pass in one selector per query, though.
Just wanted to add a +1 for this feature request. I'm doing class-conditional searching and being able to do ID or Range Selectors in a batched fashion would be a nice upgrade over the multiprocessing approach suggested here.
Summary
We're utilizing FAISS for searches within specific subsets of our data. While the current IDSelector supports specifying a subset of IDs for search, it doesn't accommodate independent subsets for each query within a batch. We have our own implementation (found here) that allows this, but we'd love to leverage official implementations for maintainability and potential community contributions.
Platform
OS: Linux
Faiss version: 1.7.4
Installed from: pip
Running on: CPU
Interface: Python
Details
Currently, IDSelector allows us to specify one subset of IDs for all queries in the batch. Our use case involves multiple independent queries with different subsets. Our custom implementation (found here) supports unique subsets for each query vector in the batch, but we're leaning towards leveraging official FAISS tools for broader support and ease of use.
Specifically, here's how we currently use FAISS's subset search:
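In outline, it looks like this (a minimal sketch; `queries`, `allowed_ids`, and `k` are illustrative names, and `index` is an IVF index):
import faiss

# One search call per query, each restricted to its own ID subset.
results = []
for i in range(len(queries)):
    sel = faiss.IDSelectorBatch(allowed_ids[i])                      # subset for query i
    params = faiss.SearchParametersIVF(sel=sel, nprobe=index.nprobe)
    D, I = index.search(queries[i:i+1], k, params=params)            # batch size 1
    results.append((D, I))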
The limitation here is that we can specify only one selector per batch. Since each query vector has its own subset, we can only run the search with batch size 1, which is a few times slower than a batched query. In contrast, our implementation allows independent subsets for each query within a batch.
Proposal
Would it be possible to extend IDSelector or introduce a new selector that allows specifying independent subsets for each query in a batch? This would greatly benefit scenarios where each query needs to search within a unique subset.
Additionally, if there's already a way to achieve this, it would be great to know. Otherwise, we're happy to contribute to this feature with guidance.
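As a purely hypothetical illustration of what this could look like from Python (the sel_per_query argument does not exist in FAISS; it only sketches the intent):
# Hypothetical interface, for illustration only; not current FAISS API.
sels = [faiss.IDSelectorBatch(subset) for subset in allowed_ids]
# One selector per query vector, resolved inside a single batched call.
params = faiss.SearchParametersIVF(sel_per_query=sels, nprobe=index.nprobe)  # hypothetical argument
D, I = index.search(queries, k, params=params)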
Reproduction instructions
Use IDSelectorBatch as shown in the code above.