facebookresearch / faiss

A library for efficient similarity search and clustering of dense vectors.
https://faiss.ai
MIT License
30.5k stars 3.56k forks source link

Issue with search on testing post filtering #3730

Closed Jae0Kang closed 4 days ago

Jae0Kang commented 1 month ago

Summary

While testing post-filtering that I implemented with Faiss in C++, I encountered an issue where the KNN search does not return the expected neighbors connected in the graph, even though k and efSearch values were increased.

Dataset: Used Qdrant's ANN filtering benchmark dataset (H&M clothing data) from this repository. The dataset size is 105,100. Dimension is 2048.

Steps of post-filtering:

  1. Perform a KNN search and filter the results.
  2. If the filtered result is less than k, double the k value and run the KNN search again, continuing this process until k becomes larger than the dataset size (105,100).

Test Case: Parameter I used:

Test result:

Comparison with Answers: Compared KNN results with answers created by using indexflatl2. Test answers (format: (index, distance)):

Additional Checks: Verified if (92552, 178.385) is connected to any node in the KNN search result of using index.hnsw.neighbors[]. The KNN search returns neighbors of (92552, 178.385) with longer distances but does not return (92552, 178.385) itself.

Question: My understanding is that the KNN search should return the k neighbors with the shortest distances to the query. Is this an expected result, or is there a potential error in the search implementation?

Platform

OS: Linux

Faiss version:

Installed from:

Faiss compilation options:

Running on:

Interface:

Reproduction instructions

mdouze commented 1 month ago

It could be some quirk with the data distribution or the implementation. Which vector(s) did you use as queries? The vectors themselves?

mdouze commented 1 month ago

I tried to repro the issue, see issue_3730.ipynb

I did not see anything suspicious. Could you precisely state which vector you expect to find given which query vector?

Jae0Kang commented 1 month ago

Yes. There are 10,000 test cases in qdrant's second benchmark which has data of h&m. I ran knn search with k = 300, and I expected vectors whose indexes are

asadoughi commented 1 month ago

@Jae0Kang Can you provide a way to completely reproduce the issue you are seeing? For example, @mdouze asked for specific the query vector as part of your search. We can't investigate the issue further until we have a way to get the same results you are seeing.

Jae0Kang commented 1 month ago

@asadoughi this is the code that I tested and the result text file. As you can see in the text file, it is showing there are 5156 data which is satisfying the filtering condition. However, the accuracy (num. of intersection of indexflatl2 and hnsw knn / num. of indexflatl2) of post filtering is 0.1. I would really appreciate if you can take a look of it.. faiss_postfilter_code.txt faiss_postfilter.txt

github-actions[bot] commented 3 weeks ago

This issue is stale because it has been open for 7 days with no activity.

asadoughi commented 3 weeks ago

Your code has references to data files that were not shared. We cannot reproduce the issue without a complete example.

mdouze commented 2 weeks ago

Catching up on this Thanks for posting the C++ reproducing the issue, but it is too complex (you provided the data before). I still don't understand which is/are the query vector(s) that you are using? This would make it possible to reproduce in python.

github-actions[bot] commented 1 week ago

This issue is stale because it has been open for 7 days with no activity.

github-actions[bot] commented 4 days ago

This issue was closed because it has been inactive for 7 days since being marked as stale.