facebookresearch / faiss

A library for efficient similarity search and clustering of dense vectors.
https://faiss.ai
MIT License

Wrappers and a customized HNSW search #3926

Open alexanderguzhva opened 1 month ago

alexanderguzhva commented 1 month ago

This PR introduces the following:

As the PR stands, all code changes are located in the cppcontrib/knowhere directory. I'm ready to discuss whether certain components should be moved out of cppcontrib/knowhere into the baseline Faiss code. I'm also ready to write unit tests and do refactoring; just let me know which ones are needed.

There is a quick-and-dirty benchmark included in the code, but it operates on random data. There is no easy way to grab real benchmark data for C++ code (you need contrib/dataset.py for BigANN, etc.), so I'm ready to discuss Python integrations if needed.

asadoughi commented 1 month ago

Thanks, @alexanderguzhva, for the PR! Could you split out the CMake refactoring as a separate PR so we can review it on its own? We would be able to move forward with that faster than with the rest of the current PR.

For IndexBruteForceWrapper, I am curious what benefits you are seeing from that implementation compared to the current IndexFlat implementations, for example?

I haven't had a chance to dig through the HNSW implementation, but will try to find time for that over the next week.

alexanderguzhva commented 1 month ago

@asadoughi #3939

alexanderguzhva commented 1 month ago

Wrappers are useful when you need to take an existing index and make it behave as if it were a brute-force one. For example, suppose you have an HNSWPQ index.
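The concept can be sketched in plain Python (all names here are illustrative stand-ins, not the PR's actual C++ API in cppcontrib/knowhere): a wrapper that reconstructs the stored vectors from any wrapped index and answers queries by exhaustive search, so even a graph-based index like HNSWPQ can serve exact, brute-force results.

```python
# Illustrative sketch only; the real IndexBruteForceWrapper is C++ code
# in the PR, and ToyIndex is a hypothetical stand-in for a Faiss index.

class ToyIndex:
    """Stand-in for an index that can reconstruct its stored vectors."""
    def __init__(self, vectors):
        self.vectors = vectors      # list of coordinate tuples
        self.ntotal = len(vectors)

    def reconstruct(self, i):
        return self.vectors[i]

class BruteForceWrapper:
    """Answer queries exhaustively over the wrapped index's vectors."""
    def __init__(self, index):
        self.index = index

    def search(self, query, k):
        dists = []
        for i in range(self.index.ntotal):
            v = self.index.reconstruct(i)
            d = sum((a - b) ** 2 for a, b in zip(query, v))  # squared L2
            dists.append((d, i))
        dists.sort()
        return dists[:k]

idx = ToyIndex([(0.0, 0.0), (1.0, 0.0), (0.0, 2.0)])
bf = BruteForceWrapper(idx)
print(bf.search((0.9, 0.1), k=2))   # nearest ids: 1, then 0
```

For a quantizing index the reconstructed vectors are lossy, so the "brute-force" results are exact with respect to the codes, not the original data.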

mdouze commented 1 month ago

@alexanderguzhva is there a reason not to replace the Faiss HNSW implementation altogether? There seems to be a lot of duplication in the HNSW wrapper.

asadoughi commented 1 month ago
  1. Building on top of @mdouze's comment, we should get behavior equivalent to FAISS if we initialize kAlpha to 1.0f, correct? More generally, are there any backwards-incompatible behavior changes?
  2. Could you elaborate on your insights regarding range_search, please? Perhaps share a particular dataset where the proposed HNSW implementation performed significantly better than the current implementation. Thanks!
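My reading of the kAlpha question, sketched below, is an assumption based only on this thread (the exact semantics live in the PR's C++): kAlpha acts as a multiplier on the distance bound used when deciding whether to keep exploring a neighbor during graph traversal, so that kAlpha = 1.0f reduces to the baseline criterion.

```python
# Hypothetical sketch of an alpha-scaled traversal rule; the name
# k_alpha and the exact condition are assumptions, not the PR's code.

def should_visit(candidate_dist, worst_in_topk, k_alpha=1.0):
    """Keep exploring a neighbor if its distance beats the current
    worst top-k distance scaled by k_alpha; k_alpha == 1.0 matches
    the usual greedy HNSW acceptance test."""
    return candidate_dist < worst_in_topk * k_alpha

# With k_alpha = 1.0 this is the baseline test ...
print(should_visit(0.9, 1.0))               # True
print(should_visit(1.1, 1.0))               # False
# ... while k_alpha > 1.0 widens the search frontier.
print(should_visit(1.1, 1.0, k_alpha=1.2))  # True
```

Under this reading, a backwards-compatibility check reduces to verifying that every decision made with k_alpha = 1.0 matches the baseline implementation.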
alexanderguzhva commented 1 month ago

@mdouze @asadoughi I was expecting you to perhaps run some performance testing on your internal use cases. For now it is candidate code, and it is kept separate because that allows testing both the baseline and the candidate HNSW search in a single application. I will provide updates for kAlpha and range_search.

asadoughi commented 1 month ago

How about benchmarks using the filter dataset from the NeurIPS '23 workshop?
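Whatever dataset is chosen, the headline metric for such a comparison is recall@k against exact ground truth. A minimal stdlib sketch (dataset loading and the ground-truth computation are omitted; `recall_at_k` is an illustrative helper, not part of Faiss):

```python
def recall_at_k(found, ground_truth, k):
    """Fraction of the true top-k ids that the index returned,
    averaged over all queries."""
    hits = 0
    for f, gt in zip(found, ground_truth):
        hits += len(set(f[:k]) & set(gt[:k]))
    return hits / (k * len(ground_truth))

# Two toy queries: 2 of 3 true ids found, then 3 of 3.
found = [[1, 5, 9], [2, 4, 8]]
gt    = [[1, 9, 7], [2, 4, 8]]
print(recall_at_k(found, gt, k=3))  # 5 hits / 6 slots = 0.8333...
```

For the filtered-search track, the same metric applies; only the ground truth is restricted to vectors passing each query's filter.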