Wrappers and a customized HNSW search

alexanderguzhva commented 1 month ago

This PR introduces the following:

cmake/link_to_faiss_lib.cmake, which exposes a useful and reusable CMake link_to_faiss_lib() function
IndexWrapper class, which allows to override functionality of an existing Faiss index using a wrapper programming pattern
IndexHNSWWrapper overrides search() and range_search() methods for IndexHNSW with a search way it is done and used in Milvus (https://github.com/zilliztech/knowhere). Compared to a baseline Faiss version, this search() implementation tries to maintain a stable recall rate with the change of the % of filtered samples. The Faiss baseline range_search() implementation for HNSW raises questions...
IndexBruteForceWrapper overrides search() and range_search() methods to invoke a brute-force search
Bitset utility class
CountSizeIOWriter utility class, which is used for quick-and-dirty way of evaluating the size of a data blob, produced by index_write()

As of the PR as it is, all code changes are located in cppcontrib/knowhere directory. I'm ready for discussion whether certain components need to be moved out of cppcontrib/knowhere and moved to a baseline Faiss code. Also, I'm ready to write unit tests and refactoring, just let me know which ones are needed.

There is a quick-and-dirty benchmark included in the code, but it operates on a random data. And there is no easy way to grab a real bench data for C++ code (you need contrib/dataset.py for BigANN, etc), so I'm ready to discuss python integrations, if needed.

asadoughi commented 1 month ago

Thanks, @alexanderguzhva for the PR! Could you split out the CMake re-factoring as a separate PR to review that separately? We would be able to move forward with that at a faster speed than the rest of the current PR.

For IndexBruteForceWrapper I am curious what benefits you are seeing from that implementation compared to the current IndexFlat implementations, for example?

I haven't had a chance to dig through the HNSW implementation, but will try to find time for that over the next week.

alexanderguzhva commented 1 month ago

@asadoughi #3939

alexanderguzhva commented 1 month ago

wrappers are useful in a situation if you need to grab a index and make it work as it is was a brute-search one. For example, you have HNSWPQ index.

mdouze commented 1 month ago

@alexanderguzhva is there a reason not to replace the Faiss HNSW implementation altogether ? There seems to be a lot of duplication in the HNSW wrapper.

asadoughi commented 1 month ago

Building on top of @mdouze's comment, we should get the equivalent behavior as FAISS if we initialize kAlpha to 1.0f, correct? More generally, are there any backwards incompatible behavior changes?
Could you elaborate on your insights regarding range_search, please? Perhaps share a particular dataset where the proposed HNSW implementation performed significantly better than the current implementation. Thanks!

alexanderguzhva commented 1 month ago

@mdouze @asadoughi I was expecting you guys to maybe conduct some performance testing for your internal use cases. As of now, it is a candidate code, and it is placed as a separate one, because it allows to test both baseline and candidate HNSW search in a single application. I will provide updates for kAlpha and range_search

asadoughi commented 1 month ago

How about benchmarks using the filter dataset from the NeurIPS '23 workshop?

facebookresearch / faiss

Wrappers and a customized HNSW search #3926