Open alexanderguzhva opened 1 month ago
Thanks, @alexanderguzhva for the PR! Could you split out the CMake re-factoring as a separate PR to review that separately? We would be able to move forward with that at a faster speed than the rest of the current PR.
For IndexBruteForceWrapper I am curious what benefits you are seeing from that implementation compared to the current IndexFlat implementations, for example?
I haven't had a chance to dig through the HNSW implementation, but will try to find time for that over the next week.
@asadoughi #3939
wrappers are useful in a situation if you need to grab a index and make it work as it is was a brute-search one. For example, you have HNSWPQ index.
@alexanderguzhva is there a reason not to replace the Faiss HNSW implementation altogether ? There seems to be a lot of duplication in the HNSW wrapper.
@mdouze @asadoughi I was expecting you guys to maybe conduct some performance testing for your internal use cases. As of now, it is a candidate code, and it is placed as a separate one, because it allows to test both baseline and candidate HNSW search in a single application. I will provide updates for kAlpha and range_search
How about benchmarks using the filter dataset from the NeurIPS '23 workshop?
This PR introduces the following:
cmake/link_to_faiss_lib.cmake
, which exposes a useful and reusable CMakelink_to_faiss_lib()
functionIndexWrapper
class, which allows to override functionality of an existing Faiss index using a wrapper programming patternIndexHNSWWrapper
overridessearch()
andrange_search()
methods for IndexHNSW with a search way it is done and used in Milvus (https://github.com/zilliztech/knowhere). Compared to a baseline Faiss version, thissearch()
implementation tries to maintain a stable recall rate with the change of the % of filtered samples. The Faiss baselinerange_search()
implementation for HNSW raises questions...IndexBruteForceWrapper
overridessearch()
andrange_search()
methods to invoke a brute-force searchBitset
utility classCountSizeIOWriter
utility class, which is used for quick-and-dirty way of evaluating the size of a data blob, produced byindex_write()
As of the PR as it is, all code changes are located in
cppcontrib/knowhere
directory. I'm ready for discussion whether certain components need to be moved out ofcppcontrib/knowhere
and moved to a baseline Faiss code. Also, I'm ready to write unit tests and refactoring, just let me know which ones are needed.There is a quick-and-dirty benchmark included in the code, but it operates on a random data. And there is no easy way to grab a real bench data for C++ code (you need
contrib/dataset.py
for BigANN, etc), so I'm ready to discuss python integrations, if needed.