Performance issue with IndexBinaryHNSW and large number of duplicate entries

bduclaux commented 4 years ago

Hello,

I'm using the C++ IndexBinaryHNSW with large binary vectors (d>100k) in order to cluster a dataset. It works great.

However, when there are a large number of identical vectors, or very close vectors (hamming d<100 bits, so more than 99% of the vector coordinates are the same), I have found that the performance to populate the index drops dramatically (50x drop), compared to the case when such exact duplicates are removed (and only one instance of duplicate points is kept).

I think this is due to the fact that such dataset "traps" the HNSW exploration in highly dense local zones, and hence the algo struggles a lot to get outside these regions and set the corresponding links between points.

The issue is present with the binary version, but I suspect it would remain the same with the dense version.

That being said, I suspect that there could be a more efficient implementation of the HNSW algorithm to manage these cases :

record that point A and point B are equal
don't set 32 / 64 links (M parameter of IndexBinaryHNSW) to the same point when d(A,B)=0 , or d(A,B) < eps, but try to spread them around to other more distant points

At the moment, I am forced to have a preprocessing stage when I am computing vector hashes in order to detect identical points, and filter them before adding them to the IndexBinaryHNSW. It would be wonderful to have this embedded in the FAISS library. This could also open a new method which could be applicable to all types of Indexes: find_duplicates.

Anyway, my immediate concern is the performance issue. I can privately provide a dataset of binary vectors, or a link to see the types of objects which cause the issue.

bduclaux commented 4 years ago

Also, when querying an HNSW index built with duplicate entries, one can see that the search() method struggles to return the K nearest points, when K is large and when the query point has duplicates. Specifically, if I ask for K=500 nn with an index containing 100k entries, it only returns approx. 300. This confirms imho that the HNSW algorithm is stuck in a highly dense region with many duplicates at the query point. Hope it helps !

mdouze commented 4 years ago

Thanks for the careful analysis.

We also noticed that duplicate vectors incur a penalty at search time for IndexIVF variants. It makes sense that this is a problem for HNSW as well.

We decided to not pro-actively detect exact duplicates because this would be a cost for "normal" use cases. However we implemented a layer of de-duplication for IndexIVFFlat (https://github.com/facebookresearch/faiss/blob/36ddba9196f19b640d5ba2ead558d50e02ecde89/IndexIVFFlat.h#L67).

For the approximate duplicates case it is more complex because we cannot use hashes to identify duplicates, and introducing an epsilon is bound to be arbitrary.

bduclaux commented 4 years ago

Thanks, this make sense. I have also implemented exact duplicates detection using a similar approach:

a hash of the vectors (computed with xxhash xxh3, very fast, but beware of potential collisions if dataset is large, see birthday paradox)
an unordered_multimap, exactly like in your IndexIVFFlatDedup.

Near duplicates detection before adding vectors in the Index is a trickier problem. Potential approaches are MinHash, SimHash, but they usually involve some trade-offs (memory or cpu), and they are slower. Makes sense not to tackle this issue in FAISS.

Closing the issue, and thanks for the help!

mdouze commented 4 years ago

added a FAQ entry https://github.com/facebookresearch/faiss/wiki/FAQ#searching-duplicate-vectors-is-slow

facebookresearch / faiss

Performance issue with IndexBinaryHNSW and large number of duplicate entries #1097