facebookresearch / faiss

A library for efficient similarity search and clustering of dense vectors.
https://faiss.ai
MIT License
31.31k stars 3.63k forks source link

Empty index inconsistency #3830

Open ego-thales opened 2 months ago

ego-thales commented 2 months ago

Hello,

When searching more neighbours than there are samples in the population, one expects -1 returned indexes and infinite (before nan_to_num) distances. I tested this on empty indexes and it works as expected with a few samples but not above a certain population size:

>>> faiss.IndexFlatL2(1).search(np.random.random((19, 1)), 1)  # Works fine
(array([[3.4028235e+38],
       [3.4028235e+38],
       [3.4028235e+38],
       [3.4028235e+38],
       [3.4028235e+38],
       [3.4028235e+38],
       [3.4028235e+38],
       [3.4028235e+38],
       [3.4028235e+38],
       [3.4028235e+38],
       [3.4028235e+38],
       [3.4028235e+38],
       [3.4028235e+38],
       [3.4028235e+38],
       [3.4028235e+38],
       [3.4028235e+38],
       [3.4028235e+38],
       [3.4028235e+38],
       [3.4028235e+38]], dtype=float32), array([[-1],
       [-1],
       [-1],
       [-1],
       [-1],
       [-1],
       [-1],
       [-1],
       [-1],
       [-1],
       [-1],
       [-1],
       [-1],
       [-1],
       [-1],
       [-1],
       [-1],
       [-1],
       [-1]]))
>>> faiss.IndexFlatL2(1).search(np.random.random((20, 1)), 1)  # First inconsistency at 20 samples
(array([[0.],
       [0.],
       [0.],
       [0.],
       [0.],
       [0.],
       [0.],
       [0.],
       [0.],
       [0.],
       [0.],
       [0.],
       [0.],
       [0.],
       [0.],
       [0.],
       [0.],
       [0.],
       [0.],
       [0.]], dtype=float32), array([[4606307992086394560],
       [4604998639883125282],
       [4600655086560625350],
       [4598294196564123122],
       [4604232847489403516],
       [4606679014202959133],
       [4602634729945304042],
       [4606974893601702480],
       [4598449504068727218],
       [4599660145183339918],
       [4589041727904493456],
       [4586442628466231040],
       [4606335022048302453],
       [4597148622011556900],
       [4606505784769942931],
       [4592230883026519384],
       [4603841237871291188],
       [4601734269562976314],
       [4595473575588231508],
       [4606528106805786929]]))

With torch.Tensor inputs, it starts bugging at n_samples=21 and not 20.

May be related to #2135(?)

mdouze commented 2 months ago

Presumably because the threshold on nb queries

https://github.com/facebookresearch/faiss/wiki/Implementation-notes#matrix-multiplication-to-do-many-l2-distance-computations

Probably the ids are not filled in in the BLAS case.