facebookresearch / faiss

A library for efficient similarity search and clustering of dense vectors.
https://faiss.ai
MIT License
31.06k stars 3.61k forks source link

Faiss `pairwise_distances` doesn't always return the same number of zero values as scipy cKDTree `sparse_distance_matrix` or `cdist` #2162

Closed ReagTea closed 2 months ago

ReagTea commented 2 years ago

Summary

Trying to compute distance matrix on two identical vectors with faiss.pairwise_distances and compare it to scipy cKDTree.sparse_distance_matrix frequently fails because it seems that Faiss returns less exactly 0 distance values in the matrix. In consequence, it can happen that the number of non zero values differ between both matrices.
The results of cKDTree.sparse_distance_matrix can always be matched exactly by scipy.spatial.distance.cdist, less frequently so by sklearn.metrics.pairwise_distances.

I can imagine that it could be due to some numerical precision errors, but I'd like to understand algorithmically how Scipy can return the exact expected number of 0 distance values when computing a distance matrix on two same vectors.
Thank you.

Platform

OS: Linux Debian amd64

Faiss version: 1.7.1 MKL / llvm Openmp

Installed from: Conda channel conda-forge

Faiss compilation options: MKL, AVX2

Running on:

Interface:

Reproduction instructions

import numpy as np
import faiss
from scipy.spatial import cKDTree
from scipy.sparse import coo_matrix
# basic random vectors to emulate the range of values in the data.
maxdist = 10.0 # ignore this parameter for now
rng = np.random.default_rng(1234)
X = rng.normal(0, 0.005, size=(5000, 24)).astype(np.float32)
# The problem usually happens when both vectors are the same, when there should be a lot of exactly 0 distance values
Q = X
dis = faiss.pairwise_distances(X, Q)
dis = dis**0.5
dis[dis>maxdist] = 0
dis = coo_matrix(dis)
# vs
t1 = cKDTree(X)
t2 = cKDTree(Q)
sdm = t1.sparse_distance_matrix(t2, max_distance=maxdist, output_type='coo_matrix')
# the shape of dis.data and sdm.data will generally not be the same
# it seems that due to precision errors maybe, there are less exactly 0 distance values in the faiss distance matrix
# hence the number of nz values are greater in the faiss distance matrix.
assert dis.count_nonzero() ==  sdm.count_nonzero()
mdouze commented 2 years ago

This may be due to the transformation into a matrix multiplication, see https://github.com/facebookresearch/faiss/wiki/Implementation-notes#matrix-multiplication-to-do-many-l2-distance-computations

github-actions[bot] commented 3 months ago

This issue is stale because it has been open for 7 days with no activity.

github-actions[bot] commented 2 months ago

This issue was closed because it has been inactive for 7 days since being marked as stale.