facebookresearch / faiss

A library for efficient similarity search and clustering of dense vectors.
https://faiss.ai
MIT License
31k stars 3.61k forks source link

Support as many distances as possible from scipy.spatial.distance #848

Open cjnolet opened 5 years ago

cjnolet commented 5 years ago

There are several algorithms in cuML which need support for different distances. For instance, our NearestNeighbors algorithm should be able to support the distances that sklearn's NearestNeighbors supports.

UMAP-contrib uses nearest neighbor descent and is able to support all of the distances in `scipy.spatial.distance, using a pairwise evaluation function for those which are not L_p based.

It would be really helpful if we were able to do this within FAISS, both supporting more L_p variants within the brute force kNN computation and supporting more distance types in the ANN algorithms overall.

wickedfoo commented 5 years ago

I see. I will push this up on the priority list and write a pairwise distance kernel (akin to GEMM but more general) that will have a pluggable functor for the implementation, but these functors will be chosen at compile time.

mdouze commented 5 years ago

Commenting on the CPU side: Supporting different types of distances should not be too hard for the flat index. The issues are:

mdouze commented 5 years ago

From scipy.spatial.distance:

Observations:

cjnolet commented 4 years ago

I wanted to check in on this in hopes there may be a timeline to get these distances supported. We could really use this feature in UMAP on RAPIDS cuML.

wickedfoo commented 4 years ago

@cjnolet I'm starting this week on this, sorry for the delay.

cjnolet commented 4 years ago

No problem. We are super excited to have this capability!

wickedfoo commented 4 years ago

The latest github push includes this change, at least using metrics that don't require pre- or post-processing with L2.

https://github.com/facebookresearch/faiss/blob/master/MetricType.h#L21

The preferred GPU API has changed as well to bfKnn. This supports float16 input data as well (though at the moment both sets of vectors must be float16).

https://github.com/facebookresearch/faiss/blob/master/gpu/GpuDistance.h#L119

cjnolet commented 4 years ago

A big thank you for providing these new metrics, @wickedfoo and @mdouze! I'm currently in the process of integrating these into cuML

cjnolet commented 4 years ago

@mdouze,

For cosine distance, the dot product of unit vectors is going to return the k-farthest-neighbors. Are there any tricks we could do to flip the computation into a distance using only the vector norm?

One solution could be to expose an option on bfknn() to set whether we want a max or a min-heap. I'm wondering if this might be simplest option? Another option could be to expose a cosine distance. Correlation distance can be derived from cosine, so neither option would be a one-off.

mdouze commented 4 years ago

I am not sure I understand the problem. If you search with max inner product you get the top-k largest dot products. To make it a cosine distance, you can use

||x - y||^2 = ||x||^2 + ||y||^2 - 2 <x, y> = 2 - 2 <x, y>

cjnolet commented 4 years ago

@mdouze, this works. For some reason, I thought the inner product was still using a min heap so I was trying to flip the resulting sort order beforehand. Thanks!