Open cjnolet opened 5 years ago
I see. I will push this up on the priority list and write a pairwise distance kernel (akin to GEMM but more general) that will have a pluggable functor for the implementation, but these functors will be chosen at compile time.
Commenting on the CPU side: Supporting different types of distances should not be too hard for the flat index. The issues are:
From scipy.spatial.distance:
Observations:
I wanted to check in on this in hopes there may be a timeline to get these distances supported. We could really use this feature in UMAP on RAPIDS cuML.
@cjnolet I'm starting this week on this, sorry for the delay.
No problem. We are super excited to have this capability!
The latest github push includes this change, at least using metrics that don't require pre- or post-processing with L2.
https://github.com/facebookresearch/faiss/blob/master/MetricType.h#L21
The preferred GPU API has changed as well to bfKnn
. This supports float16 input data as well (though at the moment both sets of vectors must be float16).
https://github.com/facebookresearch/faiss/blob/master/gpu/GpuDistance.h#L119
A big thank you for providing these new metrics, @wickedfoo and @mdouze! I'm currently in the process of integrating these into cuML
@mdouze,
For cosine distance, the dot product of unit vectors is going to return the k-farthest-neighbors. Are there any tricks we could do to flip the computation into a distance using only the vector norm?
One solution could be to expose an option on bfknn()
to set whether we want a max or a min-heap. I'm wondering if this might be simplest option? Another option could be to expose a cosine distance. Correlation distance can be derived from cosine, so neither option would be a one-off.
I am not sure I understand the problem. If you search with max inner product you get the top-k largest dot products. To make it a cosine distance, you can use
||x - y||^2 = ||x||^2 + ||y||^2 - 2 <x, y> = 2 - 2 <x, y>
@mdouze, this works. For some reason, I thought the inner product was still using a min heap so I was trying to flip the resulting sort order beforehand. Thanks!
There are several algorithms in cuML which need support for different distances. For instance, our NearestNeighbors algorithm should be able to support the distances that sklearn's NearestNeighbors supports.
UMAP-contrib uses nearest neighbor descent and is able to support all of the distances in `scipy.spatial.distance, using a pairwise evaluation function for those which are not L_p based.
It would be really helpful if we were able to do this within FAISS, both supporting more L_p variants within the brute force kNN computation and supporting more distance types in the ANN algorithms overall.