facebookresearch / faiss

A library for efficient similarity search and clustering of dense vectors.
https://faiss.ai
MIT License
30.91k stars 3.61k forks source link

range_search doesn't support a radius vector #2157

Open jurrutiag opened 2 years ago

jurrutiag commented 2 years ago

It caught my attention that range_search doesn't support a vector of radius values as an input, such that each value correspond to the radius to query each row. Would this be faster than query_radius from sklearn's KDTree? If not, is this the reason this is not implemented?

This could be a very good feature since it is the last thing needed for a faster (and accurate) calculation of Kraskov's mutual information.

mdouze commented 2 years ago

I don't understand the purpose of a per-row radius. Do you mean per vector component? Could you give the mathematical definition of the operation?

jurrutiag commented 2 years ago

The purpose is to have a radius per sample, i will provide a mathematical definition with an example:

In the Kraskov mutual information estimator [1] we have N samples X_i, each with d features:

Therefore we have N vectors of dimension d. We want to estimate the mutual information of these samples with respect to a vector of size N. This estimator consists of two parts, first we calculate a radius for each sample and then we calculate how many samples lie within this radius for each sample. So for the second part we define a number nx_i containing the number of samples that lie within radius r_i around sample X_i. This means that each sample has assigned its own radius which most likely will be different to other radiuses.

[1] A. Kraskov, H. Stögbauer, and P. Grassberger, “Estimating mutual information,” Physical Review E, vol. 69, Jun 2004.