VarIr / scikit-hubness

A Python package for hubness analysis and high-dimensional data mining
BSD 3-Clause "New" or "Revised" License

On the computation of antihub occurrence #75

Closed ivan-marroquin closed 2 years ago

ivan-marroquin commented 2 years ago

Hi,

Thanks for such a great package!

Looking at the Python code to compute antihub occurrence:

```python
antihubs = np.argwhere(k_occurrence == 0).ravel()
antihub_occurrence = antihubs.size / k_occurrence.size
```

it seems to me that the antihub occurrence should instead be computed as:

```python
antihub_occurrence = antihubs.size / n_test
```

That way, the computation of antihub occurrence would better match the definition outlined in the paper https://www.sciencedirect.com/science/article/pii/S0925231215004336 (see equation 5). The computation of hub occurrence, as defined in the Python code, already follows the formulation published in that paper (see equation 6).

Kind regards,

Ivan

VarIr commented 2 years ago

Hi Ivan,

thanks for checking all the details thoroughly! Indeed, it seems counter-intuitive that hub-occurrence and antihub-occurrence are not normalized by the same number, but I think it's correct.

The paper says "[antihub-occurrence] is the percentage of data points that act as anti-hubs". Here we are dealing with the indexed objects (the training set), because only they can be nearest neighbors (we calculate distances between queries and index, but not within the queries). Therefore, we divide by n_train. On the other hand, "[hub-occurrence] is the percentage of hub points in all NN lists". We have a nearest-neighbor list of length k for each of the queries (the test objects), so we divide by k * n_test. Does this make sense?
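
To make this concrete, here is a minimal numpy sketch of both scores. Variable names are illustrative, and the hub threshold hub_size * k is an assumption following the usual convention, not necessarily our exact internals:

```python
import numpy as np

def occurrence_scores(neigh_ind, n_train, hub_size=2):
    """neigh_ind: array of shape (n_test, k) holding, for each query,
    the indices of its k nearest neighbors among n_train indexed points."""
    n_test, k = neigh_ind.shape

    # k-occurrence: how often each indexed point appears in any NN list
    k_occurrence = np.bincount(neigh_ind.ravel(), minlength=n_train)

    # antihub-occurrence: fraction of *indexed* points that never occur
    # as a nearest neighbor -> normalize by n_train
    antihubs = np.argwhere(k_occurrence == 0).ravel()
    antihub_occurrence = antihubs.size / n_train

    # hub-occurrence: fraction of all k * n_test NN-list slots occupied
    # by hubs (here assumed: points occurring at least hub_size * k times)
    hubs = np.argwhere(k_occurrence >= hub_size * k).ravel()
    hub_occurrence = k_occurrence[hubs].sum() / (k * n_test)

    return antihub_occurrence, hub_occurrence
```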

Unfortunately, not all literature on hubness explicitly considers the case where we don't have all-against-all distances between query objects and indexed objects. This has been a source of confusion for us as well.

Best, Roman

ivan-marroquin commented 2 years ago

Hi @VarIr

Thanks for your prompt reply. Following your explanation and the code, I now understand the logic behind the computation better. There are two regimes: (1) when the score function receives test data, and (2) when a precomputed distance matrix is used.

So then, the antihub occurrence is divided by k_occurrence.size, which carries either n_test (first regime) or n_train (second regime). Note that n_test == n_train when using a precomputed distance matrix.
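
For anyone reading later, a small sketch of the two regimes, using scikit-learn for the neighbor search (purely illustrative, not the package's internals):

```python
import numpy as np
from scipy.spatial.distance import cdist
from sklearn.neighbors import NearestNeighbors

rng = np.random.default_rng(0)
X_train = rng.normal(size=(100, 50))
X_test = rng.normal(size=(20, 50))
k = 5

# Regime 1: separate test queries against the indexed training set.
nn = NearestNeighbors(n_neighbors=k).fit(X_train)
neigh_ind = nn.kneighbors(X_test, return_distance=False)   # shape (n_test, k)
k_occurrence = np.bincount(neigh_ind.ravel(), minlength=X_train.shape[0])

# Regime 2: a precomputed all-against-all distance matrix on one data set,
# so queries and index coincide and n_test == n_train. Note that each point
# then appears in its own NN list (distance 0) unless self-hits are removed.
D = cdist(X_train, X_train)
nn_pre = NearestNeighbors(n_neighbors=k, metric="precomputed").fit(D)
neigh_ind_pre = nn_pre.kneighbors(D, return_distance=False)  # shape (n_train, k)
```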

Ivan