question about the interpretation of hubness measurements

ivan-marroquin commented 2 years ago

Hi,

Many thanks for such an interesting package!

I have a question on how to interpret the measures for hubness. Are these measures bounded (like say [0, 1])? Do you have references that describe how to interpret them?

For example, I ran an analysis on my data set and computed several hubness measures for different numbers of neighbors (the script and data set are in the attached zip file). I got these results:

for k = 5: 2.4749 (skewness), 0.8889 (Atkinson), 0.8889 (Gini), 0.8889 (Robin Hood), 1.0 (hub occurrence), 0.8889 (anti-hub occurrence)

for k = 15: 0.7071 (skewness), 0.6667 (Atkinson), 0.6667 (Gini), 0.6667 (Robin Hood), 1.0 (hub occurrence), 0.6667 (anti-hub occurrence)

for k = 30: -0.7071 (skewness), 0.3333 (Atkinson), 0.3333 (Gini), 0.3333 (Robin Hood), 1.0 (hub occurrence), 0.3333 (anti-hub occurrence)

Judging by skewness, Atkinson, Gini, and Robin Hood, hubness decreases as the number of neighbors used to measure it increases. Is this interpretation correct? And when can an observed value be considered high (or low)?

What about hub and anti-hub occurrence? How can hub occurrence stay high (and constant) while anti-hub occurrence decreases as the number of neighbors increases? And why is anti-hub occurrence not very low when hub occurrence is 1?

I thank you in advance for your comments and clarifications.

Kind regards,

Ivan

script_and_test_data.zip

VarIr commented 2 years ago

Hi Ivan,

thanks for your continuing interest.

References: Skewness (of the k-occurrence distribution) was introduced as a hubness measure by Radovanović et al. in their seminal 2010 paper. The Robin Hood index was introduced as a hubness measure by Feldbauer et al., 2018 (there is a link to an open-access technical report in the README if you don't have access). In the experiments for that paper, we investigated multiple measures from inequality research (Atkinson, Gini, ...). They are still in the code but were never discussed in any publication, because we only followed up on the Robin Hood index. The Robin Hood index tells you what fraction of all nearest-neighbor slots would have to be redistributed so that every sample occurs equally often as a nearest neighbor of other samples (which would mean no hubness at all). For hub and antihub occurrence, please have a look at Flexer and Schnitzer, 2015.
Bounds: Skewness is not bounded (one of the reasons I find it hard to interpret). Robin Hood is in [0, 1). Hub and anti-hub occurrence are fractions in [0, 1].
Dependence on the k in k-occurrence: Hubness measures generally decrease as k increases. Intuitively, as k approaches the number of samples in the dataset, every sample becomes a nearest neighbor of every other sample, and the k-occurrence distribution becomes uniform, with skewness = 0. Such a measurement is rather meaningless, so small values of k, like 5 or 10, are typically used to describe hubness.
(Anti)hub occurrence: These two describe slightly different things. Antihub occurrence is the fraction of all samples that are antihubs (samples that never occur as a nearest neighbor), whereas hub occurrence is the fraction of all nearest-neighbor slots occupied by hubs (that is, it is not the fraction of all samples that are hubs). So the two measures are not tightly coupled. In your example, all nearest-neighbor slots are occupied by hubs in all cases. It seems that the number of hubs grows as k increases from 5 to 30, which reduces the number of antihubs. (I could not look at your data, though.)
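
To make the definitions above concrete, here is a minimal pure-Python sketch of how these measures can be computed from a k-occurrence vector. It is an illustration only, not the scikit-hubness implementation: the helper names are hypothetical, the distance is plain Euclidean, and the hub threshold (a sample counts as a hub when its k-occurrence exceeds `hub_size * k`, with `hub_size=2`) is an assumption made for this example.

```python
import math
import random


def k_occurrence(points, k):
    """Count how often each point is among the k nearest neighbors of the others."""
    n = len(points)
    counts = [0] * n
    for i in range(n):
        # Brute-force Euclidean distances from point i to all other points.
        dists = sorted(
            (math.dist(points[i], points[j]), j) for j in range(n) if j != i
        )
        for _, j in dists[:k]:
            counts[j] += 1
    return counts


def skewness(counts):
    """Third standardized moment of the k-occurrence distribution (unbounded)."""
    n = len(counts)
    mean = sum(counts) / n
    m2 = sum((c - mean) ** 2 for c in counts) / n
    m3 = sum((c - mean) ** 3 for c in counts) / n
    return m3 / m2 ** 1.5 if m2 > 0 else 0.0


def robin_hood(counts):
    """Fraction of neighbor slots to redistribute for all counts to be equal."""
    mean = sum(counts) / len(counts)
    return 0.5 * sum(abs(c - mean) for c in counts) / sum(counts)


def antihub_occurrence(counts):
    """Fraction of samples that never occur as a nearest neighbor."""
    return sum(1 for c in counts if c == 0) / len(counts)


def hub_occurrence(counts, k, hub_size=2.0):
    """Fraction of neighbor slots occupied by hubs (assumed: count > hub_size * k)."""
    return sum(c for c in counts if c > hub_size * k) / sum(counts)


if __name__ == "__main__":
    random.seed(0)
    n, dim = 100, 10
    # Synthetic Gaussian data, used here only to have something to measure.
    data = [tuple(random.gauss(0, 1) for _ in range(dim)) for _ in range(n)]
    for k in (5, 15, 30, n - 1):
        occ = k_occurrence(data, k)
        print(
            k,
            round(skewness(occ), 4),
            round(robin_hood(occ), 4),
            round(hub_occurrence(occ, k), 4),
            round(antihub_occurrence(occ), 4),
        )
```

In the last iteration, k equals the number of samples minus one, so every sample is a neighbor of every other sample: all counts are equal, and skewness, Robin Hood, and antihub occurrence are all 0. Note also that hub occurrence is normalized by the number of neighbor slots, whereas antihub occurrence is normalized by the number of samples, which is why the two can be high at the same time.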

I hope I could shed some light on these topics.

ivan-marroquin commented 2 years ago

Hi @VarIr

Many thanks for sharing all this information. It is very helpful.

Ivan

VarIr / scikit-hubness

question about the interpretation of hubness measurements #108