VarIr / scikit-hubness

A Python package for hubness analysis and high-dimensional data mining

DisSim implementation correct? #109

Open davnn opened 1 year ago

davnn commented 1 year ago

Hi! Thanks for your work on the package. I'm currently trying to understand the implementation of DisSim, which seems to differ from the paper. If idx and dist contain the indices and distances to the n-1 neighbors, equation 8 from the paper would be something like:

import numpy as np

def dissim(x, idx, dist):
    # local centroid of each point's neighbors: (n, d)
    centroids = x[idx].mean(axis=1)
    dist_x, dist_y = x - centroids, x[idx] - centroids[idx]
    # squared distances to the local centroids: (n, 1) and (n, k)
    norm_dist_x = np.einsum("ij,ij -> i", dist_x, dist_x)[:, np.newaxis]
    norm_dist_y = np.einsum("ijk,ijk -> ij", dist_y, dist_y)
    # Euclidean variant: subtract the (non-squared) centroid distances
    return dist - np.sqrt(norm_dist_x) - np.sqrt(norm_dist_y)
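
With idx and dist coming e.g. from scikit-learn's NearestNeighbors, a quick sketch just to make the shapes concrete (not my actual setup):

import numpy as np
from sklearn.neighbors import NearestNeighbors

rng = np.random.default_rng(0)
x = rng.normal(size=(100, 16))
# kneighbors on the training data returns each point as its own first neighbor,
# so query all n points and drop column 0 to keep the n-1 actual neighbors
dist, idx = NearestNeighbors(n_neighbors=x.shape[0]).fit(x).kneighbors(x)
dist, idx = dist[:, 1:], idx[:, 1:]
secondary_dist = dissim(x, idx, dist)  # shape (n, n-1)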

The implemented version returns something like np.sqrt(dist - norm_dist_x - norm_dist_y), where the norms are generated slightly differently. Is this intended and only a documentation problem?

VarIr commented 1 year ago

Hi David, thanks for your interest in the package. I am not very familiar with the einsum notation, so I cannot comment on those parts. What differences do you see in how the norms are calculated? Note, btw, that the three terms in (8) are squared Euclidean distances, so I assume there should be no np.sqrts. (Also, the final sqrt in our implementation is a non-default option of, let's say, questionable usefulness.)
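
In your notation, I'd expect (8) to read without any sqrts, roughly (just a sketch reusing your variable names, with dist then holding squared distances):

# sketch only: with `dist` holding *squared* Euclidean distances, (8) is simply
secondary_dist = dist - norm_dist_x - norm_dist_y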

davnn commented 1 year ago

Sorry, I posted the Euclidean distance version because that is the version I am interested in, but you are right: the default should be squared (and it is!).

I am mainly concerned about the squared=False option, because I would expect Euclidean distances to be used if squared=False, but instead np.sqrt is applied over the whole expression, leading to a questionable result. Regarding the norm calculation, it's quite tricky to identify where the calculation begins to differ, but it does differ from my proposal, which I am also not certain is correct. A small nitpick is that the reference implementation (I only checked that one) mutates the input dist on transform, but I can send a PR if you would like to change that.
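
For the mutation nitpick, all I mean is copying the input before modifying it, roughly like this (hypothetical sketch with a placeholder name, not the actual code):

import numpy as np

def transform_without_mutation(neigh_dist):
    # placeholder sketch: copy first, so the caller's distance array is
    # never modified in place
    neigh_dist = np.asarray(neigh_dist, dtype=float).copy()
    # ... the existing DisSim computation would then operate on the copy ...
    return neigh_dist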

VarIr commented 1 year ago

I wonder if there's really something like a "Euclidean distance version". If there is, I assume it should be something like

np.sqrt(squared_dist - sq_norm_dist_x - sq_norm_dist_y)

instead of

dist - np.sqrt(norm_dist_x) - np.sqrt(norm_dist_y)

In the end, in general, sqrt(a - b) != sqrt(a) - sqrt(b). I really do not remember the derivation of DisSim in detail, but I think there was a reason why Hara et al. used squared Euclidean distances specifically. Switching to Euclidean distances before subtracting the local centroid distances could have an effect on how effectively hubness is reduced. (For details on that, the authors of that paper might be of more help.)
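
Just to make that point concrete with numbers (a throwaway check, not library code):

import numpy as np

a, b = 25.0, 9.0
print(np.sqrt(a - b))           # 4.0
print(np.sqrt(a) - np.sqrt(b))  # 2.0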

Could you point me to where this squared=False is, ideally with a link to the exact line? I only seem to find this, where it's True.

PRs always welcome :)

davnn commented 1 year ago

The authors likely use squared Euclidean distances because they are marginally cheaper and easier to compute; otherwise I don't see an argument for it. In my case, the input distance matrix is Euclidean, so it makes more sense to treat the centroid distances as Euclidean as well. Your assumption is also what is implemented in your library, but I believe that assumption might be wrong.

dist - np.sqrt(norm_dist_x) - np.sqrt(norm_dist_y)

should be right if dist is Euclidean, because then you subtract Euclidean distances from a Euclidean distance. In your reference algorithm, you subtract squared distances and np.sqrt the final result, yielding something that is interpretable neither as a Euclidean nor as a squared Euclidean distance.
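
To make it explicit, the branch I would expect looks roughly like this (hypothetical sketch with made-up names, not a patch against reference_algorithms.py):

import numpy as np

def dissim_local(dist_sq, sq_norm_x, sq_norm_y, squared=True):
    # dist_sq: squared Euclidean input distances
    # sq_norm_x, sq_norm_y: squared distances to the local centroids
    if squared:
        # equation (8): everything stays squared
        return dist_sq - sq_norm_x - sq_norm_y
    # Euclidean variant: subtract Euclidean distances from a Euclidean distance,
    # rather than taking the sqrt of the squared-space difference
    return np.sqrt(dist_sq) - np.sqrt(sq_norm_x) - np.sqrt(sq_norm_y)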

I meant the squared=False option in https://github.com/VarIr/scikit-hubness/blob/main/skhubness/reduction/tests/reference_algorithms.py#L379, which performs sqrt(a - b) but should probably perform sqrt(a) - sqrt(b), because

In the end, in general, sqrt(a - b) != sqrt(a) - sqrt(b).

VarIr commented 1 year ago

Interesting. It's been a while since I last read the paper and, unfortunately, I don't really have time to look into this in much detail. If the squared Euclidean distances are indeed just a matter of a tiny performance boost, I totally follow your argument. Also, we're on the same page as to whether the implemented formula can be interpreted as a "Euclidean" distance. As in: not really.

In this link, the default is also squared=True, isn't it?

In any case, have you tried to evaluate your implementation? Say if it reduces hubness better or leads to higher accuracy in some task?
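
For instance, comparing the skewness of the k-occurrence counts before and after would already be informative, roughly along these lines (a rough sketch, not tied to any particular API):

import numpy as np
from scipy.stats import skew

def k_occurrence_skewness(dist, k=10):
    # dist: square (n, n) distance matrix; higher skewness of the k-occurrence
    # counts indicates stronger hubness
    neigh = np.argsort(dist, axis=1)[:, 1:k + 1]  # skip self at column 0
    counts = np.bincount(neigh.ravel(), minlength=dist.shape[0])
    return skew(counts)

# e.g. compare k_occurrence_skewness(primary_dist) vs. k_occurrence_skewness(secondary_dist)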