gdkrmr / dimRed

A Framework for Dimensionality Reduction in R
https://www.guido-kraemer.com/software/dimred/
GNU General Public License v3.0

AUC_lnK_R_NX doesn't do what the documentation states it does #13

Closed ugroempi closed 6 years ago

ugroempi commented 6 years ago

Thanks for providing dimRed.

The documentation for AUC_lnK_R_NX is quite misleading, as you also seem to be aware of from the comment in your code. You currently use normalized inverse-position weights instead of the claimed logarithmic ones.

It would be good if the documentation were adapted to the code, or vice versa.

Best, Ulrike

gdkrmr commented 6 years ago

I used equation (17) from the publication below; I made a (brief) attempt to derive it but did not succeed.

Lee, J.A., Peluffo-Ordóñez, D.H., Verleysen, M., 2015. Multi-scale similarities in stochastic neighbour embedding: Reducing dimensionality while preserving both local and global structure. Neurocomputing, 169, 246–261. https://doi.org/10.1016/j.neucom.2014.12.095
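For readers without access to the paper: as far as I can tell, equation (17) defines the area under the R_NX curve as a weighted mean with weights proportional to 1/K, which matches the inverse-position weights in the code. A minimal R sketch under that assumption (`r_nx` and `auc_lnK_r_nx` are illustrative names, not the package's API):

```r
# Sketch of eq. (17) from Lee et al. (2015), as I understand it:
# a weighted mean of R_NX(K) with weights proportional to 1/K,
# so small neighbourhoods dominate the score.
# `r_nx` is assumed to be a numeric vector with r_nx[K] = R_NX(K).
auc_lnK_r_nx <- function(r_nx) {
  Ks <- seq_along(r_nx)
  sum(r_nx / Ks) / sum(1 / Ks)
}

# Sanity check: a flat R_NX curve yields its constant value as the AUC.
auc_lnK_r_nx(rep(0.5, 100))  # 0.5
```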

ugroempi commented 6 years ago

I don't have access to the 2015 paper you mention above.

The documentation references "Lee, J.A., Renard, E., Bernard, G., Dupont, P., Verleysen, M., 2013. Type 1 and 2 mixtures of Kullback-Leibler divergences as cost functions in dimensionality reduction based on similarity preservation. Neurocomputing. 112, 92-107. doi:10.1016/j.neucom.2012.12.036", which does not give a formula but shows a plot on the log scale; the function name seems to imply ln as well.

The code uses inverse weights instead, which makes perfect sense for emphasizing small neighborhoods over large ones, but does not seem to mimic the visual impression of a plot on the log scale. When trying to plot the difference, I found that the two versions are in fact extremely close. My feeling (like yours, judging from the comment in your code) would have been to use a normalized version of log(Ks+1)-log(Ks), or (for perfectionists trying to mimic the visual impression) log(Ks+0.5)-log(Ks-0.5).

To see what these do versus the inverse weights:

```r
Ks <- 1:100
lnws <- log(Ks + 0.5) - log(Ks - 0.5)
lnws <- lnws / sum(lnws)
invws <- 1 / Ks
invws <- invws / sum(invws)
plot(invws, type = "l", main = "ln based (blue) and inverse (black) weights",
     xlim = c(0, 10), ylim = c(0, 0.22), lwd = 2)
lines(lnws, col = "blue", lwd = 2)
```

To my surprise, the inverse weights put slightly less emphasis on the very small neighborhoods than the log-based weights (I would have had different expectations from BoxCox transformations; but then, the two concepts are applied quite differently here).
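To quantify "extremely close" numerically rather than visually (an illustrative check, not code from dimRed), one can look at the largest gap between the two normalized weight vectors; it occurs at K = 1, where the log-based weights are slightly larger:

```r
# Compare normalized inverse-position weights with normalized
# log(K + 0.5) - log(K - 0.5) weights for K = 1..100.
Ks <- 1:100
lnws <- log(Ks + 0.5) - log(Ks - 0.5)
lnws <- lnws / sum(lnws)
invws <- (1 / Ks) / sum(1 / Ks)

max(abs(lnws - invws))        # largest gap, roughly 0.014
which.max(abs(lnws - invws))  # 1: the gap is at the smallest neighbourhood
```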

Best, Ulrike

gdkrmr commented 6 years ago

Sorry, I am not sure whether the comment I just deleted was accurate; I will take a look at it.

gdkrmr commented 6 years ago

I hope that I clarified it a little bit. For the future, would it be better to keep the name for the sake of consistency with the publication, or should I rename the function?

ugroempi commented 6 years ago

How about having a transition period, with the old name still working but deprecated and a more suitable name already functional?

Best, Ulrike

gdkrmr commented 6 years ago

that's what I was thinking of when I said renaming ;-)