Curse of dimensionality with Nearest Neighbour approaches?

ejohnson643 / EMBEDR

Statistical quality evaluation of dimensionality reduction algorithms

29 stars 2 forks source link

Curse of dimensionality with Nearest Neighbour approaches? #13

Open mr-september opened 2 years ago

mr-september commented 2 years ago

Hi Authors,

Sorry this is less of a technical issue than a conceptual question.

You have alluded to the curse of dimensionality in the original paper, one major component of which is the dilution of "distances" (e.g. On the Surprising Behavior of Distance Metric in High-Dimensional Space, or a more accessible summary here).

Step 1 of EMBEDR relies on calculating NNs, the code appears to rely on default methods in numpy e.g. "euclidean", "l2", "sqeuclidean", ..., "sokalsneath", "yule"].

Do you have recommendations for these distance metrics to mitigate the "curse"? Or are there other parts of the algorithm to help with that?

ejohnson643 commented 2 years ago

Hello!

Thanks for your question! Yes, the curse of dimensionality definitely impacts the accuracy of NN identification, however, the use of "fuzzy" similarities in t-SNE and UMAP helps to circumvent this problem. EMBEDR uses these same similarities to assess the quality of the embeddings. Furthermore, EMBEDR repeats the embedding process several times, which ideally will average out some of these NN errors.

More concretely, in testing we did not find qualitative differences in the output of the algorithm when the metric used to find NNs was changed. Depending on the data you are analyzing, using different metrics may be more appropriate (the Jaccard distance for analyzing documents, for example), but in general, using the same metric as the dimensionality reduction method of choice is the best option.

mr-september commented 2 years ago

Thank you for the great response! Sorry to be a further bother, but do you have any recommended readings that discuss how t-SNE/UMAP/other DR tools help to mitigate the curse of dimensionality on NN identification? Or perhaps any brief comparisons using the data shared in this study? Sorry I know this would be quite a bit of work, and completely understand if the lab has other priorities.