YingfanWang / PaCMAP

PaCMAP: Large-scale Dimension Reduction Technique Preserving Both Global and Local Structure
Apache License 2.0
521 stars 53 forks source link

Enhancement requests. Intermediate manifold intersection and custom distance metrics #18

Open rhysnewell opened 2 years ago

rhysnewell commented 2 years ago

Hi there!

I currently use UMAP a lot in my projects and am aware of some its limitations so it's great to see new algorithms popping up that aim to fix some of the shortcomings in the field. UMAP has been around for awhile now and as such the community has made some pretty cool modifications and enhancements to actual algorithm. One of the best features, I think, is the ability to perform intersections and unions on the intermediate fuzzy representations: https://umap-learn.readthedocs.io/en/latest/composing_models.html This lets the user generate a unified UMAP representation using multiple different distance metrics with ease. I believe that PaCMAP also has an intermediate manifold stage that might make it possible to implement a similar feature, but perhaps I'm wrong on this. Either way, I just wanted to bring it to your attention and perhaps get your thoughts on it.

The other request is something I'm sure is likely to come in future, but that is the ability to easily supply custom distance metrics to PaCMAP much like how the python UMAP handles this: https://umap-learn.readthedocs.io/en/latest/parameters.html?highlight=metric#metric

I realise that these likely non-trivial requests but I hearing what you think about them would be appreciated.

Cheers, Rhys

hyhuang00 commented 2 years ago

Hi Rhys! Thank you for your interest in PaCMAP and we appreciate your comments on enhancement.

Regarding the second request, we do have plans to extend PaCMAP towards more custom distance metrics. The current implementation rely on the L2 distance to select the mid-near pairs, so modifying the PaCMAP will require much time and effort. If there's a particular metric that you are interested in, and maybe we can prioritize that.

Regarding the first request, I think it's cool and very interesting. Are you aware of any use case that may require this feature?

rhysnewell commented 2 years ago

That's great to hear!

Well, I use the intersection method all the time in my tools Lorikeet and Rosella and their submodule flight to help some clustering stuff. But I'm not sure of other use cases

hyhuang00 commented 2 years ago

L1 distance, angular distance and Hamming distances are now supported in the newly released v0.5.4. Is there any other distance metric you would like to see to be implemented in PaCMAP?

ivan-marroquin commented 2 years ago

Hi @hyhuang00

Thanks for adding new metrics to PaCMAP! Could you include Minkowski with p in [0,1]. According to recent publications on curse of dimensionality, metrics like Minkowski with p=0.3 are more suitable compared to Euclidean distance.

Ivan

antoineouellet commented 2 years ago

our group uses the canberra metric quite a bit, and have found it to be the main reason we don't migrate to PaCMAP, even though we enjoy it very much. I tried adding the definition of the metric to the module until I realized reliance on AnnoyIndex keeps me from easily doing so. So here's a next metric suggestion!

hyhuang00 commented 2 years ago

Thank you all @ivan-marroquin @antoine-gs for your suggestions. Distance metrics such as general Minkowski distances and Canberra metric are not currently supported by the Annoy Index, which is quite frustrating. One possible workaround I could think of is to precompute the distance matrix and follow the custom distance matrix pair extraction method mentioned in README. I will try to find some alternative packages that can work on these metrics.

hyhuang00 commented 2 years ago

Seems like similar packages such as openTSNE are using the nearest neighbor trees from scikit-learn for custom distance metrics. We can also adopt that approach, but I think it is slower than the ANNOY approach.

antoineouellet commented 2 years ago

I am very early in the process of checking this out, but seems like UMAP is moving towards pynndescent which supports mostly same distances a sklearn. I have never used either (annoy and pynndescent) so I don't know how they compare or even if it's even a possible/sensible avenue. Will keep an eye out, thank you!