cvxgrp / pymde

Minimum-distortion embedding with PyTorch
https://pymde.org
Apache License 2.0
537 stars 27 forks source link

Starting MDE from distances or affinities? #27

Closed davisidarta closed 3 years ago

davisidarta commented 3 years ago

Hi @akshayka ! Thank you for your beautiful work and this wonderful library.

I wonder if it is possible to use pyMDE with distances or affinities matrices as input. I'm working on a topological data analysis library and would very much like to include some pyMDE wrappers for graph layout and visualization.

For example, would it be possible to start the pymde with only pairwise distances and/or k-nearest-neighbor distances or affinities? How much change on the current code would be needed to do so? Do you think it would be better to try to implement this on a separate library with calls to pyMDE internals?

Thanks! :)

akshayka commented 3 years ago

Thank you for your beautiful work and this wonderful library.

You're most welcome! Thanks for your interest.

I wonder if it is possible to use pyMDE with distances or affinities matrices as input.

Yes, you can do this with PyMDE.

For example, would it be possible to start the pymde with only pairwise distances and/or k-nearest-neighbor distances or affinities?

Yes, definitely. You'd construct your own MDE object, based on the distances / affinities.

How much change on the current code would be needed to do so?

None. The MDE class is part of the public API (https://pymde.org/mde/index.html, https://pymde.org/api/index.html#mde)

This notebook, starting with "Embeddings, from scratch", shows how to make an embedding based on preserving a KNN graph, given affinities between pairs of items: https://github.com/cvxgrp/pymde/blob/main/examples/mnist.ipynb

This one has some examples of preserving distances: https://github.com/cvxgrp/pymde/blob/main/examples/drawing_graphs.ipynb

Hopefully that helps you get started. Happy to help you along the way.

davisidarta commented 3 years ago

Thank you so much for you warm and kind answer! I spent these last days exploring pyMDE further on. It is even more flexible, complex and rich than I had initially imagined, and I'm thinking about dropping other layout algorithms altogether in favor of pyMDE.

None. The MDE class is part of the public API (https://pymde.org/mde/index.html, https://pymde.org/api/index.html#mde) This notebook, starting with "Embeddings, from scratch", shows how to make an embedding based on preserving a KNN graph, given affinities between pairs of items: https://github.com/cvxgrp/pymde/blob/main/examples/mnist.ipynb

From what I get from your code (it is rich and complex, so forgive me if I misunderstood), I need a tensor of edges to initialize the MDE problem. I confess I'm unused to torch - I have square affinity matrices, either as np.ndarrays or csr matrices, and I'm unsure on how to convert these to the needed edges tensor without corrupting the data.

I understand the pymde.Graph class could be initialized from these affinities, but I failed to understand precisely how - I could only find a method intializing it from a tensor of edges, similarly to the MDE problem.

I apologize for my lack of experience with PyTorch and unability to better handle this on my own. Would you be so kind as to guide me through the best way to adapt mine affinity matrices into an MDE problem?

Particularly, I'm interested on porting TopOMetry metrics into pyMDE, more specifically for the analysis of single-cell data phenotypic topology (TopOMetry is the backbone of a broader toolkit I'm working on). I've been able to generate incredible visualizations of single-cell data with TopOMetry reductions and pyMDE layouts (pyMDE works fantastically well, and I'm only starting exploring different combinations of penalties, constraints and losses).

However, I've found a caveat to this workflow: building the kNN graph within pyMDE is somewhat slower than TopOMetry's similarity learning (mostly because HNSW tends to be faster than pyNNDescent), which can be very significative for the huge datasets involved in single-cell analyses. I'd like to circunvent this by starting the MDE problem with a given affinity matrix obtained with TopOMetry which is nearly trivial to compute.

Thanks again for making such a great tool available and for your willingness to help!

akshayka commented 3 years ago

@davisidarta,

Thanks for the kind words!

I've been able to generate incredible visualizations of single-cell data with TopOMetry reductions and pyMDE layouts (pyMDE works fantastically well, and I'm only starting exploring different combinations of penalties, constraints and losses).

That's so awesome to hear! Whenever you're ready to share, I'd love to learn more about what you've been working on, and the kinds of embeddings you've made.

This is the first I've heard of TopOMetry. It sounds interesting! I will check it out.

(Regarding HNSW: I chose not to use it because (at least when I started development) there was no easy way to include it in a Python package (i.e., there weren't any Python wheels), and I didn't want to maintain a workflow for compiling it.)

I need a tensor of edges to initialize the MDE problem

Yes, that's right.

There are a couple ways to convert an affinity matrix into the format required by PyMDE

Like you mentioned, one way is to use the Graph class. You pass in the affinities to its constructor. Then you can obtain the edges and weights from its accessor methods/properties.

I've put together a simple example with synthetic data here:

https://colab.research.google.com/drive/1_ISn_d2f7tW9rVYUVUCinOpwjIlK30lv?usp=sharing

This example assumes that the affinity matrix is symmetric. Let me know if that assumption is not reasonable for you, or if I've misunderstood what you're trying to do.

Hope that helps! Excited to see what you end up building on top of PyMDE.

ahwillia commented 3 years ago

I had the same question, so figured I would post here. Sorry if I'm being dense or missing this in the documentation somewhere, but how do you switch between using affinities and dissimilarities? If I understand your linked colab notebook correctly, the edge weights are interpreted as affinities.

Right now I've been doing the following (inputing dissimilarities, which happen to also satisfy the triangle inequality):

from sklearn.manifold import MDS
mds = MDS(n_components=2, dissimilarity="precomputed")
mds.fit(precomputed_distances)

What would the equivalent be in PyMDE?

(+1 to all the OP's compliments about this package and manuscript, I'm very excited to dig into it more.)

akshayka commented 3 years ago

@ahwillia, here is the code: https://colab.research.google.com/drive/1Szrw-Mt_yM0kCKe-Jj0S2PbJu-NgobfO?usp=sharing

You'll find that it's very similar to the previous notebook.

I would also encourage you to take a look at the following sections in the documentation:

https://pymde.org/getting_started/index.html#preserving-distances

https://pymde.org/mde/index.html#losses

https://pymde.org/api/index.html#preserve-distances

Hope that helps, thanks for your interest!

EDIT: If the number of items is small, your MDS code will be equivalent to

import pymde

mde = pymde.preserve_distances(precomputed_distances)
embedding = mde.embed()

Here, precomputed_distances should be a SciPy sparse matrix or a pymde.Graph instance.

When the number of items is large, the preserve_distances function will randomly subsample your distance matrix.

You can set up arbitrary MDS problems by using the MDE class directly, as shown in the linked-to notebook.

ahwillia commented 3 years ago

Ok, so the key is just to replace pydme.penalties.Quadratic (for affinities) with pydme.losses.Quadratic (for distances). Got it.

It looks like the preserve_distances helper function is even easier though, I think I confused myself by digging into the more complex API before checking that one out first.

Do I understand this correctly -- if preserve_distances receives a numpy array it assumes it is a data matrix (n_obs, n_features), but if it is a sparse matrix it interprets it as being precomputed? (Looks like this would be the key line.)

Small suggestion, I suppose: it would be nice if preserve_distances could interpret a sparse data matrix as it is (not a precomputed distance matrix). I can cast my dense distance matrix into a sparse matrix before calling it. But perhaps the sklearn API could be mirrored with the default being dissimilarity="euclidean" and allow dissimilarity="precomputed" for convienence.

Thanks again for putting this together -- I can already tell it's going to be very useful for me.

davisidarta commented 3 years ago

Thank you, @akshayka ! This worked perfectly and now my code runs significantly faster!

I've basically initialized a pymde.Graph with the affinity matrix and used it to build custom pyMDE problems that are almost identical to your recipes. I've added an internal wrapper to TopOMetry here, and employ it for laying out topological graphs here.

It is indeed really straighfoward, as shown by the notebooks you very kindly shared. Thank you, once again.

That's so awesome to hear! Whenever you're ready to share, I'd love to learn more about what you've been working on, and the kinds of embeddings you've made.

I still have a huge deal to polish, but you can check how this worked out with TopOMetry and pyMDE here, if you're interested: https://github.com/davisidarta/topometry/blob/master/docs/TopOMetry_Intro_pbmc_10k.md

Things get really interesting when dealing with around >50k samples. Was able to find out ~80 new T lymphocytes populations with TopOMetry and pyMDE in an old public dataset everyone have already exhaustively explored with other methods. Interestingly, these populations are completely hidden if PCA is used for preprocessing, or if no preprocessing is used at all, stressing that the topological base and metrics are needed in such extreme non-linear cases. I'll start drafting a paper about this next week, so just let me know at davisidarta(at)gmail if you're minimally interested in joinining - I would be really really honored, honestly, as pyMDE was a game-changer.

Attached there is an insanely cool GIF made with a TopOMetry diffusion basis and second-order diffusion harmonics, laid out with MDE (which is what enables the fantastic GIF output, of course). pbmc68k_dbMDE.zip

Best wishes! Davi

P.S.: My issue is completely solved, but I did not close it with this comment in respect to @ahwillia. Please feel free to do so when you deem appropriate.

akshayka commented 3 years ago

@davisidarta, I'm elated to hear that PyMDE has been useful for you. Thanks for being an early adopter!

Things get really interesting when dealing with around >50k samples. Was able to find out ~80 new T lymphocytes populations with TopOMetry and pyMDE in an old public dataset everyone have already exhaustively explored with other methods.

That's awesome!!

I'll start drafting a paper about this next week, so just let me know at davisidarta(at)gmail if you're minimally interested in joinining - I would be really really honored, honestly, as pyMDE was a game-changer.

I've sent you an email. Happy to chat!

akshayka commented 3 years ago

Do I understand this correctly -- if preserve_distances receives a numpy array it assumes it is a data matrix (n_obs, n_features), but if it is a sparse matrix it interprets it as being precomputed? (Looks like this would be the key line.)

Yes, you're right.

Small suggestion, I suppose: it would be nice if preserve_distances could interpret a sparse data matrix as it is (not a precomputed distance matrix). I can cast my dense distance matrix into a sparse matrix before calling it. But perhaps the sklearn API could be mirrored with the default being dissimilarity="euclidean" and allow dissimilarity="precomputed" for convienence.

Thanks for the suggestion! I'll definitely consider it. I agree the preserve_distances API could be improved, so I very much appreciate the feedback.

Thanks again for putting this together -- I can already tell it's going to be very useful for me.

You're most welcome!