jolespin opened this issue 6 months ago
Hi,
Sorry, forgot to respond to this. The nontrivial eigenvalues are negative because we use the convention of looking at the eigenvalues of the transition rate matrix. This matrix has the advantage that it converges to a fixed limit as we decrease epsilon. The dominant eigenfunction, which should have an eigenvalue of 0, is then simply a vector of all ones: this corresponds to the fact that the rows of a transition rate matrix sum to zero. As such, it doesn't tell you anything meaningful about the data.
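For concreteness, here is a minimal NumPy sketch of that convention (not PyDiffMap's internal code; the bandwidth value is just a placeholder): the rows of the rate matrix sum to zero, the dominant eigenvalue is ~0 with an all-ones eigenvector, and the remaining eigenvalues are negative.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))            # toy data
epsilon = 0.5                            # kernel bandwidth (placeholder value)

# Gaussian kernel and its row-normalized (Markov) transition matrix P
d2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=-1)
K = np.exp(-d2 / epsilon)
P = K / K.sum(axis=1, keepdims=True)

# Transition rate matrix: rows sum to zero by construction
L = (P - np.eye(len(X))) / epsilon
print(np.allclose(L.sum(axis=1), 0.0))   # True

# Dominant eigenvalue ~0 (eigenvector proportional to all ones);
# the nontrivial eigenvalues are negative.
evals = np.linalg.eigvals(L)
print(np.sort(evals.real)[::-1][:4])     # ~0 followed by negative values
```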
Let me know if that helps :-)
This does help a bit, thank you! Does PyDiffMap drop this uninformative dimension in the backend, or should I do this post hoc?
By default, we drop the dominant eigenfunction here. If, after this drop, you still have another eigenvalue that is also 0, that suggests your transition rate matrix is effectively disconnected: you have two large clusters in your data. You can verify this by checking whether the resulting coordinate is a constant function or not. For the plot you provided, this suggests that you do indeed have highly separated clusters.
If you don't want disjointed clusters, you can consider increasing the bandwidth.
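As a rough illustration of the check described above (this assumes pydiffmap's documented `from_sklearn` constructor and its `evals` attribute; the parameter values are only placeholders), here is one way to see the effect on deliberately disconnected toy data:

```python
import numpy as np
from pydiffmap import diffusion_map as dm

# Toy data: two well-separated blobs, i.e. an effectively disconnected graph
rng = np.random.default_rng(0)
blob_a = rng.normal(loc=0.0, scale=0.1, size=(100, 3))
blob_b = rng.normal(loc=10.0, scale=0.1, size=(100, 3))
X = np.vstack([blob_a, blob_b])

mydmap = dm.DiffusionMap.from_sklearn(n_evecs=2, epsilon='bgh', alpha=0.5, k=32)
coords = mydmap.fit_transform(X)

# After the trivial eigenfunction is dropped, another eigenvalue near 0
# signals an effectively disconnected transition rate matrix (two clusters).
print(mydmap.evals)

# The corresponding coordinate is then (nearly) piecewise constant:
# essentially two flat levels, one per cluster.
print(coords[:5, 0], coords[-5:, 0])
```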
Excellent! This is very helpful. For my actual dataset, I'm dealing with boolean features, so I'm using Jaccard distance and will need to experiment with the settings.
I plan on trying this on a very large genomics dataset but was running into some performance issues. While looking into how other packages address this, I noticed that scanpy supports approximate nearest neighbor libraries (e.g., Spotify's Annoy, Facebook's FAISS). Is there any interest in supporting these? I can also open a new thread for that discussion.
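For illustration only, here is a minimal sketch of the kind of approximate kNN lookup Annoy provides (this is not something PyDiffMap currently does, and the data, metric, and tree count are placeholders):

```python
import numpy as np
from annoy import AnnoyIndex  # assumes the `annoy` package is installed

rng = np.random.default_rng(0)
X = rng.random((10_000, 50))                 # toy high-dimensional data

index = AnnoyIndex(X.shape[1], 'euclidean')  # 'hamming' may suit boolean features
for i, row in enumerate(X):
    index.add_item(i, row.tolist())
index.build(10)                              # more trees -> better accuracy, slower build

# Approximate 64-nearest-neighbor query for one point
neighbors, distances = index.get_nns_by_item(0, 64, include_distances=True)
print(neighbors[:5], distances[:5])
```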
Diffusion maps are being used in scRNA-seq and microbial ecology. I would love to use this package in my research (with proper citation, of course). I was originally going to use https://scanpy.readthedocs.io/en/stable/generated/scanpy.tl.diffmap.html, but I really like the sklearn API you developed, in particular the `.fit`, `.fit_transform`, and `.transform` methods, which are essential for what I'm trying to do (especially `.transform`). Ideally, I'd like to be able to build a model and then pickle it to use later.
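To be concrete about the workflow I have in mind, here is a hedged sketch (it uses pydiffmap's documented `from_sklearn` constructor with placeholder settings and assumes the fitted object is picklable):

```python
import pickle
import numpy as np
from pydiffmap import diffusion_map as dm

rng = np.random.default_rng(0)
X_train = rng.random((500, 20))          # stand-in for training samples
X_new = rng.random((50, 20))             # samples arriving later

mydmap = dm.DiffusionMap.from_sklearn(n_evecs=2, epsilon='bgh', alpha=0.5, k=64)
mydmap.fit(X_train)

with open("dmap_model.pkl", "wb") as fh:
    pickle.dump(mydmap, fh)

with open("dmap_model.pkl", "rb") as fh:
    loaded = pickle.load(fh)

# Out-of-sample embedding of the new points with the stored model
new_coords = loaded.transform(X_new)
print(new_coords.shape)
```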
If you don't have time to incorporate new features, I'm definitely willing to give this implementation a shot! If you have any insight into which code I'd need to adapt, that would be greatly appreciated.
This might be a poor exercise, but I'm trying to understand the methods of a paper and whether it makes sense to adapt my linear PCA-based workflow to non-linear manifold methods; I thought trying out diffusion maps would be worth a shot.
I'm trying to understand how to interpret the results from a diffusion map. The iris dataset is definitely not the best toy dataset, but I thought I would still be able to see some relationships.
I have a few questions:
Apologies if these questions are naive; I'm coming from microbial ecology and trying to understand the methods of a paper that I did not write.
Here's my code:
In this example, I see that they exclude the first embedding: https://www.linkedin.com/pulse/diffusion-maps-unveiling-geometry-high-dimensional-data-yeshwanth-n-qrsfc/
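For reference, a minimal sketch along these lines (not the exact snippet, just an illustrative iris + pydiffmap run using the documented `from_sklearn` constructor with placeholder settings):

```python
import numpy as np
from sklearn.datasets import load_iris
from pydiffmap import diffusion_map as dm

iris = load_iris()
X, y = iris.data, iris.target

mydmap = dm.DiffusionMap.from_sklearn(n_evecs=2, epsilon='bgh', alpha=0.5, k=64)
coords = mydmap.fit_transform(X)

# PyDiffMap already drops the trivial (all-ones) eigenfunction, so coords[:, 0]
# is the first nontrivial diffusion coordinate; inspect how it varies by species.
for label in np.unique(y):
    print(iris.target_names[label], coords[y == label, 0].mean())
```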