Closed dariotommasini closed 6 months ago
Hi Dario. I believe you can do sm.samap.leiden_clustering()
and then sm.samap.adata.obs
would contain the leiden clustering results.
sm
is the SAMAP object
samap
is a SAM object containing the stitched data
leiden_clustering()
computes leiden clustering on the SAM object.
sm.samap.adata
is the AnnData containing the stitched data in the SAM object
Let me know if that doesn't work for you.
Thanks, @atarashansky !
That worked for me. Just to clarify, I thought sm.samap.adata was the block-diagonal matrix with each species gene expression (in other words unstitched). Does that mean the stitched space is in the PCA layer of the adata object?
Yeah, it's a bit confusing. The stitched space is in the connectivities. The "stitched" PCA layer is an intermediate object used for generating the aligned graph by SAMAP internally using its fancy algos (lol). sm.samap.adata.obsp['connectivities']
is the primary output of SAMAP.
The (probably overly verbose) tutorial I wrote ages ago has some notes about the combined SAM object:
I see. So UMAP is being run on sm.samap.adata.obsp['connectivities']?
Right - we're using the part of UMAP that embeds graphs in 2D.
Great, thanks for the help!
Hi Alex,
Sorry to keep bothering you. Could you clarify this line in your methods?
"SAM outputs N1 and N2, which are directed adjacency matrices that encode k-nearest neighbor graphs for the two datasets, respectively." What are the two datasets in this case? I thought SAM was run on each dataset separately and is analogous to doing a standard clustering analysis like in seurat or scanpy. Why are there two adjacency matrices?
Oh, sorry for the confusion. SAM was run on each dataset separately, each run of which outputs an adjacency matrix. So akin to running Seurat/Scanpy on each dataset separately. There are two adjacency matrices because there are two datasets.
I see. Each SAM run returns one adjacency matrix, but you were talking about the case where we are aligning species i and species j, and SAM is run on each species so we get one adjacency matrix per species.
Is there a way to run SAMap until some stopping criteria is met? The three iterations work well in practice but I imagine it depends on the dataset too.
I haven't really evaluated how the number of iterations affects the performance. I would recommend just playing with the number of iterations - in my hands, it doesn't change much beyond a certain number of iterations.
Yes, that's what I've seen as well. Could you describe how the methods from the paper would be extended to the case of comparing multiple species rather than just two?
The Jupyter notebook tutorial in the repo shows an example of comparing three species. Take a look at that!
Thanks, Alex. I've done that vignette and it's very helpful. I understand how to perform the analysis with multiple species, but I am unsure of how the mathematical details described in the Elife paper methods is generalized to multiple species.
Right - a lot of the mathematical details in the elife paper can be generalized to work with an arbitrary number of species because they can be written as matrix multiplications. So the work to expand to multiple species required refactoring a lot of the operations to be more formal linear algebra. It's a bit too much work for me atm to describe all the equivalencies and generalizations, but I did confirm that the multi-species version of SAMAP is equivalent in output to the original version for mapping two species.
When you're mapping multiple species, though, there are some other considerations that were not relevant for the two-species mapping task. For example, when choosing the initial k-nearest neighbors for a cell, do you choose from the combined pool of cells in the other two species? Or do you choose k neighbors from one species and k neighbors from the other? it depends on what you're trying to accomplish. If the goal is to find "homologous" cell types in both species simultaneously, then the latter is better. If you would rather have the mapping reflect evolutionary distance and are comfortable with two species preferentially mapping to each other, leaving the third species disconnected, for some cell types, then it's better to choose neighbors from the combined pool of cells. This is controlled via the pairwise
argument in SAMAP.
Sorry I can't be more helpful in this! If you have any specific questions about particular steps in the algorithm, I can dig into more detail.
No worries, this is helpful. The more I begin to understand the Elife methods, I'll start understanding how it works for multiple species. Thanks for all your help!
Hi @atarashansky ,
Can you confirm for me that the refined homology graph in sm.gnnm_refined is G from the methods? In other words, I would need to row-normalize the upper triangle of sm.gnnm_refined in order to generate the H matrices from the paper?
Also, what are indices a and b in these equations?
Thanks!
In the paper, the bipartite homology graph (original and refined) is G. G is an 2x2 block matrix with H
(the biadjacency matrix) on the off-diagonal blocks.
You shouldn't need to row-normalize anything as the biadjacency matrices have already been normalized.
ab
in the attached equation refers to the row,col indices of G.
For the SumNorm as well?
It doesn't seem like either the rows or columns sum to 1 in the H matrix directly pulled from G.
Oh, the sumnorm is a preprocessing step used to perform the feature translation from one species' column space to the other's. In this context, H is an intermediate mathematical object so you can ignore it.
I see. Thanks, Alex!
Could you explain the intuition behind row wise sumnorm? It seems to me that a weighted average would require column wise sumnorm for Hi. If you consider the simplest case:
I must be making an error somewhere.
Hi @atarashansky ,
Is there a way to compute leiden clusters in the stitched space? I see in the tutorial we can use leiden clusters to generate clusters within each species. I'm wondering if it's possible to cluster in the stitched manifold instead. Since we compute umap on the stitched manifold, I think we should be able to cluster the cells in the stitched manifold as well. Any ideas on how I could do this?
Thank you!