atarashansky / SAMap

SAMap: Mapping single-cell RNA sequencing datasets from evolutionarily distant organisms.
MIT License
63 stars 19 forks source link

Clustering in stitched space #141

Closed dariotommasini closed 6 months ago

dariotommasini commented 6 months ago

Hi @atarashansky ,

Is there a way to compute leiden clusters in the stitched space? I see in the tutorial we can use leiden clusters to generate clusters within each species. I'm wondering if it's possible to cluster in the stitched manifold instead. Since we compute umap on the stitched manifold, I think we should be able to cluster the cells in the stitched manifold as well. Any ideas on how I could do this?

Thank you!

atarashansky commented 6 months ago

Hi Dario. I believe you can do sm.samap.leiden_clustering() and then sm.samap.adata.obs would contain the leiden clustering results.

sm is the SAMAP object samap is a SAM object containing the stitched data leiden_clustering() computes leiden clustering on the SAM object. sm.samap.adata is the AnnData containing the stitched data in the SAM object

Let me know if that doesn't work for you.

dariotommasini commented 6 months ago

Thanks, @atarashansky !

That worked for me. Just to clarify, I thought sm.samap.adata was the block-diagonal matrix with each species gene expression (in other words unstitched). Does that mean the stitched space is in the PCA layer of the adata object?

atarashansky commented 6 months ago

Yeah, it's a bit confusing. The stitched space is in the connectivities. The "stitched" PCA layer is an intermediate object used for generating the aligned graph by SAMAP internally using its fancy algos (lol). sm.samap.adata.obsp['connectivities'] is the primary output of SAMAP.

atarashansky commented 6 months ago

The (probably overly verbose) tutorial I wrote ages ago has some notes about the combined SAM object:

image
dariotommasini commented 6 months ago

I see. So UMAP is being run on sm.samap.adata.obsp['connectivities']?

atarashansky commented 6 months ago

Right - we're using the part of UMAP that embeds graphs in 2D.

dariotommasini commented 6 months ago

Great, thanks for the help!

dariotommasini commented 5 months ago

Hi Alex,

Sorry to keep bothering you. Could you clarify this line in your methods?

"SAM outputs N1 and N2, which are directed adjacency matrices that encode k-nearest neighbor graphs for the two datasets, respectively." What are the two datasets in this case? I thought SAM was run on each dataset separately and is analogous to doing a standard clustering analysis like in seurat or scanpy. Why are there two adjacency matrices?

atarashansky commented 5 months ago

Oh, sorry for the confusion. SAM was run on each dataset separately, each run of which outputs an adjacency matrix. So akin to running Seurat/Scanpy on each dataset separately. There are two adjacency matrices because there are two datasets.

dariotommasini commented 5 months ago

I see. Each SAM run returns one adjacency matrix, but you were talking about the case where we are aligning species i and species j, and SAM is run on each species so we get one adjacency matrix per species.

dariotommasini commented 5 months ago

Is there a way to run SAMap until some stopping criteria is met? The three iterations work well in practice but I imagine it depends on the dataset too.

atarashansky commented 5 months ago

I haven't really evaluated how the number of iterations affects the performance. I would recommend just playing with the number of iterations - in my hands, it doesn't change much beyond a certain number of iterations.

dariotommasini commented 5 months ago

Yes, that's what I've seen as well. Could you describe how the methods from the paper would be extended to the case of comparing multiple species rather than just two?

atarashansky commented 5 months ago

The Jupyter notebook tutorial in the repo shows an example of comparing three species. Take a look at that!

dariotommasini commented 5 months ago

Thanks, Alex. I've done that vignette and it's very helpful. I understand how to perform the analysis with multiple species, but I am unsure of how the mathematical details described in the Elife paper methods is generalized to multiple species.

atarashansky commented 5 months ago

Right - a lot of the mathematical details in the elife paper can be generalized to work with an arbitrary number of species because they can be written as matrix multiplications. So the work to expand to multiple species required refactoring a lot of the operations to be more formal linear algebra. It's a bit too much work for me atm to describe all the equivalencies and generalizations, but I did confirm that the multi-species version of SAMAP is equivalent in output to the original version for mapping two species.

When you're mapping multiple species, though, there are some other considerations that were not relevant for the two-species mapping task. For example, when choosing the initial k-nearest neighbors for a cell, do you choose from the combined pool of cells in the other two species? Or do you choose k neighbors from one species and k neighbors from the other? it depends on what you're trying to accomplish. If the goal is to find "homologous" cell types in both species simultaneously, then the latter is better. If you would rather have the mapping reflect evolutionary distance and are comfortable with two species preferentially mapping to each other, leaving the third species disconnected, for some cell types, then it's better to choose neighbors from the combined pool of cells. This is controlled via the pairwise argument in SAMAP.

Sorry I can't be more helpful in this! If you have any specific questions about particular steps in the algorithm, I can dig into more detail.

dariotommasini commented 5 months ago

No worries, this is helpful. The more I begin to understand the Elife methods, I'll start understanding how it works for multiple species. Thanks for all your help!

dariotommasini commented 5 months ago

Hi @atarashansky ,

Can you confirm for me that the refined homology graph in sm.gnnm_refined is G from the methods? In other words, I would need to row-normalize the upper triangle of sm.gnnm_refined in order to generate the H matrices from the paper?

Also, what are indices a and b in these equations?

Screenshot 2024-02-01 at 11 40 33 AM Screenshot 2024-02-01 at 11 40 50 AM

Thanks!

atarashansky commented 5 months ago

In the paper, the bipartite homology graph (original and refined) is G. G is an 2x2 block matrix with H (the biadjacency matrix) on the off-diagonal blocks.

You shouldn't need to row-normalize anything as the biadjacency matrices have already been normalized.

ab in the attached equation refers to the row,col indices of G.

dariotommasini commented 5 months ago

For the SumNorm as well?

Screenshot 2024-02-01 at 1 31 52 PM

It doesn't seem like either the rows or columns sum to 1 in the H matrix directly pulled from G.

atarashansky commented 5 months ago

Oh, the sumnorm is a preprocessing step used to perform the feature translation from one species' column space to the other's. In this context, H is an intermediate mathematical object so you can ignore it.

dariotommasini commented 5 months ago

I see. Thanks, Alex!

Could you explain the intuition behind row wise sumnorm? It seems to me that a weighted average would require column wise sumnorm for Hi. If you consider the simplest case:

image

I must be making an error somewhere.