brianhie / scanorama

Panoramic stitching of single cell data
http://scanorama.csail.mit.edu
MIT License

log-normalized or raw gene expression counts as input to Scanorama #112

Closed antonioggsousa closed 2 years ago

antonioggsousa commented 2 years ago

Dear @brianhie,

Thank you and your colleagues for developing scanorama! I'm testing it through a few "dummy" examples and I'm delighted with the results.

I read the paper as well as one of the tutorials mentioned in the github README.md file.

To test scanorama, I ran it on a few toy data sets in addition to one example data set highlighted in the scanorama repository. When I started with the toy data sets, I provided scaled counts to scanorama by mistake, because I am not yet very familiar with scanpy, anndata, and Python in general. I therefore checked the paper and the tutorial again to find out which input scanorama requires. The tutorial mentions log-normalized gene expression counts at some point, whereas the paper mentions that L2 normalization is performed internally. If I understood correctly, this aims to bring the cells to the same scale, i.e., to unit norm, so its application does not necessarily depend on previous normalization. My question is then: which should ideally be the input to scanorama, log-normalized or raw counts?
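To make the unit-norm point concrete, here is a minimal NumPy sketch of row-wise L2 normalization. It is not Scanorama's internal code; the function name `l2_normalize_rows` and the toy matrix are hypothetical and only illustrate the idea.

```python
import numpy as np

def l2_normalize_rows(X):
    """Scale each row (cell) of X to unit Euclidean norm."""
    X = np.asarray(X, dtype=float)
    norms = np.linalg.norm(X, axis=1, keepdims=True)
    norms[norms == 0] = 1.0  # leave all-zero rows unchanged
    return X / norms

# Two hypothetical cells with the same expression profile but a
# tenfold difference in sequencing depth.
counts = np.array([[1.0, 2.0, 2.0],
                   [10.0, 20.0, 20.0]])
unit = l2_normalize_rows(counts)
```

In this toy case both cells map to the same unit vector, so a pure depth difference disappears. A log transform is non-linear, though, so L2-normalizing raw counts and L2-normalizing log-normalized counts generally give different vectors, which is why the choice of input can still matter.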

Regarding the tests that I've performed, the results obtained with raw counts seem slightly better than the ones obtained with log-normalized counts.

Another small question I have concerns the integration result that scanorama provides, i.e., X_scanorama. Based on the tutorial mentioned above and the paper, my understanding is that this low-dimensional embedding is intended to be used for UMAP/t-SNE estimation and visualization (among other downstream tasks). For instance, in the tutorial they calculate a neighborhood graph and UMAP with this result:

# tsne and umap
sc.pp.neighbors(adata, n_pcs=50, use_rep="Scanorama")
sc.tl.umap(adata)
sc.tl.tsne(adata, n_pcs=50, use_rep="Scanorama")

If X_scanorama is a low-dimensional embedding, should we plot it directly?

Thank you and sorry for the off-topic question!

Best regards,

António

brianhie commented 2 years ago

Hi @antonioggsousa, this analysis: https://www.nature.com/articles/s41592-021-01336-8 reports that Scanorama works best with log normalization and scaling (they use Scanpy).
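For reference, "log normalization and scaling" can be sketched in plain NumPy as below. This is only an approximation of the usual Scanpy preprocessing (`sc.pp.normalize_total`, `sc.pp.log1p`, `sc.pp.scale`), not the exact pipeline used in the linked analysis; the function name, the `target_sum` value, and the toy counts are assumptions for illustration.

```python
import numpy as np

def log_normalize_and_scale(counts, target_sum=1e4):
    """Depth-normalize, log1p-transform, then z-score each gene (column)."""
    counts = np.asarray(counts, dtype=float)

    # 1. Library-size normalization: every cell sums to target_sum.
    depth = counts.sum(axis=1, keepdims=True)
    depth[depth == 0] = 1.0  # avoid dividing by zero for empty cells
    normalized = counts / depth * target_sum

    # 2. Log transform.
    logged = np.log1p(normalized)

    # 3. Scale each gene to zero mean and unit variance.
    mean = logged.mean(axis=0)
    std = logged.std(axis=0, ddof=1)
    std[std == 0] = 1.0  # constant genes are only centered
    return (logged - mean) / std

# Hypothetical raw counts: 4 cells x 3 genes.
raw = np.array([[0, 3, 7],
                [5, 0, 5],
                [2, 2, 6],
                [9, 1, 0]])
scaled = log_normalize_and_scale(raw)
```

The resulting matrix (cells by genes) is the kind of input the answer above refers to; in practice one would normally run the Scanpy functions on an AnnData object rather than reimplement them.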

Yes, the output of Scanorama is the low-dimensional embedding, which is used to compute the k-nearest neighbors graph, which is then used for visualization and clustering.
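As a rough illustration of that step, the sketch below builds a brute-force k-nearest neighbors list from an embedding with NumPy. It is not how Scanpy or Scanorama compute neighbors internally; `knn_indices` and the toy 2-D embedding are hypothetical, and a real embedding would typically have more than two dimensions, which is why it is passed to `sc.pp.neighbors` and then UMAP or t-SNE rather than plotted directly.

```python
import numpy as np

def knn_indices(embedding, k):
    """Return, for each cell, the indices of its k nearest cells (Euclidean)."""
    E = np.asarray(embedding, dtype=float)
    diff = E[:, None, :] - E[None, :, :]      # pairwise differences
    dist = np.sqrt((diff ** 2).sum(axis=2))   # pairwise distances
    np.fill_diagonal(dist, np.inf)            # a cell is not its own neighbor
    return np.argsort(dist, axis=1)[:, :k]

# Hypothetical embedding: two well-separated groups of three cells each.
embedding = np.array([[0.0, 0.0], [0.1, 0.0], [0.0, 0.1],
                      [5.0, 5.0], [5.1, 5.0], [5.0, 5.1]])
neighbors = knn_indices(embedding, k=2)
```

Here every cell's two nearest neighbors fall within its own group, which is the structure a graph-based clustering or a UMAP layout would then pick up.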