brianhie / scanorama

Panoramic stitching of single cell data
http://scanorama.csail.mit.edu
MIT License

Can scanorama.correct() be used on raw data for batch correction #142

Status: Closed (tbrunetti closed 1 year ago)

tbrunetti commented 1 year ago

Hi,

Thank you for this excellent software! I am trying to understand the scanorama.correct() function. Based on the paper and the scanpy API docs, it sounds like scanorama.correct() corrects the counts data itself by removing batch effects, where the batch is defined by the user, and that these new counts can be used in differential expression analysis between clusters, samples, etc. That would be great, since integration alone cannot account for DE differences that arise from technical artifacts. In my usual scRNA-seq workflows, I go through integration, which changes only the embeddings and not the counts. A couple of questions:

  1. If scanorama.correct() produces new counts for all genes, does it need to be applied after normalization and integration, or can those steps be bypassed and correct() applied directly to the raw count data? I am not clear on the math behind the correct() function, so I cannot tell which approach is best.

  2. After applying correct(), does the data need to be renormalized?

  3. Can I export this corrected matrix and use it in other algorithms, such as non-negative matrix factorization, to identify gene signatures?

Thank you!

brianhie commented 1 year ago

Hi @tbrunetti, great question! See this discussion: https://github.com/brianhie/scanorama/discussions/85. To summarize, I would caution against actually interpreting the values of correct() as gene expression values -- they are transformed in such a way that geometric distances among cells are meaningful, but the individual values are probably not all that meaningful.
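
To make the distinction between meaningful distances and meaningful individual values concrete, here is a minimal toy sketch in plain Python. It does not use Scanorama and does not reproduce its algorithm: the rotation is only an assumed stand-in for "some transformation that keeps distances between cells intact", and the three `cells` are made-up points.

```python
import math


def rotate(point, angle):
    """Rotate a 2-D point about the origin by `angle` radians."""
    x, y = point
    cos_a, sin_a = math.cos(angle), math.sin(angle)
    return (cos_a * x - sin_a * y, sin_a * x + cos_a * y)


def pairwise_distances(points):
    """Euclidean distance for every unordered pair of points."""
    return [
        math.dist(p, q)
        for i, p in enumerate(points)
        for q in points[i + 1:]
    ]


# Hypothetical "cells", each described by two non-negative features.
cells = [(1.0, 0.0), (0.0, 2.0), (3.0, 1.0)]

# Toy distance-preserving transformation (an assumption for illustration,
# NOT what scanorama.correct() computes).
transformed = [rotate(cell, math.radians(120)) for cell in cells]

before = pairwise_distances(cells)
after = pairwise_distances(transformed)

# Distances between cells survive the transformation...
distances_preserved = all(
    math.isclose(a, b, abs_tol=1e-9) for a, b in zip(before, after)
)
# ...but the individual coordinates no longer resemble the inputs,
# and some of them are now negative.
has_negative_values = any(value < 0 for cell in transformed for value in cell)

print("distances preserved:", distances_preserved)
print("negative values present:", has_negative_values)
```

In this toy case, clustering or nearest-neighbor analysis would give the same answer before and after, while reading a single coordinate as an "expression level" would not make sense. Methods that require non-negative input, such as non-negative matrix factorization, would also reject the transformed matrix as is. Whether the same applies to a given correct() output is worth checking directly on your own data.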