MarioniLab / scran

Clone of the Bioconductor repository for the scran package.
https://bioconductor.org/packages/devel/bioc/html/scran.html
40 stars 22 forks source link

Exporting the batch corrected data #28

Closed gouthamatla closed 5 years ago

gouthamatla commented 5 years ago

Hi,

I am trying the MNN approach following the tutorial at http://bioconductor.riken.jp/packages/release/workflows/vignettes/simpleSingleCell/inst/doc/work-5-mnn.html

I would like to know how can I export the batch corrected expression values from sce. Are sce@assays$data$logcounts the batch corrected values ?

LTLA commented 5 years ago

No. As you can see from the code, logcounts only contains the original expression values in omat:

sce <- SingleCellExperiment(list(logcounts=omat))

The corrected values are stored in the reducedDims slot:

reducedDim(sce, "MNN") <- mnn.out$corrected

These are not gene expression values, but rather, batch-corrected low-dimensional coordinates for each cell. These corrected values can be used for clustering, visualization and other cell-based procedures.

If you want per-gene corrected expression values, you need to multiply the per-cell corrected values by the rotation vector as described in the "Examples" for ?fastMNN. This is not done by default as the output matrix is dense, which can use too much memory when you have sparse data for many cells. The per-gene corrected values should not be used for much - see my comments at the bottom of this section. It is preferable to go back to the original counts when doing per-gene analyses, e.g., for differential expression.

gouthamatla commented 5 years ago

Hi Aaron,

Thanks for the quick reply.

I am thinking to use knn approach on corrected data to aggregate expression values from cells that are close to each other and run a co-expression analysis.

Do you have any suggestions for co-expression analysis if I am pooling data from different studies ?

-- Goutham A

LTLA commented 5 years ago

I assume you are referring to co-expression of gene pairs, based on pairwise correlations between gene pairs.

I am thinking to use knn approach on corrected data to aggregate expression values from cells that are close to each other and run a co-expression analysis.

This sounds like a bad idea. The correction involves a PCA step, and undoing the PCA will introduce correlations on genes that are otherwise independent. For example:

set.seed(10001)
y <- matrix(rnorm(100000), nrow=10) # each row is a gene
cor(y[1,], y[2,]) # -0.002
out <- prcomp(t(y), rank=5) # first 5 PCs.
reconstructed <- tcrossprod(out$rotation, out$x)
cor(reconstructed[1,], reconstructed[2,]) # -0.2

So, even though the genes were actually independent, running the PCA and using the reconstructed low-rank matrix will introduce a correlation that is 100 times larger than what it was! So, this is already not good; and it will only get worse with the MNN correction and your k-NN step. All of these steps run the risk of artificially inflating correlations by forcing cells (and their expression profiles) to fit in with other cells.

Do you have any suggestions for co-expression analysis if I am pooling data from different studies ?

Do it separately in each study and then perform a meta-analysis. This completely protects you from batch effects between studies. If you must analyze everything together, use blocking factors when estimating the correlation, e.g., in a (generalized) linear model. There is no need to work on the corrected values if you are doing gene-based inference.

gouthamatla commented 5 years ago

Thank very much for very useful insights. A met analysis sounds like a better and clean approach.

I was just worried that if I want to do a gene-pair expression correlation across a cell-type of interest, I should account for batch effects if I am pooling data from different studies.

I will check GLM with blocking factors or a met-analysis.