HelenaLC / muscat

Multi-sample multi-group scRNA-seq analysis tools
166 stars 33 forks source link

Question: How does aggregateData() and AverageExpression() differ from each other? #35

Closed biobug16 closed 4 years ago

biobug16 commented 4 years ago

Hi @markrobinsonuzh , @HelenaLC , @plger ,

Can any of you please explain how does muscat's aggregateData() with fun = "mean" differ from Seurat's AverageExpression() and which one is better to get average expression of genes for each cluster?

Actually my doubt is Seurat's AverageExpression() should exactly be the same as muscat's aggregateData() with fun="mean". As both are just the average gene expression values from a group of cells from each of the identified cell-type or cluster. But when I checked it in my data, it was not the case.

May be I am getting it completely wrong. Can someone please correct me if I am missing something?

Thanks in advance.

HelenaLC commented 4 years ago

This is indeed strange... I ran the following minimal example:

> library(muscat)
> library(scater)
> library(Seurat)
> 
> data("sce")
> sce <- logNormCounts(sce)
> pbs_muscat <- aggregateData(sce, 
+     assay = "logcounts",
+     by = "cluster_id",
+     fun = "mean")
> 
> so <- as.Seurat(sce, 
+     counts = "counts",
+     data = "logcounts")
> Idents(so) <- sce$cluster_id
> pbs_Seurat <- AverageExpression(so, verbose = FALSE)
> 
> pbs_muscat <- assay(pbs_muscat)
> pbs_Seurat <- as.matrix(pbs_Seurat$RNA)
> 
> pbs_muscat[1:5, 1:5]
            B cells CD14+ Monocytes CD4 T cells CD8 T cells FCGR3A+ Monocytes
HES4     0.03451104       0.2746553  0.01533508  0.04723727         0.2711576
ISG15    1.97664108       3.1895385  1.71264009  1.80143888         3.1022842
AURKAIP1 0.31255648       0.2983547  0.19292368  0.24780775         0.2630088
MRPL20   0.16645940       0.1238642  0.24418788  0.18983882         0.1247589
SSU72    0.37203387       0.1999495  0.21077588  0.15557740         0.2051142
> pbs_Seurat[1:5, 1:5]
             B cells CD14+ Monocytes CD4 T cells CD8 T cells FCGR3A+ Monocytes
HES4      0.06972199       0.6925305  0.03149638   0.1097811         0.6254975
ISG15    29.69192097     282.2793200 21.14922297  21.5283969       155.9945484
AURKAIP1  0.71541977       0.5164700  0.46292999   0.7070776         0.4567129
MRPL20    0.41384871       0.2139833  0.60588928   0.5057150         0.2024755
SSU72     0.86936154       0.3484695  0.48057468   0.3808275         0.3440797

which is not the same. However,

> head(rowMeans(logcounts(sce[, sce$cluster_id == "B cells"])))
      HES4      ISG15   AURKAIP1     MRPL20      SSU72       RER1 
0.03451104 1.97664108 0.31255648 0.16645940 0.37203387 0.27547975 

which is the 1st column of muscat's results. Besides, there are plenty of unit-tests checking that aggregateData returns the correct results for various assay, fun arguments, NA values etc. etc.

So, as you said, I would expect these are the same but it appears they are not. However, I am unable to figure out at the moment what exactly it is Seurat is computing. In short: I am certain (...as much as any developer can be...) that muscat is doing it "right" in terms of what we want: Average expression (here, logcounts). The question is: What is Seurat doing?

HelenaLC commented 4 years ago

Okay, I think I figured it out... From the code here: https://github.com/satijalab/seurat/blob/master/R/utilities.R#L245, it looks like Seurat's AverageExpression is taking rowMeans(expm1(data)) / rowMeans(exp(data)-1), thus reverting the log-transformation & addition of a pseudo-count (default 1 for scater's logNormCounts). However, taking the log -> mean or log -> exp -> mean -> log is not the same.

> x <- runif(10)
> mean(x)
[1] 0.5294414
> log1p(mean(expm1(x)))
[1] 0.5550143

The expm1 (what Seurat does) is in my opinion also problematic, as logcounts could really be anything, and there is no guarantee these are natural-log-transformed data with a pseudo-count of 1. E.g., in scater::logNormCounts one could alter the pseudo-count to say 2. Or the data are VST residuals or anything else that is "expression-like"... Admittedly, I am not sure what effect this really has (both statistically and/or biologically), but if you want "the average of what you put in" I'd say muscat is doing just that.