Closed biobug16 closed 4 years ago
This is indeed strange... I ran the following minimal example:
> library(muscat)
> library(scater)
> library(Seurat)
>
> data("sce")
> sce <- logNormCounts(sce)
> pbs_muscat <- aggregateData(sce,
+ assay = "logcounts",
+ by = "cluster_id",
+ fun = "mean")
>
> so <- as.Seurat(sce,
+ counts = "counts",
+ data = "logcounts")
> Idents(so) <- sce$cluster_id
> pbs_Seurat <- AverageExpression(so, verbose = FALSE)
>
> pbs_muscat <- assay(pbs_muscat)
> pbs_Seurat <- as.matrix(pbs_Seurat$RNA)
>
> pbs_muscat[1:5, 1:5]
B cells CD14+ Monocytes CD4 T cells CD8 T cells FCGR3A+ Monocytes
HES4 0.03451104 0.2746553 0.01533508 0.04723727 0.2711576
ISG15 1.97664108 3.1895385 1.71264009 1.80143888 3.1022842
AURKAIP1 0.31255648 0.2983547 0.19292368 0.24780775 0.2630088
MRPL20 0.16645940 0.1238642 0.24418788 0.18983882 0.1247589
SSU72 0.37203387 0.1999495 0.21077588 0.15557740 0.2051142
> pbs_Seurat[1:5, 1:5]
B cells CD14+ Monocytes CD4 T cells CD8 T cells FCGR3A+ Monocytes
HES4 0.06972199 0.6925305 0.03149638 0.1097811 0.6254975
ISG15 29.69192097 282.2793200 21.14922297 21.5283969 155.9945484
AURKAIP1 0.71541977 0.5164700 0.46292999 0.7070776 0.4567129
MRPL20 0.41384871 0.2139833 0.60588928 0.5057150 0.2024755
SSU72 0.86936154 0.3484695 0.48057468 0.3808275 0.3440797
which is not the same. However,
> head(rowMeans(logcounts(sce[, sce$cluster_id == "B cells"])))
HES4 ISG15 AURKAIP1 MRPL20 SSU72 RER1
0.03451104 1.97664108 0.31255648 0.16645940 0.37203387 0.27547975
which is the 1st column of muscat
's results. Besides, there are plenty of unit-tests checking that aggregateData
returns the correct results for various assay
, fun
arguments, NA values etc. etc.
So, as you said, I would expect these are the same but it appears they are not. However, I am unable to figure out at the moment what exactly it is Seurat
is computing. In short: I am certain (...as much as any developer can be...) that muscat
is doing it "right" in terms of what we want: Average expression (here, logcounts
). The question is: What is Seurat
doing?
Okay, I think I figured it out... From the code here: https://github.com/satijalab/seurat/blob/master/R/utilities.R#L245, it looks like Seurat
's AverageExpression
is taking rowMeans(expm1(data)) / rowMeans(exp(data)-1)
, thus reverting the log-transformation & addition of a pseudo-count (default 1 for scater
's logNormCounts
). However, taking the log -> mean or log -> exp -> mean -> log is not the same.
> x <- runif(10)
> mean(x)
[1] 0.5294414
> log1p(mean(expm1(x)))
[1] 0.5550143
The expm1
(what Seurat
does) is in my opinion also problematic, as logcounts
could really be anything, and there is no guarantee these are natural-log-transformed data with a pseudo-count of 1. E.g., in scater::logNormCounts
one could alter the pseudo-count to say 2. Or the data are VST residuals or anything else that is "expression-like"...
Admittedly, I am not sure what effect this really has (both statistically and/or biologically), but if you want "the average of what you put in" I'd say muscat
is doing just that.
Hi @markrobinsonuzh , @HelenaLC , @plger ,
Can any of you please explain how does muscat's aggregateData() with fun = "mean" differ from Seurat's AverageExpression() and which one is better to get average expression of genes for each cluster?
Actually my doubt is Seurat's AverageExpression() should exactly be the same as muscat's aggregateData() with fun="mean". As both are just the average gene expression values from a group of cells from each of the identified cell-type or cluster. But when I checked it in my data, it was not the case.
May be I am getting it completely wrong. Can someone please correct me if I am missing something?
Thanks in advance.