Closed PeteHaitch closed 2 years ago
For BioC 3.15, one option would be to replace https://github.com/OSCA-source/OSCA.basic/blob/f7910c6d755571e5efab1828427f0b98bfb5d423/inst/book/cell-annotation.Rmd#L286-L287
with explicit coercion of the input to a dense matrix
muraro.rankings <- AUCell_buildRankings(as.matrix(muraro.mat),
plotStats=FALSE, verbose=FALSE)
I made this change locally ('local output with as.matrix()
'[^1]) and compared the outputs to what we currently have in https://bioconductor.org/books/3.15/OSCA.basic/cell-type-annotation.html ('current output').
Things looked okay up until https://github.com/OSCA-source/OSCA.basic/blob/f7910c6d755571e5efab1828427f0b98bfb5d423/inst/book/cell-annotation.Rmd#L255-L258
at which point I noticed differences in some of the overlaid densities curves, most notably the dashed curves for the 'interneurons' (pink) and 'pyramidal CA1' and 'pyrimidal SS' (pink and red). There are also some other cosmetic differences such as different tick marks and values on the y-axes.
[^1]: These results come from rendering just cell-annotations.Rmd
file (via the 'knit' button in RStudio) rather than rendering the whole book.
as.matrix()
(BioC 3.15)There are some other slight differences between 'current' and 'local copy with as.matrix()
', but I think these are unrelated to AUCell:
Some slight different numbers of genes in example GO terms from the goanna()
analysis in https://github.com/OSCA-source/OSCA.basic/blob/f7910c6d755571e5efab1828427f0b98bfb5d423/inst/book/cell-annotation.Rmd#L346-L361
I presume this is due to updates to the annotations.
There are consequent small differences in dim(aggregated)
, specifically the number of rows, as computed in https://github.com/OSCA-source/OSCA.basic/blob/f7910c6d755571e5efab1828427f0b98bfb5d423/inst/book/cell-annotation.Rmd#L408-L413
I presume this is due to the aforementioned slight differences in the gene sets because rows are gene sets in aggregated
dim(aggregated) # rows are gene sets, columns are cells
## [1] 22607 2772
as.matrix()
(BioC 3.15)dim(aggregated) # rows are gene sets, columns are cells
## [1] 22657 2772
I also ran a version locally where https://github.com/OSCA-source/OSCA.basic/blob/f7910c6d755571e5efab1828427f0b98bfb5d423/inst/book/cell-annotation.Rmd#L286-L287
was changed to use the splitByBlocks=TRUE
argument
muraro.rankings <- AUCell_buildRankings(muraro.mat,
plotStats=FALSE, verbose=FALSE, splitByBlocks=TRUE)
This is as advised by the AUCell authors for sparse matrix inputs.
I made this change locally ('local output with splitByBlocks=TRUE
'[^1]) and compared the outputs to what we currently have in https://bioconductor.org/books/3.15/OSCA.basic/cell-type-annotation.html ('current output').
[^1]: These results come from rendering just cell-annotations.Rmd
file (via the 'knit' button in RStudio) rather than rendering the whole book.
Again, things looked okay up until https://github.com/OSCA-source/OSCA.basic/blob/f7910c6d755571e5efab1828427f0b98bfb5d423/inst/book/cell-annotation.Rmd#L255-L258
splitByBlocks=TRUE
(BioC 3.15)In fact, these outputs look the same to me as those in Local output with as.matrix()
(BioC 3.15)
I now realise that these plots are made before the problematic lines of https://github.com/OSCA-source/OSCA.basic/blob/f7910c6d755571e5efab1828427f0b98bfb5d423/inst/book/cell-annotation.Rmd#L286-L287
so this change in plotting behaviour is unrelated to the issue of sparse matrix support in AUCell_buildRankings()
.
However, as previously noted, the muraro.rankings
i.e. the output of
https://github.com/OSCA-source/OSCA.basic/blob/f7910c6d755571e5efab1828427f0b98bfb5d423/inst/book/cell-annotation.Rmd#L286-L287
are actually different depending on whether the input is explicitly coerced to a dense matrix or the splitByBlocks=TRUE
argument is used with a sparse matrix input.
> muraro.rankings.as.matrix<- AUCell_buildRankings(
+ as.matrix(muraro.mat),
+ plotStats = FALSE,
+ verbose = FALSE)
>
> muraro.rankings.splitByBlocks <- AUCell_buildRankings(
+ muraro.mat,
+ plotStats = FALSE,
+ verbose = FALSE,
+ splitByBlocks = TRUE)
>
> muraro.rankings.as.matrix
Ranking for 16940 genes (rows) and 2079 cells (columns).
Top-left corner of the ranking:
cells
genes D28-1_1 D28-1_2 D28-1_4 D28-1_13 D28-1_15
A1BG-AS1 12375 7755 15108 16914 10106
A1BG 6971 13236 5730 10117 14253
A1CF 538 9809 1264 10216 716
A2M-AS1 15103 16274 14683 13022 12074
A2ML1 5286 12958 12528 6641 16138
A2M 15909 1434 10091 13875 11194
>
> muraro.rankings.splitByBlocks
Ranking for 16940 genes (rows) and 2079 cells (columns).
Top-left corner of the ranking:
cells
genes D28-1_1 D28-1_2 D28-1_4 D28-1_13 D28-1_15
A1BG-AS1 10989 16401 14765 12735 14289
A1BG 7375 11922 5759 10930 9983
A1CF 521 7797 1199 6793 691
A2M-AS1 13144 7071 16628 6286 6671
A2ML1 5443 12637 9928 5721 12270
A2M 9706 1331 15394 14266 15242
Yet the subsequent heatmaps used in the book look the very similar despite these differences in output (which ultimately influence the inputs to the heatmaps).
splitByBlocks=TRUE
(BioC 3.15)In a bit more detail, here's a reprex showing that AUCell::AUCell_buildRankings()
, as applied to the Muraro dataset gives different results for dense or sparse matrix input (BioC 3.15).
But the heatmaps shown in OSCA will look very similar regardless of whether it's dense or sparse input because the differences in output are subtle.
library(rebook)
suppressPackageStartupMessages(library(SingleCellExperiment))
suppressPackageStartupMessages(library(GSEABase))
library(AUCell)
extractFromPackage("muraro-pancreas.Rmd", package="OSCA.workflows",
chunk="normalization", objects="sce.muraro")
#> <button class="rebook-collapse">View set-up code ([Workflow Chapter 6](http://bioconductor.org/books/3.15/OSCA.workflows/muraro-human-pancreas-cel-seq.html#muraro-human-pancreas-cel-seq))</button>
#> <div class="rebook-content">
#>
#> ```r
#> #--- loading ---#
#> library(scRNAseq)
#> sce.muraro <- MuraroPancreasData()
#>
#> #--- gene-annotation ---#
#> library(AnnotationHub)
#> edb <- AnnotationHub()[["AH73881"]]
#> gene.symb <- sub("__chr.*$", "", rownames(sce.muraro))
#> gene.ids <- mapIds(edb, keys=gene.symb,
#> keytype="SYMBOL", column="GENEID")
#>
#> # Removing duplicated genes or genes without Ensembl IDs.
#> keep <- !is.na(gene.ids) & !duplicated(gene.ids)
#> sce.muraro <- sce.muraro[keep,]
#> rownames(sce.muraro) <- gene.ids[keep]
#>
#> #--- quality-control ---#
#> library(scater)
#> stats <- perCellQCMetrics(sce.muraro)
#> qc <- quickPerCellQC(stats, percent_subsets="altexps_ERCC_percent",
#> batch=sce.muraro$donor, subset=sce.muraro$donor!="D28")
#> sce.muraro <- sce.muraro[,!qc$discard]
#>
#> #--- normalization ---#
#> library(scran)
#> set.seed(1000)
#> clusters <- quickCluster(sce.muraro)
#> sce.muraro <- computeSumFactors(sce.muraro, clusters=clusters)
#> sce.muraro <- logNormCounts(sce.muraro)
#> ```
#>
#> </div>
# Pruning out unknown or unclear labels.
sce.muraro <- sce.muraro[,!is.na(sce.muraro$label) &
sce.muraro$label!="unclear"]
# Downloading the signatures and caching them locally.
library(BiocFileCache)
#> Loading required package: dbplyr
bfc <- BiocFileCache(ask=FALSE)
scsig.path <- bfcrpath(bfc, file.path("http://software.broadinstitute.org",
"gsea/msigdb/supplemental/scsig.all.v1.0.symbols.gmt"))
scsigs <- getGmt(scsig.path)
muraro.mat <- counts(sce.muraro)
rownames(muraro.mat) <- rowData(sce.muraro)$symbol
runAUCell <- function(x, scsigs) {
if (is.matrix(x)) {
rankings <- AUCell_buildRankings(
x,
plotStats = FALSE,
verbose = FALSE)
} else if (is(x, "sparseMatrix")) {
rankings <- AUCell_buildRankings(
x,
plotStats = FALSE,
verbose = FALSE,
splitByBlocks = TRUE)
}
# Applying MsigDB to the Muraro dataset, because it's human:
AUCell_calcAUC(scsigs, rankings)
}
# Run AUCell on the same data stored as (1) dense matrix or (2) sparse matrix.
dense <- suppressMessages(runAUCell(as.matrix(muraro.mat), scsigs))
sparse <- suppressMessages(runAUCell(muraro.mat, scsigs))
# These return different results.
all.equal(dense, sparse)
#> [1] "Attributes: < Component \"assays\": Attributes: < Component \"data\": Attributes: < Component \"listData\": Component \"AUC\": Mean relative difference: 0.02188224 > > >"
# Specifically, the AUC values (the key output of AUCell!) are different.
all.equal(getAUC(dense), getAUC(sparse))
#> [1] "Mean relative difference: 0.02188224"
# Consequently, assigning cells based on the max AUC, as done in OSCA, gives
# different results for a small number of cells.
all.equal(
rownames(dense)[max.col(t(getAUC(dense)))],
rownames(sparse)[max.col(t(getAUC(sparse)))])
#> [1] "21 string mismatches"
Created on 2022-10-13 with reprex v2.0.2
To summarise (and apologies if you've already reached these conclusions):
AUCell::AUCell_buildRankings()
.AUCell::AUCell_buildRankings()
is unreliable.
a. It gives different results for dense and sparse inputs.
b. It does not automatically/gracefully handle sparse inputs as it should.(2) is annoying, but I'm inclined to ignore it. @Alanocallaghan Please let me know your thoughts on ignoring (1) and (2).
(3a) is definitely a problem, but the AUCell authors don't seem inclined to fix this. My assumption (hope?) is that the calculations for dense inputs are correct/reliable, so I'm inclined to use those.
(3b) is the killer and is what is preventing a clean build of OSCA.basic in both BioC 3.15 and 3.16.
I think our options to fix (3b) are:
AUCell::AUCell_buildRankings()
to a dense matrix via as.matrix()
.splitByBlocks
to the problematic call to AUCell::AUCell_buildRankings()
.I'm inclined to go with (2) and perhaps move to (1) during the next development cycle but could be swayed another way.
@Alanocallaghan Please let me know your thoughts.
Very thorough thanks Peter. I think you're right on all points; I am a bit worried by conclusion 3) but I don't really like removing a section without replacing it. As such I think option 2) for now and to consider option 1) next release is a good choice. Particularly as it would give us more time to evaluate options to replace it.
Thanks Alan. I've made the change in BioC 3.15 (https://github.com/OSCA-source/OSCA.basic/commit/6d09e8706d984e17ee1c24b09eef428a2e153585) and BioC 3.16 (4fb456a127ae25ee9dfb051a3d95ddbe1585c870). It passes for me on 3.15 and I've just started a check on 3.16 on my machine (macOS Intel) but I'm heading to bed so I won't know until the morning. But I'm fairly hopeful it'll pass there, too.
In any case, I've pushed both to the BioC git server with a version bump to trigger a build. Fingers crossed!
Following my overnight check on my laptop I found that another small tweak was required in BioC 3.16 (e0b0c9a110a92e85eec75acfe61e7997069b1504). I've now pushed this to BioC git server with a version bump. I'll keep an eye on the build reports over the weekend.
Success! Now building and passing checks in BioC 3.16 as well as BioC 3.15
Trying to bring this issue to a close for both BioC 3.15 and 3.16. The issue was initially noted in https://github.com/OSCA-source/OSCA.basic/issues/4#issuecomment-1133066447 It was followed up in https://github.com/aertslab/AUCell/issues/28, but the discussion in that issue and lack of follow-up by the AUCell authors leaves me with little hope that they will fix the underlying problem.
The simplest fix, albeit somewhat drastic, is to simply remove usage of AUCell from OSCA, as proposed/implemented in https://github.com/OSCA-source/OSCA.basic/pull/7.
I'll document here some attempts to work around (if not fix) the issue.