Cell type annotation chapter broken because of breaking changes in AUCell

PeteHaitch commented 2 years ago

Trying to bring this issue to a close for both BioC 3.15 and 3.16. The issue was initially noted in https://github.com/OSCA-source/OSCA.basic/issues/4#issuecomment-1133066447 It was followed up in https://github.com/aertslab/AUCell/issues/28, but the discussion in that issue and lack of follow-up by the AUCell authors leaves me with little hope that they will fix the underlying problem.

The simplest fix, albeit somewhat drastic, is to simply remove usage of AUCell from OSCA, as proposed/implemented in https://github.com/OSCA-source/OSCA.basic/pull/7.

I'll document here some attempts to work around (if not fix) the issue.

PeteHaitch commented 2 years ago

For BioC 3.15, one option would be to replace https://github.com/OSCA-source/OSCA.basic/blob/f7910c6d755571e5efab1828427f0b98bfb5d423/inst/book/cell-annotation.Rmd#L286-L287

with explicit coercion of the input to a dense matrix

muraro.rankings <- AUCell_buildRankings(as.matrix(muraro.mat),
    plotStats=FALSE, verbose=FALSE)

I made this change locally ('local output with as.matrix()'[^1]) and compared the outputs to what we currently have in https://bioconductor.org/books/3.15/OSCA.basic/cell-type-annotation.html ('current output').

Things looked okay up until https://github.com/OSCA-source/OSCA.basic/blob/f7910c6d755571e5efab1828427f0b98bfb5d423/inst/book/cell-annotation.Rmd#L255-L258

at which point I noticed differences in some of the overlaid densities curves, most notably the dashed curves for the 'interneurons' (pink) and 'pyramidal CA1' and 'pyrimidal SS' (pink and red). There are also some other cosmetic differences such as different tick marks and values on the y-axes.

[^1]: These results come from rendering just cell-annotations.Rmd file (via the 'knit' button in RStudio) rather than rendering the whole book.

Current output (BioC 3.15)

Local output with `as.matrix()` (BioC 3.15)

PeteHaitch commented 2 years ago

There are some other slight differences between 'current' and 'local copy with as.matrix()', but I think these are unrelated to AUCell:

Some slight different numbers of genes in example GO terms from the goanna() analysis in https://github.com/OSCA-source/OSCA.basic/blob/f7910c6d755571e5efab1828427f0b98bfb5d423/inst/book/cell-annotation.Rmd#L346-L361

I presume this is due to updates to the annotations.

There are consequent small differences in dim(aggregated), specifically the number of rows, as computed in https://github.com/OSCA-source/OSCA.basic/blob/f7910c6d755571e5efab1828427f0b98bfb5d423/inst/book/cell-annotation.Rmd#L408-L413

I presume this is due to the aforementioned slight differences in the gene sets because rows are gene sets in aggregated

Current output (BioC 3.15)

dim(aggregated) # rows are gene sets, columns are cells
## [1] 22607  2772

Local output with `as.matrix()` (BioC 3.15)

dim(aggregated) # rows are gene sets, columns are cells
## [1] 22657  2772

PeteHaitch commented 2 years ago

I also ran a version locally where https://github.com/OSCA-source/OSCA.basic/blob/f7910c6d755571e5efab1828427f0b98bfb5d423/inst/book/cell-annotation.Rmd#L286-L287

was changed to use the splitByBlocks=TRUE argument

muraro.rankings <- AUCell_buildRankings(muraro.mat,
    plotStats=FALSE, verbose=FALSE, splitByBlocks=TRUE)

This is as advised by the AUCell authors for sparse matrix inputs.

I made this change locally ('local output with splitByBlocks=TRUE'[^1]) and compared the outputs to what we currently have in https://bioconductor.org/books/3.15/OSCA.basic/cell-type-annotation.html ('current output').

[^1]: These results come from rendering just cell-annotations.Rmd file (via the 'knit' button in RStudio) rather than rendering the whole book.

Again, things looked okay up until https://github.com/OSCA-source/OSCA.basic/blob/f7910c6d755571e5efab1828427f0b98bfb5d423/inst/book/cell-annotation.Rmd#L255-L258

Local output with `splitByBlocks=TRUE` (BioC 3.15)

In fact, these outputs look the same to me as those in Local output with as.matrix() (BioC 3.15)

PeteHaitch commented 2 years ago

I now realise that these plots are made before the problematic lines of https://github.com/OSCA-source/OSCA.basic/blob/f7910c6d755571e5efab1828427f0b98bfb5d423/inst/book/cell-annotation.Rmd#L286-L287 so this change in plotting behaviour is unrelated to the issue of sparse matrix support in AUCell_buildRankings().

PeteHaitch commented 2 years ago

However, as previously noted, the muraro.rankings i.e. the output of https://github.com/OSCA-source/OSCA.basic/blob/f7910c6d755571e5efab1828427f0b98bfb5d423/inst/book/cell-annotation.Rmd#L286-L287

are actually different depending on whether the input is explicitly coerced to a dense matrix or the splitByBlocks=TRUE argument is used with a sparse matrix input.

> muraro.rankings.as.matrix<- AUCell_buildRankings(
+   as.matrix(muraro.mat),
+   plotStats = FALSE, 
+   verbose = FALSE)
> 
> muraro.rankings.splitByBlocks <- AUCell_buildRankings(
+   muraro.mat,
+   plotStats = FALSE, 
+   verbose = FALSE, 
+   splitByBlocks = TRUE)
> 
> muraro.rankings.as.matrix
Ranking for 16940 genes (rows) and 2079 cells (columns).

Top-left corner of the ranking:
          cells
genes      D28-1_1 D28-1_2 D28-1_4 D28-1_13 D28-1_15
  A1BG-AS1   12375    7755   15108    16914    10106
  A1BG        6971   13236    5730    10117    14253
  A1CF         538    9809    1264    10216      716
  A2M-AS1    15103   16274   14683    13022    12074
  A2ML1       5286   12958   12528     6641    16138
  A2M        15909    1434   10091    13875    11194
> 
> muraro.rankings.splitByBlocks
Ranking for 16940 genes (rows) and 2079 cells (columns).

Top-left corner of the ranking:
          cells
genes      D28-1_1 D28-1_2 D28-1_4 D28-1_13 D28-1_15
  A1BG-AS1   10989   16401   14765    12735    14289
  A1BG        7375   11922    5759    10930     9983
  A1CF         521    7797    1199     6793      691
  A2M-AS1    13144    7071   16628     6286     6671
  A2ML1       5443   12637    9928     5721    12270
  A2M         9706    1331   15394    14266    15242

Yet the subsequent heatmaps used in the book look the very similar despite these differences in output (which ultimately influence the inputs to the heatmaps).

Current output (BioC 3.15)

Local output with `splitByBlocks=TRUE` (BioC 3.15)

PeteHaitch commented 2 years ago

In a bit more detail, here's a reprex showing that AUCell::AUCell_buildRankings(), as applied to the Muraro dataset gives different results for dense or sparse matrix input (BioC 3.15). But the heatmaps shown in OSCA will look very similar regardless of whether it's dense or sparse input because the differences in output are subtle.

library(rebook)
suppressPackageStartupMessages(library(SingleCellExperiment))
suppressPackageStartupMessages(library(GSEABase))
library(AUCell)

extractFromPackage("muraro-pancreas.Rmd", package="OSCA.workflows",
                   chunk="normalization", objects="sce.muraro")
#> <button class="rebook-collapse">View set-up code ([Workflow Chapter 6](http://bioconductor.org/books/3.15/OSCA.workflows/muraro-human-pancreas-cel-seq.html#muraro-human-pancreas-cel-seq))</button>
#> <div class="rebook-content">
#> 
#> ```r
#> #--- loading ---#
#> library(scRNAseq)
#> sce.muraro <- MuraroPancreasData()
#> 
#> #--- gene-annotation ---#
#> library(AnnotationHub)
#> edb <- AnnotationHub()[["AH73881"]]
#> gene.symb <- sub("__chr.*$", "", rownames(sce.muraro))
#> gene.ids <- mapIds(edb, keys=gene.symb, 
#>     keytype="SYMBOL", column="GENEID")
#> 
#> # Removing duplicated genes or genes without Ensembl IDs.
#> keep <- !is.na(gene.ids) & !duplicated(gene.ids)
#> sce.muraro <- sce.muraro[keep,]
#> rownames(sce.muraro) <- gene.ids[keep]
#> 
#> #--- quality-control ---#
#> library(scater)
#> stats <- perCellQCMetrics(sce.muraro)
#> qc <- quickPerCellQC(stats, percent_subsets="altexps_ERCC_percent",
#>     batch=sce.muraro$donor, subset=sce.muraro$donor!="D28")
#> sce.muraro <- sce.muraro[,!qc$discard]
#> 
#> #--- normalization ---#
#> library(scran)
#> set.seed(1000)
#> clusters <- quickCluster(sce.muraro)
#> sce.muraro <- computeSumFactors(sce.muraro, clusters=clusters)
#> sce.muraro <- logNormCounts(sce.muraro)
#> ```
#> 
#> </div>

# Pruning out unknown or unclear labels.
sce.muraro <- sce.muraro[,!is.na(sce.muraro$label) & 
                           sce.muraro$label!="unclear"]

# Downloading the signatures and caching them locally.
library(BiocFileCache)
#> Loading required package: dbplyr
bfc <- BiocFileCache(ask=FALSE)
scsig.path <- bfcrpath(bfc, file.path("http://software.broadinstitute.org",
                                      "gsea/msigdb/supplemental/scsig.all.v1.0.symbols.gmt"))
scsigs <- getGmt(scsig.path)

muraro.mat <- counts(sce.muraro)
rownames(muraro.mat) <- rowData(sce.muraro)$symbol

runAUCell <- function(x, scsigs) {
  if (is.matrix(x)) {
    rankings <- AUCell_buildRankings(
      x,
      plotStats = FALSE, 
      verbose = FALSE) 
  } else if (is(x, "sparseMatrix")) {
    rankings <- AUCell_buildRankings(
      x,
      plotStats = FALSE, 
      verbose = FALSE,
      splitByBlocks = TRUE) 
  }
  # Applying MsigDB to the Muraro dataset, because it's human:
  AUCell_calcAUC(scsigs, rankings)
}

# Run AUCell on the same data stored as (1) dense matrix or (2) sparse matrix.
dense <- suppressMessages(runAUCell(as.matrix(muraro.mat), scsigs))
sparse <- suppressMessages(runAUCell(muraro.mat, scsigs))

# These return different results.
all.equal(dense, sparse)
#> [1] "Attributes: < Component \"assays\": Attributes: < Component \"data\": Attributes: < Component \"listData\": Component \"AUC\": Mean relative difference: 0.02188224 > > >"

# Specifically, the AUC values (the key output of AUCell!) are different.
all.equal(getAUC(dense), getAUC(sparse))
#> [1] "Mean relative difference: 0.02188224"

# Consequently, assigning cells based on the max AUC, as done in OSCA, gives
# different results for a small number of cells.
all.equal(
  rownames(dense)[max.col(t(getAUC(dense)))],
  rownames(sparse)[max.col(t(getAUC(sparse)))])
#> [1] "21 string mismatches"

^{Created on 2022-10-13 with reprex v2.0.2}

Session info

``` r sessioninfo::session_info() #> ─ Session info ─────────────────────────────────────────────────────────────── #> setting value #> version R version 4.2.1 (2022-06-23) #> os macOS Big Sur ... 10.16 #> system x86_64, darwin17.0 #> ui X11 #> language (EN) #> collate en_AU.UTF-8 #> ctype en_AU.UTF-8 #> tz Australia/Melbourne #> date 2022-10-13 #> pandoc 2.18 @ /Applications/RStudio.app/Contents/MacOS/quarto/bin/tools/ (via rmarkdown) #> #> ─ Packages ─────────────────────────────────────────────────────────────────── #> ! package * version date (UTC) lib source #> P annotate * 1.74.0 2022-04-26 [?] Bioconductor #> P AnnotationDbi * 1.58.0 2022-04-26 [?] Bioconductor #> P assertthat 0.2.1 2019-03-21 [?] CRAN (R 4.2.0) #> P AUCell * 1.18.1 2022-05-19 [?] Bioconductor #> P Biobase * 2.56.0 2022-04-26 [?] Bioconductor #> P BiocFileCache * 2.4.0 2022-04-26 [?] Bioconductor #> P BiocGenerics * 0.42.0 2022-04-26 [?] Bioconductor #> P BiocManager 1.30.18 2022-05-18 [?] CRAN (R 4.2.0) #> P BiocStyle 2.24.0 2022-04-26 [?] Bioconductor #> P Biostrings 2.64.1 2022-08-18 [?] Bioconductor #> P bit 4.0.4 2020-08-04 [?] CRAN (R 4.2.0) #> P bit64 4.0.5 2020-08-30 [?] CRAN (R 4.2.0) #> P bitops 1.0-7 2021-04-24 [?] CRAN (R 4.2.0) #> P blob 1.2.3 2022-04-10 [?] CRAN (R 4.2.0) #> P cachem 1.0.6 2021-08-19 [?] CRAN (R 4.2.0) #> P cli 3.4.1 2022-09-23 [?] CRAN (R 4.2.0) #> P CodeDepends 0.6.5 2018-07-17 [?] CRAN (R 4.2.0) #> P codetools 0.2-18 2020-11-04 [3] CRAN (R 4.2.1) #> P crayon 1.5.2 2022-09-29 [?] CRAN (R 4.2.0) #> P curl 4.3.3 2022-10-06 [?] CRAN (R 4.2.0) #> P data.table 1.14.2 2021-09-27 [?] CRAN (R 4.2.0) #> P DBI 1.1.3 2022-06-18 [?] CRAN (R 4.2.0) #> P dbplyr * 2.2.1 2022-06-27 [?] CRAN (R 4.2.0) #> P DelayedArray 0.22.0 2022-04-26 [?] Bioconductor #> P DelayedMatrixStats 1.18.1 2022-09-27 [?] Bioconductor #> P digest 0.6.29 2021-12-01 [?] CRAN (R 4.2.0) #> P dir.expiry 1.4.0 2022-04-26 [?] Bioconductor #> P dplyr 1.0.10 2022-09-01 [?] CRAN (R 4.2.0) #> P ellipsis 0.3.2 2021-04-29 [?] CRAN (R 4.2.0) #> P evaluate 0.17 2022-10-07 [?] CRAN (R 4.2.0) #> P fansi 1.0.3 2022-03-24 [?] CRAN (R 4.2.0) #> P fastmap 1.1.0 2021-01-25 [?] CRAN (R 4.2.0) #> P filelock 1.0.2 2018-10-05 [?] CRAN (R 4.2.0) #> P fs 1.5.2 2021-12-08 [?] CRAN (R 4.2.0) #> P generics 0.1.3 2022-07-05 [?] CRAN (R 4.2.0) #> P GenomeInfoDb * 1.32.4 2022-09-06 [?] Bioconductor #> P GenomeInfoDbData 1.2.8 2022-10-11 [?] Bioconductor #> P GenomicRanges * 1.48.0 2022-04-26 [?] Bioconductor #> P glue 1.6.2 2022-02-24 [?] CRAN (R 4.2.0) #> P graph * 1.74.0 2022-04-26 [?] Bioconductor #> P GSEABase * 1.58.0 2022-04-26 [?] Bioconductor #> P highr 0.9 2021-04-16 [?] CRAN (R 4.2.0) #> P htmltools 0.5.3 2022-07-18 [?] CRAN (R 4.2.0) #> P httpuv 1.6.6 2022-09-08 [?] CRAN (R 4.2.0) #> P httr 1.4.4 2022-08-17 [?] CRAN (R 4.2.0) #> P IRanges * 2.30.1 2022-08-18 [?] Bioconductor #> P KEGGREST 1.36.3 2022-07-14 [?] Bioconductor #> P knitr 1.40 2022-08-24 [?] CRAN (R 4.2.0) #> P later 1.3.0 2021-08-18 [?] CRAN (R 4.2.0) #> P lattice 0.20-45 2021-09-22 [3] CRAN (R 4.2.1) #> P lifecycle 1.0.3 2022-10-07 [?] CRAN (R 4.2.0) #> P magrittr 2.0.3 2022-03-30 [?] CRAN (R 4.2.0) #> P Matrix 1.5-1 2022-09-13 [3] CRAN (R 4.2.0) #> P MatrixGenerics * 1.8.1 2022-06-30 [?] Bioconductor #> P matrixStats * 0.62.0 2022-04-19 [?] CRAN (R 4.2.0) #> P memoise 2.0.1 2021-11-26 [?] CRAN (R 4.2.0) #> P mime 0.12 2021-09-28 [?] CRAN (R 4.2.0) #> P pillar 1.8.1 2022-08-19 [?] CRAN (R 4.2.0) #> P pkgconfig 2.0.3 2019-09-22 [?] CRAN (R 4.2.0) #> P png 0.1-7 2013-12-03 [?] CRAN (R 4.2.0) #> P promises 1.2.0.1 2021-02-11 [?] CRAN (R 4.2.0) #> P purrr 0.3.5 2022-10-06 [?] CRAN (R 4.2.0) #> R.cache 0.16.0 2022-07-21 [3] CRAN (R 4.2.0) #> P R.methodsS3 1.8.2 2022-06-13 [?] CRAN (R 4.2.0) #> P R.oo 1.25.0 2022-06-12 [?] CRAN (R 4.2.0) #> P R.utils 2.12.0 2022-06-28 [?] CRAN (R 4.2.0) #> P R6 2.5.1 2021-08-19 [?] CRAN (R 4.2.0) #> P rappdirs 0.3.3 2021-01-31 [?] CRAN (R 4.2.0) #> P Rcpp 1.0.9 2022-07-08 [?] CRAN (R 4.2.0) #> P RCurl 1.98-1.9 2022-10-03 [?] CRAN (R 4.2.0) #> P rebook * 1.6.0 2022-04-26 [?] Bioconductor #> P reprex 2.0.2 2022-08-17 [?] CRAN (R 4.2.0) #> P rlang 1.0.6 2022-09-24 [?] CRAN (R 4.2.0) #> P rmarkdown 2.17 2022-10-07 [?] CRAN (R 4.2.0) #> P RSQLite 2.2.18 2022-10-04 [?] CRAN (R 4.2.0) #> P rstudioapi 0.14 2022-08-22 [?] CRAN (R 4.2.0) #> P S4Vectors * 0.34.0 2022-04-26 [?] Bioconductor #> P sessioninfo 1.2.2 2021-12-06 [?] CRAN (R 4.2.0) #> P shiny 1.7.2 2022-07-19 [?] CRAN (R 4.2.0) #> P SingleCellExperiment * 1.18.1 2022-10-02 [?] Bioconductor #> P sparseMatrixStats 1.8.0 2022-04-26 [?] Bioconductor #> P stringi 1.7.8 2022-07-11 [?] CRAN (R 4.2.0) #> P stringr 1.4.1 2022-08-20 [?] CRAN (R 4.2.0) #> styler 1.7.0 2022-03-13 [3] CRAN (R 4.2.0) #> P SummarizedExperiment * 1.26.1 2022-05-01 [?] Bioconductor #> P tibble 3.1.8 2022-07-22 [?] CRAN (R 4.2.0) #> P tidyselect 1.2.0 2022-10-10 [?] CRAN (R 4.2.1) #> P utf8 1.2.2 2021-07-24 [?] CRAN (R 4.2.0) #> P vctrs 0.4.2 2022-09-29 [?] CRAN (R 4.2.0) #> P withr 2.5.0 2022-03-03 [?] CRAN (R 4.2.0) #> P xfun 0.33 2022-09-12 [?] CRAN (R 4.2.0) #> P XML * 3.99-0.11 2022-10-03 [?] CRAN (R 4.2.0) #> P xtable 1.8-4 2019-04-21 [?] CRAN (R 4.2.0) #> P XVector 0.36.0 2022-04-26 [?] Bioconductor #> P yaml 2.3.5 2022-02-21 [?] CRAN (R 4.2.0) #> P zlibbioc 1.42.0 2022-04-26 [?] Bioconductor #> #> [1] /Users/Peter/Library/Caches/org.R-project.R/R/renv/library/OSCA.basic-52015504/R-4.2/x86_64-apple-darwin17.0 #> [2] /Users/Peter/GitHub/OSCA.basic/renv/sandbox/R-4.2/x86_64-apple-darwin17.0/84ba8b13 #> [3] /Library/Frameworks/R.framework/Versions/4.2/Resources/library #> #> P ── Loaded and on-disk path mismatch. #> #> ────────────────────────────────────────────────────────────────────────────── ```

PeteHaitch commented 2 years ago

To summarise (and apologies if you've already reached these conclusions):

There are some changes to the outputs in the 'Cell type annotation' chapter that are independent of AUCell. These are minor and I think are fine to ignore.
There were changes made to AUCell in the current release cycle that mean we now get different plots. Perhaps this is correcting previously wrong outputs, I don't now and don't have time to dig into. But this is independent of the issue with AUCell::AUCell_buildRankings().
AUCell::AUCell_buildRankings() is unreliable. a. It gives different results for dense and sparse inputs. b. It does not automatically/gracefully handle sparse inputs as it should.

(2) is annoying, but I'm inclined to ignore it. @Alanocallaghan Please let me know your thoughts on ignoring (1) and (2).

(3a) is definitely a problem, but the AUCell authors don't seem inclined to fix this. My assumption (hope?) is that the calculations for dense inputs are correct/reliable, so I'm inclined to use those.

(3b) is the killer and is what is preventing a clean build of OSCA.basic in both BioC 3.15 and 3.16.

I think our options to fix (3b) are:

Remove usage of AUCell from OSCA, as proposed/implemented in https://github.com/OSCA-source/OSCA.basic/pull/7. This also means (2) and (3b) are no longer OSCA's problem.
Explicitly coerce inputs to AUCell::AUCell_buildRankings() to a dense matrix via as.matrix().
Add splitByBlocks to the problematic call to AUCell::AUCell_buildRankings().

I'm inclined to go with (2) and perhaps move to (1) during the next development cycle but could be swayed another way.

@Alanocallaghan Please let me know your thoughts.

alanocallaghan commented 2 years ago

Very thorough thanks Peter. I think you're right on all points; I am a bit worried by conclusion 3) but I don't really like removing a section without replacing it. As such I think option 2) for now and to consider option 1) next release is a good choice. Particularly as it would give us more time to evaluate options to replace it.

PeteHaitch commented 2 years ago

Thanks Alan. I've made the change in BioC 3.15 (https://github.com/OSCA-source/OSCA.basic/commit/6d09e8706d984e17ee1c24b09eef428a2e153585) and BioC 3.16 (4fb456a127ae25ee9dfb051a3d95ddbe1585c870). It passes for me on 3.15 and I've just started a check on 3.16 on my machine (macOS Intel) but I'm heading to bed so I won't know until the morning. But I'm fairly hopeful it'll pass there, too.

In any case, I've pushed both to the BioC git server with a version bump to trigger a build. Fingers crossed!

PeteHaitch commented 2 years ago

Following my overnight check on my laptop I found that another small tweak was required in BioC 3.16 (e0b0c9a110a92e85eec75acfe61e7997069b1504). I've now pushed this to BioC git server with a version bump. I'll keep an eye on the build reports over the weekend.

PeteHaitch commented 2 years ago

Success! Now building and passing checks in BioC 3.16 as well as BioC 3.15

OSCA-source / OSCA.basic