emptyDrops BioC 3.11 vs 3.12: different results with the OSCA pbmc 4k dataset

lcolladotor commented 3 years ago

Hi Aaron et al,

We (@Yalbibalderas @AnaBVA) noticed that the following code from OSCA returns different t-SNE plots. Eventually we narrowed down the issue to BioC 3.11 vs 3.12 (3.13 is the same as 3.12). In particular, the issue is with changes between those versions for DropletUtils.

The path to DropletUtils is documented at https://twitter.com/lcolladotor/status/1425242252872409092?s=20.

You can reproduce this with:

# Usemos datos de pbmc4k
library(BiocFileCache)
bfc <- BiocFileCache()
raw.path <- bfcrpath(bfc, file.path(
    "http://cf.10xgenomics.com/samples",
    "cell-exp/2.1.0/pbmc4k/pbmc4k_raw_gene_bc_matrices.tar.gz"
))
untar(raw.path, exdir = file.path(tempdir(), "pbmc4k"))

library(DropletUtils)
library(Matrix)
fname <- file.path(tempdir(), "pbmc4k/raw_gene_bc_matrices/GRCh38")
sce.pbmc <- read10xCounts(fname, col.names = TRUE)

library(scater)
rownames(sce.pbmc) <- uniquifyFeatureNames(
    rowData(sce.pbmc)$ID, rowData(sce.pbmc)$Symbol
)

# Detección de _droplets_ con células
set.seed(100)
e.out <- emptyDrops(counts(sce.pbmc))

table(e.out$FDR < 0.001, useNA = "ifany")

where BioC 3.11 returns

> table(e.out$FDR < 0.001, useNA = "ifany")

 FALSE   TRUE   <NA> 
  1056   4233 731991

and later versions return

> table(e.out$FDR < 0.001, useNA = "ifany")

 FALSE   TRUE   <NA> 
   989   4300 731991

The full code is below for getting to the tSNE (adapted from https://comunidadbioinfo.github.io/cdsb2021_scRNAseq/anotaci%C3%B3n-de-clusters-de-c%C3%A9lulas.html and ultimately from OSCA at https://bioconductor.org/books/release/OSCA/unfiltered-human-pbmcs-10x-genomics.html) is below.

```R # Usemos datos de pbmc4k library(BiocFileCache) bfc <- BiocFileCache() raw.path <- bfcrpath(bfc, file.path( "http://cf.10xgenomics.com/samples", "cell-exp/2.1.0/pbmc4k/pbmc4k_raw_gene_bc_matrices.tar.gz" )) untar(raw.path, exdir = file.path(tempdir(), "pbmc4k")) library(DropletUtils) library(Matrix) fname <- file.path(tempdir(), "pbmc4k/raw_gene_bc_matrices/GRCh38") sce.pbmc <- read10xCounts(fname, col.names = TRUE) library(scater) rownames(sce.pbmc) <- uniquifyFeatureNames( rowData(sce.pbmc)$ID, rowData(sce.pbmc)$Symbol ) # Detección de _droplets_ con células set.seed(100) e.out <- emptyDrops(counts(sce.pbmc)) table(e.out$FDR < 0.001, useNA = "ifany") library(EnsDb.Hsapiens.v86) location <- mapIds(EnsDb.Hsapiens.v86, keys = rowData(sce.pbmc)$ID, column = "SEQNAME", keytype = "GENEID" ) sce.pbmc <- sce.pbmc[, which(e.out$FDR <= 0.001)] # Control de calidad stats <- perCellQCMetrics(sce.pbmc, subsets = list(Mito = which(location == "MT")) ) high.mito <- isOutlier(stats$subsets_Mito_percent, type = "higher" ) sce.pbmc <- sce.pbmc[, !high.mito] # Normalización de los datos library(scran) set.seed(1000) clusters <- quickCluster(sce.pbmc) sce.pbmc <- computeSumFactors(sce.pbmc, cluster = clusters) sce.pbmc <- logNormCounts(sce.pbmc) ## Identificación de genes altamente variables set.seed(1001) dec.pbmc <- modelGeneVarByPoisson(sce.pbmc) top.pbmc <- getTopHVGs(dec.pbmc, prop = 0.1) ## Reducción de dimensiones set.seed(10000) sce.pbmc <- denoisePCA(sce.pbmc, subset.row = top.pbmc, technical = dec.pbmc ) set.seed(100000) sce.pbmc <- runTSNE(sce.pbmc, dimred = "PCA") set.seed(1000000) sce.pbmc <- runUMAP(sce.pbmc, dimred = "PCA") plotTSNE(sce.pbmc) options(width = 120) sessioninfo::session_info() ```

We think that it comes down to DropletUtils::emptyDrops() and between April 27 and October 27 2020 (when BioC 3.12 was bioc-devel), there's 2 commits that altered it, in particular we think that https://github.com/MarioniLab/DropletUtils/commit/71b161e44c61efa96226e04adef32a58d4c190db might be the root of the difference.

Based on the history, we see that commit was early in the BioC 3.12 devel cycle. With that in mind, I did a chimera installation with BioC 3.12 and the BioC 3.11 version of DropletUtils, specifically https://github.com/MarioniLab/DropletUtils/commit/51d00b01dc8d65266bd00232b0420d893d52f6ea (remotes::install_github("MarioniLab/DropletUtils@51d00b0")). That version reproduced the t-SNE and DropletUtils::emptyDrops() we see with BioC 3.11.

```R > options(width = 120) main* > sessioninfo::session_info() ─ Session info ─────────────────────────────────────────────────────────────────────────────────────────────────────── setting value version R version 4.0.5 Patched (2021-03-31 r80247) os macOS Big Sur 11.5.1 system x86_64, darwin17.0 ui RStudio language (EN) collate en_US.UTF-8 ctype en_US.UTF-8 tz America/New_York date 2021-08-10 ─ Packages ─────────────────────────────────────────────────────────────────────────────────────────────────────────── package * version date lib source AnnotationDbi * 1.52.0 2020-10-27 [1] Bioconductor AnnotationFilter * 1.14.0 2020-10-27 [1] Bioconductor askpass 1.1 2019-01-13 [1] CRAN (R 4.0.2) assertthat 0.2.1 2019-03-21 [1] CRAN (R 4.0.2) backports 1.2.1 2020-12-09 [1] CRAN (R 4.0.2) beachmat 2.6.4 2020-12-20 [1] Bioconductor beeswarm 0.3.1 2021-03-07 [1] CRAN (R 4.0.2) Biobase * 2.50.0 2020-10-27 [1] Bioconductor BiocFileCache * 1.14.0 2020-10-27 [1] Bioconductor BiocGenerics * 0.36.1 2021-04-16 [1] Bioconductor BiocNeighbors 1.8.2 2020-12-07 [1] Bioconductor BiocParallel 1.24.1 2020-11-06 [1] Bioconductor BiocSingular 1.6.0 2020-10-27 [1] Bioconductor biocthis 1.0.10 2021-02-26 [1] Bioconductor biomaRt 2.46.3 2021-02-09 [1] Bioconductor Biostrings 2.58.0 2020-10-27 [1] Bioconductor bit 4.0.4 2020-08-04 [1] CRAN (R 4.0.2) bit64 4.0.5 2020-08-30 [1] CRAN (R 4.0.2) bitops 1.0-7 2021-04-24 [1] CRAN (R 4.0.2) blob 1.2.1 2020-01-20 [1] CRAN (R 4.0.2) bluster 1.0.0 2020-10-27 [1] Bioconductor bookdown 0.22 2021-04-22 [1] CRAN (R 4.0.2) cachem 1.0.4 2021-02-13 [1] CRAN (R 4.0.2) callr 3.7.0 2021-04-20 [1] CRAN (R 4.0.2) cli 2.5.0 2021-04-26 [1] CRAN (R 4.0.5) colorout 1.2-2 2020-11-03 [1] Github (jalvesaq/colorout@726d681) colorspace 2.0-0 2020-11-11 [1] CRAN (R 4.0.2) cowplot 1.1.1 2020-12-30 [1] CRAN (R 4.0.2) crayon 1.4.1 2021-02-08 [1] CRAN (R 4.0.2) curl 4.3.1 2021-04-30 [1] CRAN (R 4.0.2) data.table 1.14.0 2021-02-21 [1] CRAN (R 4.0.2) DBI 1.1.1 2021-01-15 [1] CRAN (R 4.0.2) dbplyr * 2.1.1 2021-04-06 [1] CRAN (R 4.0.2) DelayedArray 0.16.3 2021-03-24 [1] Bioconductor DelayedMatrixStats 1.12.3 2021-02-03 [1] Bioconductor desc 1.3.0 2021-03-05 [1] CRAN (R 4.0.2) devtools * 2.4.0 2021-04-07 [1] CRAN (R 4.0.2) digest 0.6.27 2020-10-24 [1] CRAN (R 4.0.2) dplyr 1.0.5 2021-03-05 [1] CRAN (R 4.0.2) dqrng 0.3.0 2021-05-01 [1] CRAN (R 4.0.5) DropletUtils * 1.9.0 2021-08-11 [1] Github (MarioniLab/DropletUtils@51d00b0) edgeR 3.32.1 2021-01-14 [1] Bioconductor ellipsis 0.3.2 2021-04-29 [1] CRAN (R 4.0.2) EnsDb.Hsapiens.v86 * 2.99.0 2021-08-11 [1] Bioconductor ensembldb * 2.14.1 2021-04-19 [1] Bioconductor evaluate 0.14 2019-05-28 [1] CRAN (R 4.0.1) fansi 0.4.2 2021-01-15 [1] CRAN (R 4.0.2) farver 2.1.0 2021-02-28 [1] CRAN (R 4.0.2) fastmap 1.1.0 2021-01-25 [1] CRAN (R 4.0.2) FNN 1.1.3 2019-02-15 [1] CRAN (R 4.0.2) fs 1.5.0 2020-07-31 [1] CRAN (R 4.0.2) generics 0.1.0 2020-10-31 [1] CRAN (R 4.0.2) GenomeInfoDb * 1.26.7 2021-04-08 [1] Bioconductor GenomeInfoDbData 1.2.4 2020-11-03 [1] Bioconductor GenomicAlignments 1.26.0 2020-10-27 [1] Bioconductor GenomicFeatures * 1.42.3 2021-04-04 [1] Bioconductor GenomicRanges * 1.42.0 2020-10-27 [1] Bioconductor ggbeeswarm 0.6.0 2017-08-07 [1] CRAN (R 4.0.2) ggplot2 * 3.3.3 2020-12-30 [1] CRAN (R 4.0.2) glue 1.4.2 2020-08-27 [1] CRAN (R 4.0.2) gridExtra 2.3 2017-09-09 [1] CRAN (R 4.0.2) gtable 0.3.0 2019-03-25 [1] CRAN (R 4.0.2) HDF5Array 1.18.1 2021-02-04 [1] Bioconductor hms 1.0.0 2021-01-13 [1] CRAN (R 4.0.2) htmltools 0.5.1.1 2021-01-22 [1] CRAN (R 4.0.2) httr 1.4.2 2020-07-20 [1] CRAN (R 4.0.2) igraph 1.2.6 2020-10-06 [1] CRAN (R 4.0.2) IRanges * 2.24.1 2020-12-12 [1] Bioconductor irlba 2.3.3 2019-02-05 [1] CRAN (R 4.0.2) knitr 1.33 2021-04-24 [1] CRAN (R 4.0.2) labeling 0.4.2 2020-10-20 [1] CRAN (R 4.0.2) lattice 0.20-44 2021-05-02 [1] CRAN (R 4.0.5) lazyeval 0.2.2 2019-03-15 [1] CRAN (R 4.0.2) lifecycle 1.0.0 2021-02-15 [1] CRAN (R 4.0.2) limma 3.46.0 2020-10-27 [1] Bioconductor locfit 1.5-9.4 2020-03-25 [1] CRAN (R 4.0.2) lubridate 1.7.10 2021-02-26 [1] CRAN (R 4.0.2) magrittr 2.0.1 2020-11-17 [1] CRAN (R 4.0.2) Matrix * 1.3-2 2021-01-06 [1] CRAN (R 4.0.5) MatrixGenerics * 1.2.1 2021-01-30 [1] Bioconductor matrixStats * 0.58.0 2021-01-29 [1] CRAN (R 4.0.2) memoise 2.0.0 2021-01-26 [1] CRAN (R 4.0.3) munsell 0.5.0 2018-06-12 [1] CRAN (R 4.0.2) openssl 1.4.4 2021-04-30 [1] CRAN (R 4.0.2) pillar 1.6.0 2021-04-13 [1] CRAN (R 4.0.2) pkgbuild 1.2.0 2020-12-15 [1] CRAN (R 4.0.3) pkgconfig 2.0.3 2019-09-22 [1] CRAN (R 4.0.2) pkgload 1.2.1 2021-04-06 [1] CRAN (R 4.0.2) prettyunits 1.1.1 2020-01-24 [1] CRAN (R 4.0.2) processx 3.5.2 2021-04-30 [1] CRAN (R 4.0.2) progress 1.2.2 2019-05-16 [1] CRAN (R 4.0.2) prompt 1.0.1 2021-03-12 [1] CRAN (R 4.0.2) ProtGenerics 1.22.0 2020-10-27 [1] Bioconductor ps 1.6.0 2021-02-28 [1] CRAN (R 4.0.2) purrr 0.3.4 2020-04-17 [1] CRAN (R 4.0.2) R.methodsS3 1.8.1 2020-08-26 [1] CRAN (R 4.0.2) R.oo 1.24.0 2020-08-26 [1] CRAN (R 4.0.2) R.utils 2.10.1 2020-08-26 [1] CRAN (R 4.0.2) R6 2.5.0 2020-10-28 [1] CRAN (R 4.0.2) rappdirs 0.3.3 2021-01-31 [1] CRAN (R 4.0.2) Rcpp 1.0.6 2021-01-15 [1] CRAN (R 4.0.2) RCurl 1.98-1.3 2021-03-16 [1] CRAN (R 4.0.2) remotes 2.3.0 2021-04-01 [1] CRAN (R 4.0.2) rhdf5 2.34.0 2020-10-27 [1] Bioconductor rhdf5filters 1.2.0 2020-10-27 [1] Bioconductor Rhdf5lib 1.12.1 2021-01-26 [1] Bioconductor rlang 0.4.11 2021-04-30 [1] CRAN (R 4.0.2) rmarkdown 2.7 2021-02-19 [1] CRAN (R 4.0.4) rprojroot 2.0.2 2020-11-15 [1] CRAN (R 4.0.2) Rsamtools 2.6.0 2020-10-27 [1] Bioconductor RSpectra 0.16-0 2019-12-01 [1] CRAN (R 4.0.2) RSQLite 2.2.7 2021-04-22 [1] CRAN (R 4.0.2) rsthemes 0.1.0 2020-11-03 [1] Github (gadenbuie/rsthemes@6391fe5) rstudioapi 0.13 2020-11-12 [1] CRAN (R 4.0.2) rsvd 1.0.5 2021-04-16 [1] CRAN (R 4.0.2) rtracklayer 1.50.0 2020-10-27 [1] Bioconductor Rtsne 0.15 2018-11-10 [1] CRAN (R 4.0.2) S4Vectors * 0.28.1 2020-12-09 [1] Bioconductor scales 1.1.1 2020-05-11 [1] CRAN (R 4.0.2) scater * 1.18.6 2021-02-26 [1] Bioconductor scran * 1.18.7 2021-04-16 [1] Bioconductor scuttle 1.0.4 2020-12-17 [1] Bioconductor sessioninfo 1.1.1 2018-11-05 [1] CRAN (R 4.0.2) SingleCellExperiment * 1.12.0 2020-10-27 [1] Bioconductor sparseMatrixStats 1.2.1 2021-02-02 [1] Bioconductor statmod 1.4.35 2020-10-19 [1] CRAN (R 4.0.2) stringi 1.5.3 2020-09-09 [1] CRAN (R 4.0.2) stringr 1.4.0 2019-02-10 [1] CRAN (R 4.0.2) styler 1.4.1 2021-03-30 [1] CRAN (R 4.0.2) SummarizedExperiment * 1.20.0 2020-10-27 [1] Bioconductor suncalc 0.5.0 2019-04-03 [1] CRAN (R 4.0.2) testthat * 3.0.2 2021-02-14 [1] CRAN (R 4.0.2) tibble 3.1.1 2021-04-18 [1] CRAN (R 4.0.2) tidyselect 1.1.1 2021-04-30 [1] CRAN (R 4.0.2) usethis * 2.0.1 2021-02-10 [1] CRAN (R 4.0.2) utf8 1.2.1 2021-03-12 [1] CRAN (R 4.0.2) uwot 0.1.10 2020-12-15 [1] CRAN (R 4.0.2) vctrs 0.3.8 2021-04-29 [1] CRAN (R 4.0.2) vipor 0.4.5 2017-03-22 [1] CRAN (R 4.0.2) viridis 0.6.0 2021-04-15 [1] CRAN (R 4.0.2) viridisLite 0.4.0 2021-04-13 [1] CRAN (R 4.0.2) withr 2.4.2 2021-04-18 [1] CRAN (R 4.0.2) xfun 0.22 2021-03-11 [1] CRAN (R 4.0.2) XML 3.99-0.6 2021-03-16 [1] CRAN (R 4.0.2) xml2 1.3.2 2020-04-23 [1] CRAN (R 4.0.2) XVector 0.30.0 2020-10-28 [1] Bioconductor yaml 2.2.1 2020-02-01 [1] CRAN (R 4.0.2) zlibbioc 1.36.0 2020-10-28 [1] Bioconductor [1] /Library/Frameworks/R.framework/Versions/4.0/Resources/library main* > ```

I tried installing this version of DropletUtils https://github.com/MarioniLab/DropletUtils/commit/71b161e44c61efa96226e04adef32a58d4c190db but I can't due some compilation errors related to scuttle, which makes sense since in BioC 3.12 you changed many packages. I also tried a chimera of BioC 3.11 and upgrading DropletUtils to BioC 3.12 versions (with remotes::install_github()) around the time of the suspected root commit and related packages, but well, I couldn't get it to work tonight.

In any case, do you think that there's a way to get the same emptyDrops() results with current versions of DropletUtils as those we were getting back in BioC 3.11? My sense is that the answer is no since the internals of the function changed. Aka, it's not like a parameter changed and we can just change it from say FALSE to TRUE, etc.

All of this arose due to how in BioC 3.11, the clusters of CD4 and CD8 T-cells looked nicely separated in the t-SNE and don't with current versions as shown below.

Hmm. We can't figure out why these two plots are different with the same code, though different R versions.

BioC 3.11? @PeteHaitch https://t.co/8DJM6pagdj

BioC 3.13https://t.co/t8KiPTyU9D @Bioconductor #rstats @yalbi_ibm @AnaBetty2304 pic.twitter.com/P65u7dnVCo
— 🇲🇽 Leonardo Collado-Torres (@lcolladotor) August 10, 2021

And well, we wanted to be able that question when we teach that part of OSCA on Friday morning. Best, Leo

lcolladotor commented 3 years ago

From my embedded tweet that doesn't show up properly on GitHub

Peter Hickey's slides, likely made with BioC 3.11

BioC 3.13

I haven't tried the whole thing up to annotating the cell clusters with BioC 3.12, but the shape of the t-SNE is the same as in BioC 3.13.

PeteHaitch commented 3 years ago

FWIW I'm pretty certain figures in my slides are just copy+pasted from the version of OSCA at the time I made the slides.

lcolladotor commented 3 years ago

Thanks for the info Pete!

Here's a verification that the t-SNE shape issue is due to DropletUtils::emptyDrops() since saving the e.out <- emptyDrops() output in BioC 3.11 and using it with 3.13 returns the same shape of the t-SNE.

The annotated clusters and all that are different, likely due to other changes.

e.out_BioC3.11.rds.zip

```R # Usemos datos de pbmc4k library(BiocFileCache) bfc <- BiocFileCache() raw.path <- bfcrpath(bfc, file.path( "http://cf.10xgenomics.com/samples", "cell-exp/2.1.0/pbmc4k/pbmc4k_raw_gene_bc_matrices.tar.gz" )) untar(raw.path, exdir = file.path(tempdir(), "pbmc4k")) library(DropletUtils) library(Matrix) fname <- file.path(tempdir(), "pbmc4k/raw_gene_bc_matrices/GRCh38") sce.pbmc <- read10xCounts(fname, col.names = TRUE) library(scater) rownames(sce.pbmc) <- uniquifyFeatureNames( rowData(sce.pbmc)$ID, rowData(sce.pbmc)$Symbol ) # Detección de _droplets_ con células set.seed(100) # e.out <- emptyDrops(counts(sce.pbmc)) e.out <- readRDS("e.out_BioC3.11.rds") table(e.out$FDR < 0.001, useNA = "ifany") library(EnsDb.Hsapiens.v86) location <- mapIds(EnsDb.Hsapiens.v86, keys = rowData(sce.pbmc)$ID, column = "SEQNAME", keytype = "GENEID" ) sce.pbmc <- sce.pbmc[, which(e.out$FDR <= 0.001)] # Control de calidad stats <- perCellQCMetrics(sce.pbmc, subsets = list(Mito = which(location == "MT")) ) high.mito <- isOutlier(stats$subsets_Mito_percent, type = "higher" ) sce.pbmc <- sce.pbmc[, !high.mito] # Normalización de los datos library(scran) set.seed(1000) clusters <- quickCluster(sce.pbmc) sce.pbmc <- computeSumFactors(sce.pbmc, cluster = clusters) sce.pbmc <- logNormCounts(sce.pbmc) ## Identificación de genes altamente variables set.seed(1001) dec.pbmc <- modelGeneVarByPoisson(sce.pbmc) top.pbmc <- getTopHVGs(dec.pbmc, prop = 0.1) ## Reducción de dimensiones set.seed(10000) sce.pbmc <- denoisePCA(sce.pbmc, subset.row = top.pbmc, technical = dec.pbmc ) set.seed(100000) sce.pbmc <- runTSNE(sce.pbmc, dimred = "PCA") set.seed(1000000) sce.pbmc <- runUMAP(sce.pbmc, dimred = "PCA") plotTSNE(sce.pbmc) # clustering g <- buildSNNGraph(sce.pbmc, k = 10, use.dimred = "PCA") clust <- igraph::cluster_walktrap(g)$membership sce.pbmc$cluster <- factor(clust) library(celldex) ref <- celldex::BlueprintEncodeData() library(SingleR) pred <- SingleR( test = sce.pbmc, ref = ref, labels = ref$label.main ) sce.pbmc$labels <- pred$labels plotTSNE(sce.pbmc, colour_by = "labels", text_by = "labels") ```

Left: BioC 3.13 with e.out from BioC 3.11 Middle: Pete's slides, likely with BioC 3.11 Right: BioC 3.13

lcolladotor commented 3 years ago

Actually, the colors was due to

plotTSNE(sce.pbmc, colour_by = "labels", text_by = "labels")

vs

plotTSNE(sce.pbmc, colour_by = "cluster", text_by = "labels")

Left: BioC 3.13 with e.out from BioC 3.11 Right: Pete's slides

Show below is the difference between using BioC 3.13 with the old output from DropletUtils::emptyDrops(), which matches the BioC 3.11 output, and the new BioC 3.13 output. In terms of explaining methods, I don't think that the BioC 3.13 output is bad in any way. So we could potentially close this issue. Though in terms of reproducing DropletUtils::emptyDrops() output from BioC 3.11 with newer versions, well, that's a different story. My guess is that the answer is that it simply can't, like I said earlier.

Left: BioC 3.13 with e.out from BioC 3.111 Middle: Pete's slides with BioC 3.11 Right: BioC 3.13 only

Yalbibalderas commented 3 years ago

Ohh thanks!!

El mar., 10 de agosto de 2021 22:25, Leonardo Collado-Torres < @.***> escribió:

Actually, the colors was due to

plotTSNE(sce.pbmc, colour_by = "labels", text_by = "labels")

vs

plotTSNE(sce.pbmc, colour_by = "cluster", text_by = "labels")

[image: Screen Shot 2021-08-10 at 11 10 30 PM] https://user-images.githubusercontent.com/2288213/128963828-314c8272-bc68-488a-877b-1356e2aa7a44.png

Left: BioC 3.13 with e.out from BioC 3.11 Right: Pete's slides

Show below is the difference between using BioC 3.13 with the old output from DropletUtils::emptyDrops(), which matches the BioC 3.11 output, and the new BioC 3.13 output. In terms of explaining methods, I don't think that the BioC 3.13 output is bad in any way. So we could potentially close this issue. Though in terms of reproducing DropletUtils::emptyDrops() output from BioC 3.11 with newer versions, well, that's a different story. My guess is that the answer is that it simply can't, like I said earlier.

[image: Screen Shot 2021-08-10 at 11 20 16 PM] https://user-images.githubusercontent.com/2288213/128964570-6873f9c5-af6e-41bf-b326-23224e5b0302.png

Left: BioC 3.13 with e.out from BioC 3.111 Middle: Pete's slides with BioC 3.11 Right: BioC 3.13 only

— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub https://github.com/MarioniLab/DropletUtils/issues/67#issuecomment-896470899, or unsubscribe https://github.com/notifications/unsubscribe-auth/ABUHQZXGRU6IS2FGRQOTYIDT4HUS5ANCNFSM5B5IWV7A . Triage notifications on the go with GitHub Mobile for iOS https://apps.apple.com/app/apple-store/id1477376905?ct=notification-email&mt=8&pt=524675 or Android https://play.google.com/store/apps/details?id=com.github.android&utm_campaign=notification-email .

LTLA commented 3 years ago

Well, if your scope is "somewhere between April and Oct 2020", then the more impactful change is that of #48, which adjusts the spline fit and presumably shifts the retain threshold a bit. You could attempt reverting this change by setting barcode.args=list(exclude.from=0), see the argument description in ?barcodeRanks.

lcolladotor commented 3 years ago

I see, thanks Aaron. I wasn't aware of that change, though well, with BioC 3.13 I get the same results with and without barcode.args=list(exclude.from=0). That is:

> e.out <- emptyDrops(counts(sce.pbmc), barcode.args=list(exclude.from=0))
> table(e.out$FDR < 0.001, useNA = "ifany")

 FALSE   TRUE   <NA> 
   989   4300 731991

```R > library(scran) Loading required package: SingleCellExperiment Loading required package: SummarizedExperiment Loading required package: MatrixGenerics Loading required package: matrixStats Attaching package: ‘MatrixGenerics’ The following objects are masked from ‘package:matrixStats’: colAlls, colAnyNAs, colAnys, colAvgsPerRowSet, colCollapse, colCounts, colCummaxs, colCummins, colCumprods, colCumsums, colDiffs, colIQRDiffs, colIQRs, colLogSumExps, colMadDiffs, colMads, colMaxs, colMeans2, colMedians, colMins, colOrderStats, colProds, colQuantiles, colRanges, colRanks, colSdDiffs, colSds, colSums2, colTabulates, colVarDiffs, colVars, colWeightedMads, colWeightedMeans, colWeightedMedians, colWeightedSds, colWeightedVars, rowAlls, rowAnyNAs, rowAnys, rowAvgsPerColSet, rowCollapse, rowCounts, rowCummaxs, rowCummins, rowCumprods, rowCumsums, rowDiffs, rowIQRDiffs, rowIQRs, rowLogSumExps, rowMadDiffs, rowMads, rowMaxs, rowMeans2, rowMedians, rowMins, rowOrderStats, rowProds, rowQuantiles, rowRanges, rowRanks, rowSdDiffs, rowSds, rowSums2, rowTabulates, rowVarDiffs, rowVars, rowWeightedMads, rowWeightedMeans, rowWeightedMedians, rowWeightedSds, rowWeightedVars Loading required package: GenomicRanges Loading required package: stats4 Loading required package: BiocGenerics Loading required package: parallel Attaching package: ‘BiocGenerics’ The following objects are masked from ‘package:parallel’: clusterApply, clusterApplyLB, clusterCall, clusterEvalQ, clusterExport, clusterMap, parApply, parCapply, parLapply, parLapplyLB, parRapply, parSapply, parSapplyLB The following objects are masked from ‘package:stats’: IQR, mad, sd, var, xtabs The following objects are masked from ‘package:base’: anyDuplicated, append, as.data.frame, basename, cbind, colnames, dirname, do.call, duplicated, eval, evalq, Filter, Find, get, grep, grepl, intersect, is.unsorted, lapply, Map, mapply, match, mget, order, paste, pmax, pmax.int, pmin, pmin.int, Position, rank, rbind, Reduce, rownames, sapply, setdiff, sort, table, tapply, union, unique, unsplit, which.max, which.min Loading required package: S4Vectors Attaching package: ‘S4Vectors’ The following objects are masked from ‘package:base’: expand.grid, I, unname Loading required package: IRanges Attaching package: ‘IRanges’ The following object is masked from ‘package:grDevices’: windows Loading required package: GenomeInfoDb Loading required package: Biobase Welcome to Bioconductor Vignettes contain introductory material; view with 'browseVignettes()'. To cite Bioconductor, see 'citation("Biobase")', and for packages 'citation("pkgname")'. Attaching package: ‘Biobase’ The following object is masked from ‘package:MatrixGenerics’: rowMedians The following objects are masked from ‘package:matrixStats’: anyMissing, rowMedians Loading required package: scuttle > library(DropletUtils) > ?emptyDrops > # Usemos datos de pbmc4k > library(BiocFileCache) Loading required package: dbplyr > bfc <- BiocFileCache() > raw.path <- bfcrpath(bfc, file.path( + "http://cf.10xgenomics.com/samples", + "cell-exp/2.1.0/pbmc4k/pbmc4k_raw_gene_bc_matrices.tar.gz" + )) > untar(raw.path, exdir = file.path(tempdir(), "pbmc4k")) > > library(DropletUtils) > library(Matrix) Attaching package: ‘Matrix’ The following object is masked from ‘package:S4Vectors’: expand > fname <- file.path(tempdir(), "pbmc4k/raw_gene_bc_matrices/GRCh38") > sce.pbmc <- read10xCounts(fname, col.names = TRUE) > # Anotación de los genes > library(scater) Loading required package: ggplot2 > rownames(sce.pbmc) <- uniquifyFeatureNames( + rowData(sce.pbmc)$ID, rowData(sce.pbmc)$Symbol + ) > library(EnsDb.Hsapiens.v86) Loading required package: ensembldb Loading required package: GenomicFeatures Loading required package: AnnotationDbi Loading required package: AnnotationFilter Attaching package: ‘AnnotationFilter’ The following object is masked from ‘package:testthat’: not Attaching package: 'ensembldb' The following object is masked from 'package:stats': filter > location <- mapIds(EnsDb.Hsapiens.v86, + keys = rowData(sce.pbmc)$ID, + column = "SEQNAME", keytype = "GENEID" + ) Warning message: Unable to map 144 of 33694 requested IDs. > # Detección de _droplets_ con células > set.seed(100) > e.out <- emptyDrops(counts(sce.pbmc), barcode.args=list(exclude.from=0)) > table(e.out$FDR < 0.001, useNA = "ifany") FALSE TRUE 989 4300 731991 > options(width = 120) > sessioninfo::session_info() - Session info ------------------------------------------------------------------------------------------------------- setting value version R version 4.1.0 alpha (2021-04-20 r80202) os Windows 10 x64 system x86_64, mingw32 ui RStudio language (EN) collate English_United States.1252 ctype English_United States.1252 tz America/New_York date 2021-08-11 - Packages ----------------------------------------------------------------------------------------------------------- ! package * version date lib source AnnotationDbi * 1.54.1 2021-06-08 [1] Bioconductor AnnotationFilter * 1.16.0 2021-05-19 [1] Bioconductor AnnotationHub 3.0.1 2021-06-20 [1] Bioconductor assertthat 0.2.1 2019-03-21 [1] CRAN (R 4.1.0) backports 1.2.1 2020-12-09 [1] CRAN (R 4.1.0) beachmat 2.8.0 2021-05-19 [1] Bioconductor beeswarm 0.4.0 2021-06-01 [1] CRAN (R 4.1.0) Biobase * 2.52.0 2021-05-19 [1] Bioconductor BiocFileCache * 2.0.0 2021-05-19 [1] Bioconductor BiocGenerics * 0.38.0 2021-05-19 [1] Bioconductor BiocIO 1.2.0 2021-05-19 [1] Bioconductor BiocManager 1.30.16 2021-06-15 [1] CRAN (R 4.1.0) BiocNeighbors 1.10.0 2021-05-19 [1] Bioconductor BiocParallel 1.26.1 2021-07-04 [1] Bioconductor BiocSingular 1.8.1 2021-06-08 [1] Bioconductor biocthis 1.2.0 2021-05-19 [1] Bioconductor BiocVersion 3.13.1 2021-03-08 [1] Bioconductor biomaRt 2.48.2 2021-07-01 [1] Bioconductor Biostrings 2.60.2 2021-08-05 [1] Bioconductor bit 4.0.4 2020-08-04 [1] CRAN (R 4.1.0) bit64 4.0.5 2020-08-30 [1] CRAN (R 4.1.0) bitops 1.0-7 2021-04-24 [1] CRAN (R 4.1.0) blob 1.2.2 2021-07-23 [1] CRAN (R 4.1.0) bluster 1.2.1 2021-05-27 [1] Bioconductor bookdown 0.22 2021-04-22 [1] CRAN (R 4.1.0) bslib 0.2.5.1 2021-05-18 [1] CRAN (R 4.1.0) cachem 1.0.5 2021-05-15 [1] CRAN (R 4.1.0) Cairo 1.5-12.2 2020-07-07 [1] CRAN (R 4.1.0) callr 3.7.0 2021-04-20 [1] CRAN (R 4.1.0) celldex 1.2.0 2021-05-20 [1] Bioconductor circlize 0.4.13 2021-06-09 [1] CRAN (R 4.1.0) cli 3.0.1 2021-07-17 [1] CRAN (R 4.1.0) clue 0.3-59 2021-04-16 [1] CRAN (R 4.1.0) cluster 2.1.2 2021-04-17 [1] CRAN (R 4.1.0) codetools 0.2-18 2020-11-04 [1] CRAN (R 4.1.0) colorspace 2.0-2 2021-06-24 [1] CRAN (R 4.1.0) colourpicker 1.1.0 2020-09-14 [1] CRAN (R 4.1.0) ComplexHeatmap 2.8.0 2021-05-19 [1] Bioconductor cowplot 1.1.1 2020-12-30 [1] CRAN (R 4.1.0) crayon 1.4.1 2021-02-08 [1] CRAN (R 4.1.0) curl 4.3.2 2021-06-23 [1] CRAN (R 4.1.0) data.table 1.14.0 2021-02-21 [1] CRAN (R 4.1.0) DBI 1.1.1 2021-01-15 [1] CRAN (R 4.1.0) dbplyr * 2.1.1 2021-04-06 [1] CRAN (R 4.1.0) DelayedArray 0.18.0 2021-05-19 [1] Bioconductor DelayedMatrixStats 1.14.2 2021-08-08 [1] Bioconductor desc 1.3.0 2021-03-05 [1] CRAN (R 4.1.0) devtools * 2.4.2 2021-06-07 [1] CRAN (R 4.1.0) digest 0.6.27 2020-10-24 [1] CRAN (R 4.1.0) doParallel 1.0.16 2020-10-16 [1] CRAN (R 4.1.0) dplyr 1.0.7 2021-06-18 [1] CRAN (R 4.1.0) dqrng 0.3.0 2021-05-01 [1] CRAN (R 4.1.0) DropletUtils * 1.12.2 2021-07-22 [1] Bioconductor DT 0.18 2021-04-14 [1] CRAN (R 4.1.0) edgeR 3.34.0 2021-05-19 [1] Bioconductor ellipsis 0.3.2 2021-04-29 [1] CRAN (R 4.1.0) EnsDb.Hsapiens.v86 * 2.99.0 2021-08-08 [1] Bioconductor ensembldb * 2.16.4 2021-08-05 [1] Bioconductor evaluate 0.14 2019-05-28 [1] CRAN (R 4.1.0) ExperimentHub 2.0.0 2021-05-19 [1] Bioconductor fansi 0.5.0 2021-05-25 [1] CRAN (R 4.1.0) fastmap 1.1.0 2021-01-25 [1] CRAN (R 4.1.0) filelock 1.0.2 2018-10-05 [1] CRAN (R 4.1.0) foreach 1.5.1 2020-10-15 [1] CRAN (R 4.1.0) fs 1.5.0 2020-07-31 [1] CRAN (R 4.1.0) generics 0.1.0 2020-10-31 [1] CRAN (R 4.1.0) GenomeInfoDb * 1.28.1 2021-07-01 [1] Bioconductor GenomeInfoDbData 1.2.6 2021-08-07 [1] Bioconductor GenomicAlignments 1.28.0 2021-05-19 [1] Bioconductor GenomicFeatures * 1.44.0 2021-05-19 [1] Bioconductor GenomicRanges * 1.44.0 2021-05-19 [1] Bioconductor GetoptLong 1.0.5 2020-12-15 [1] CRAN (R 4.1.0) ggbeeswarm 0.6.0 2017-08-07 [1] CRAN (R 4.1.0) ggplot2 * 3.3.5 2021-06-25 [1] CRAN (R 4.1.0) ggrepel 0.9.1 2021-01-15 [1] CRAN (R 4.1.0) GlobalOptions 0.1.2 2020-06-10 [1] CRAN (R 4.1.0) glue 1.4.2 2020-08-27 [1] CRAN (R 4.1.0) gridExtra 2.3 2017-09-09 [1] CRAN (R 4.1.0) gtable 0.3.0 2019-03-25 [1] CRAN (R 4.1.0) HDF5Array 1.20.0 2021-05-19 [1] Bioconductor here 1.0.1 2020-12-13 [1] CRAN (R 4.1.0) hms 1.1.0 2021-05-17 [1] CRAN (R 4.1.0) htmltools 0.5.1.1 2021-01-22 [1] CRAN (R 4.1.0) htmlwidgets 1.5.3 2020-12-10 [1] CRAN (R 4.1.0) httpuv 1.6.1 2021-05-07 [1] CRAN (R 4.1.0) httr 1.4.2 2020-07-20 [1] CRAN (R 4.1.0) igraph 1.2.6 2020-10-06 [1] CRAN (R 4.1.0) interactiveDisplayBase 1.30.0 2021-05-19 [1] Bioconductor IRanges * 2.26.0 2021-05-19 [1] Bioconductor irlba 2.3.3 2019-02-05 [1] CRAN (R 4.1.0) iSEE 2.4.0 2021-05-19 [1] Bioconductor iterators 1.0.13 2020-10-15 [1] CRAN (R 4.1.0) jquerylib 0.1.4 2021-04-26 [1] CRAN (R 4.1.0) jsonlite 1.7.2 2020-12-09 [1] CRAN (R 4.1.0) KEGGREST 1.32.0 2021-05-19 [1] Bioconductor knitr 1.33 2021-04-24 [1] CRAN (R 4.1.0) later 1.2.0 2021-04-23 [1] CRAN (R 4.1.0) lattice 0.20-44 2021-05-02 [1] CRAN (R 4.1.0) lazyeval 0.2.2 2019-03-15 [1] CRAN (R 4.1.0) lifecycle 1.0.0 2021-02-15 [1] CRAN (R 4.1.0) limma 3.48.2 2021-08-08 [1] Bioconductor locfit 1.5-9.4 2020-03-25 [1] CRAN (R 4.1.0) lubridate 1.7.10 2021-02-26 [1] CRAN (R 4.1.0) magrittr 2.0.1 2020-11-17 [1] CRAN (R 4.1.0) Matrix * 1.3-4 2021-06-01 [1] CRAN (R 4.1.0) MatrixGenerics * 1.4.2 2021-08-08 [1] Bioconductor matrixStats * 0.60.0 2021-07-26 [1] CRAN (R 4.1.0) memoise 2.0.0 2021-01-26 [1] CRAN (R 4.1.0) metapod 1.0.0 2021-05-19 [1] Bioconductor mgcv 1.8-36 2021-06-01 [1] CRAN (R 4.1.0) mime 0.11 2021-06-23 [1] CRAN (R 4.1.0) miniUI 0.1.1.1 2018-05-18 [1] CRAN (R 4.1.0) munsell 0.5.0 2018-06-12 [1] CRAN (R 4.1.0) nlme 3.1-152 2021-02-04 [1] CRAN (R 4.1.0) PCAtools 2.4.0 2021-05-19 [1] Bioconductor pillar 1.6.2 2021-07-29 [1] CRAN (R 4.1.0) pkgbuild 1.2.0 2020-12-15 [1] CRAN (R 4.1.0) pkgconfig 2.0.3 2019-09-22 [1] CRAN (R 4.1.0) pkgload 1.2.1 2021-04-06 [1] CRAN (R 4.1.0) plyr 1.8.6 2020-03-03 [1] CRAN (R 4.1.0) png 0.1-7 2013-12-03 [1] CRAN (R 4.1.0) prettyunits 1.1.1 2020-01-24 [1] CRAN (R 4.1.0) processx 3.5.2 2021-04-30 [1] CRAN (R 4.1.0) progress 1.2.2 2019-05-16 [1] CRAN (R 4.1.0) promises 1.2.0.1 2021-02-11 [1] CRAN (R 4.1.0) ProtGenerics 1.24.0 2021-05-19 [1] Bioconductor ps 1.6.0 2021-02-28 [1] CRAN (R 4.1.0) purrr 0.3.4 2020-04-17 [1] CRAN (R 4.1.0) R.methodsS3 1.8.1 2020-08-26 [1] CRAN (R 4.1.0) R.oo 1.24.0 2020-08-26 [1] CRAN (R 4.1.0) R.utils 2.10.1 2020-08-26 [1] CRAN (R 4.1.0) R6 2.5.0 2020-10-28 [1] CRAN (R 4.1.0) rappdirs 0.3.3 2021-01-31 [1] CRAN (R 4.1.0) RColorBrewer 1.1-2 2014-12-07 [1] CRAN (R 4.1.0) Rcpp 1.0.7 2021-07-07 [1] CRAN (R 4.1.0) RCurl 1.98-1.3 2021-03-16 [1] CRAN (R 4.1.0) remotes 2.4.0 2021-06-02 [1] CRAN (R 4.1.0) reshape2 1.4.4 2020-04-09 [1] CRAN (R 4.1.0) restfulr 0.0.13 2017-08-06 [1] CRAN (R 4.1.0) rhdf5 2.36.0 2021-05-19 [1] Bioconductor D rhdf5filters 1.4.0 2021-05-19 [1] Bioconductor Rhdf5lib 1.14.2 2021-07-06 [1] Bioconductor rintrojs 0.3.0 2021-06-06 [1] CRAN (R 4.1.0) rjson 0.2.20 2018-06-08 [1] CRAN (R 4.1.0) rlang 0.4.11 2021-04-30 [1] CRAN (R 4.1.0) rmarkdown 2.10 2021-08-06 [1] CRAN (R 4.1.0) rprojroot 2.0.2 2020-11-15 [1] CRAN (R 4.1.0) Rsamtools 2.8.0 2021-05-19 [1] Bioconductor RSQLite 2.2.7 2021-04-22 [1] CRAN (R 4.1.0) rsthemes 0.2.1.9000 2021-04-22 [1] Github (gadenbuie/rsthemes@19299e5) rstudioapi 0.13 2020-11-12 [1] CRAN (R 4.1.0) rsvd 1.0.5 2021-04-16 [1] CRAN (R 4.1.0) rtracklayer 1.52.0 2021-05-19 [1] Bioconductor S4Vectors * 0.30.0 2021-05-19 [1] Bioconductor sass 0.4.0 2021-05-12 [1] CRAN (R 4.1.0) ScaledMatrix 1.0.0 2021-05-19 [1] Bioconductor scales 1.1.1 2020-05-11 [1] CRAN (R 4.1.0) scater * 1.20.1 2021-06-15 [1] Bioconductor scran * 1.20.1 2021-05-24 [1] Bioconductor scuttle * 1.2.1 2021-08-05 [1] Bioconductor sessioninfo 1.1.1 2018-11-05 [1] CRAN (R 4.1.0) shape 1.4.6 2021-05-19 [1] CRAN (R 4.1.0) shiny 1.6.0 2021-01-25 [1] CRAN (R 4.1.0) shinyAce 0.4.1 2019-09-24 [1] CRAN (R 4.1.0) shinydashboard 0.7.1 2018-10-17 [1] CRAN (R 4.1.0) shinyjs 2.0.0 2020-09-09 [1] CRAN (R 4.1.0) shinyWidgets 0.6.0 2021-03-15 [1] CRAN (R 4.1.0) SingleCellExperiment * 1.14.1 2021-05-21 [1] Bioconductor sparseMatrixStats 1.4.2 2021-08-08 [1] Bioconductor statmod 1.4.36 2021-05-10 [1] CRAN (R 4.1.0) stringi 1.7.3 2021-07-16 [1] CRAN (R 4.1.0) stringr 1.4.0 2019-02-10 [1] CRAN (R 4.1.0) styler 1.5.1 2021-07-13 [1] CRAN (R 4.1.0) SummarizedExperiment * 1.22.0 2021-05-19 [1] Bioconductor suncalc 0.5.0 2019-04-03 [1] CRAN (R 4.1.0) testthat * 3.0.4 2021-07-01 [1] CRAN (R 4.1.0) tibble 3.1.3 2021-07-23 [1] CRAN (R 4.1.0) tidyselect 1.1.1 2021-04-30 [1] CRAN (R 4.1.0) usethis * 2.0.1 2021-02-10 [1] CRAN (R 4.1.0) utf8 1.2.2 2021-07-24 [1] CRAN (R 4.1.0) vctrs 0.3.8 2021-04-29 [1] CRAN (R 4.1.0) vipor 0.4.5 2017-03-22 [1] CRAN (R 4.1.0) viridis 0.6.1 2021-05-11 [1] CRAN (R 4.1.0) viridisLite 0.4.0 2021-04-13 [1] CRAN (R 4.1.0) withr 2.4.2 2021-04-18 [1] CRAN (R 4.1.0) xfun 0.25 2021-08-06 [1] CRAN (R 4.1.0) XML 3.99-0.6 2021-03-16 [1] CRAN (R 4.1.0) xml2 1.3.2 2020-04-23 [1] CRAN (R 4.1.0) xtable 1.8-4 2019-04-21 [1] CRAN (R 4.1.0) XVector 0.32.0 2021-05-19 [1] Bioconductor yaml 2.2.1 2020-02-01 [1] CRAN (R 4.1.0) zlibbioc 1.38.0 2021-05-19 [1] Bioconductor [1] C:/R/R-4.1.0alpha/library D -- DLL MD5 mismatch, broken installation. ```

lcolladotor commented 3 years ago

Under a first inspection, it doesn't seem to be that the following commit is the problem https://github.com/MarioniLab/DropletUtils/commit/71b161e44c61efa96226e04adef32a58d4c190db#diff-cf7a6dd4088835e532ce821c61ceb633ccc99ee3ffe8a812ab3c0a7f46dcaa8b either based on the test below and from looking at https://github.com/LTLA/beachmat/blob/d3e4bafbc840df2d106197d7163c3f3aa6fb1fb1/R/whichNonZero.R#L42-L44.

set.seed(100)
p.n0 <- runif(100)
j <- sample(1:10, 100, TRUE)
mat <- matrix(ncol = 10)
dim(mat)
#> [1]  1 10
max(j)
#> [1] 10

by.col <- aggregate(p.n0, list(Col=j), sum)
obs.P <- numeric(ncol(mat))
obs.P[by.col$Col] <- by.col$x
obs.P
#>  [1]  3.998302  4.518524  2.379189  5.237262  6.077561  4.126877  4.634267
#>  [8]  4.671363  4.275538 12.068117

j_new <- factor(j, levels=seq_len(ncol(mat)))
obs.P_new <- tapply(p.n0, INDEX=j_new, FUN=sum)
obs.P_new <- as.numeric(obs.P_new)
obs.P_new
#>  [1]  3.998302  4.518524  2.379189  5.237262  6.077561  4.126877  4.634267
#>  [8]  4.671363  4.275538 12.068117

testthat::expect_equal(obs.P, obs.P_new)
testthat::expect_equivalent(obs.P, obs.P_new)

^{Created on 2021-08-11 by the reprex package (v2.0.1)}

However, in that case, the full matrix mat has an equal number of columns as the max value of j and all columns of mat have j values.

More complicated case

However, let's say that the full matrix has 15 columns with columns 6 to 10 being all zeros.

## 100 random p-values for 100 random j columns
## in a sparse matrix
set.seed(100)
p.n0 <- runif(100)
## Let's say that we have empty colums 6 through 10
j <- sample(c(1:5, 11:15), 100, TRUE)
mat <- matrix(ncol = max(j))
dim(mat)
#> [1]  1 15

by.col <- aggregate(p.n0, list(Col=j), sum)
obs.P <- numeric(ncol(mat))
obs.P[by.col$Col] <- by.col$x
obs.P
#>  [1]  3.998302  4.518524  2.379189  5.237262  6.077561  0.000000  0.000000
#>  [8]  0.000000  0.000000  0.000000  4.126877  4.634267  4.671363  4.275538
#> [15] 12.068117

j_new <- factor(j, levels=seq_len(ncol(mat)))
obs.P_new <- tapply(p.n0, INDEX=j_new, FUN=sum)
obs.P_new <- as.numeric(obs.P_new)
obs.P_new
#>  [1]  3.998302  4.518524  2.379189  5.237262  6.077561        NA        NA
#>  [8]        NA        NA        NA  4.126877  4.634267  4.671363  4.275538
#> [15] 12.068117

testthat::expect_equal(obs.P, obs.P_new)
#> Error: `obs.P` not equal to `obs.P_new`.
#> 5/15 mismatches (average diff: NaN)
#> [6]  0 - NA == NA
#> [7]  0 - NA == NA
#> [8]  0 - NA == NA
#> [9]  0 - NA == NA
#> [10] 0 - NA == NA
testthat::expect_equivalent(obs.P, obs.P_new)
#> Error: `obs.P` not equivalent to `obs.P_new`.
#> 5/15 mismatches (average diff: NaN)
#> [6]  0 - NA == NA
#> [7]  0 - NA == NA
#> [8]  0 - NA == NA
#> [9]  0 - NA == NA
#> [10] 0 - NA == NA

^{Created on 2021-08-11 by the reprex package (v2.0.1)}

In that case, we end up having NAs in the obs.P_new object. The current version of the code would be https://github.com/MarioniLab/DropletUtils/blob/master/R/emptyDrops.R#L311-L327.

With the data used for this issue, I see that well, we do have lots of non-zero columns as expected.

```R # Usemos datos de pbmc4k library(BiocFileCache) bfc <- BiocFileCache() raw.path <- bfcrpath(bfc, file.path( "http://cf.10xgenomics.com/samples", "cell-exp/2.1.0/pbmc4k/pbmc4k_raw_gene_bc_matrices.tar.gz" )) untar(raw.path, exdir = file.path(tempdir(), "pbmc4k")) library(DropletUtils) library(Matrix) fname <- file.path(tempdir(), "pbmc4k/raw_gene_bc_matrices/GRCh38") sce.pbmc <- read10xCounts(fname, col.names = TRUE) # Anotación de los genes library(scater) rownames(sce.pbmc) <- uniquifyFeatureNames( rowData(sce.pbmc)$ID, rowData(sce.pbmc)$Symbol ) library(beachmat) mat <- counts(sce.pbmc) dim(mat) nonzero <- whichNonZero(mat) length(unique(nonzero$j)) max(nonzero$j) ```

where max(j) corresponds to ncol(mat).

> library(beachmat)
> 
> mat <- counts(sce.pbmc)
> dim(mat)
[1]  33694 737280
> nonzero <- whichNonZero(mat)
> length(unique(nonzero$j))
[1] 272442
> max(nonzero$j)
[1] 737280

lcolladotor commented 3 years ago

pressed the wrong button while drafting my message....

lcolladotor commented 3 years ago

Ok, this does seem to be the issue with the actual data as shown in the reprex below with BioC 3.13.

I imagine that it doesn't matter for most droplets since well, they already had NA p-values in the past.

# Usemos datos de pbmc4k
library(BiocFileCache)
#> Loading required package: dbplyr
bfc <- BiocFileCache()
raw.path <- bfcrpath(bfc, file.path(
    "http://cf.10xgenomics.com/samples",
    "cell-exp/2.1.0/pbmc4k/pbmc4k_raw_gene_bc_matrices.tar.gz"
))
untar(raw.path, exdir = file.path(tempdir(), "pbmc4k"))

library(DropletUtils)
#> Loading required package: SingleCellExperiment
#> Loading required package: SummarizedExperiment
#> Loading required package: MatrixGenerics
#> Loading required package: matrixStats
#> 
#> Attaching package: 'MatrixGenerics'
#> The following objects are masked from 'package:matrixStats':
#> 
#>     colAlls, colAnyNAs, colAnys, colAvgsPerRowSet, colCollapse,
#>     colCounts, colCummaxs, colCummins, colCumprods, colCumsums,
#>     colDiffs, colIQRDiffs, colIQRs, colLogSumExps, colMadDiffs,
#>     colMads, colMaxs, colMeans2, colMedians, colMins, colOrderStats,
#>     colProds, colQuantiles, colRanges, colRanks, colSdDiffs, colSds,
#>     colSums2, colTabulates, colVarDiffs, colVars, colWeightedMads,
#>     colWeightedMeans, colWeightedMedians, colWeightedSds,
#>     colWeightedVars, rowAlls, rowAnyNAs, rowAnys, rowAvgsPerColSet,
#>     rowCollapse, rowCounts, rowCummaxs, rowCummins, rowCumprods,
#>     rowCumsums, rowDiffs, rowIQRDiffs, rowIQRs, rowLogSumExps,
#>     rowMadDiffs, rowMads, rowMaxs, rowMeans2, rowMedians, rowMins,
#>     rowOrderStats, rowProds, rowQuantiles, rowRanges, rowRanks,
#>     rowSdDiffs, rowSds, rowSums2, rowTabulates, rowVarDiffs, rowVars,
#>     rowWeightedMads, rowWeightedMeans, rowWeightedMedians,
#>     rowWeightedSds, rowWeightedVars
#> Loading required package: GenomicRanges
#> Loading required package: stats4
#> Loading required package: BiocGenerics
#> Loading required package: parallel
#> 
#> Attaching package: 'BiocGenerics'
#> The following objects are masked from 'package:parallel':
#> 
#>     clusterApply, clusterApplyLB, clusterCall, clusterEvalQ,
#>     clusterExport, clusterMap, parApply, parCapply, parLapply,
#>     parLapplyLB, parRapply, parSapply, parSapplyLB
#> The following objects are masked from 'package:stats':
#> 
#>     IQR, mad, sd, var, xtabs
#> The following objects are masked from 'package:base':
#> 
#>     anyDuplicated, append, as.data.frame, basename, cbind, colnames,
#>     dirname, do.call, duplicated, eval, evalq, Filter, Find, get, grep,
#>     grepl, intersect, is.unsorted, lapply, Map, mapply, match, mget,
#>     order, paste, pmax, pmax.int, pmin, pmin.int, Position, rank,
#>     rbind, Reduce, rownames, sapply, setdiff, sort, table, tapply,
#>     union, unique, unsplit, which.max, which.min
#> Loading required package: S4Vectors
#> 
#> Attaching package: 'S4Vectors'
#> The following objects are masked from 'package:base':
#> 
#>     expand.grid, I, unname
#> Loading required package: IRanges
#> 
#> Attaching package: 'IRanges'
#> The following object is masked from 'package:grDevices':
#> 
#>     windows
#> Loading required package: GenomeInfoDb
#> Loading required package: Biobase
#> Welcome to Bioconductor
#> 
#>     Vignettes contain introductory material; view with
#>     'browseVignettes()'. To cite Bioconductor, see
#>     'citation("Biobase")', and for packages 'citation("pkgname")'.
#> 
#> Attaching package: 'Biobase'
#> The following object is masked from 'package:MatrixGenerics':
#> 
#>     rowMedians
#> The following objects are masked from 'package:matrixStats':
#> 
#>     anyMissing, rowMedians
library(Matrix)
#> 
#> Attaching package: 'Matrix'
#> The following object is masked from 'package:S4Vectors':
#> 
#>     expand
fname <- file.path(tempdir(), "pbmc4k/raw_gene_bc_matrices/GRCh38")
sce.pbmc <- read10xCounts(fname, col.names = TRUE)

# Anotación de los genes
library(scater)
#> Loading required package: scuttle
#> Loading required package: ggplot2
rownames(sce.pbmc) <- uniquifyFeatureNames(
    rowData(sce.pbmc)$ID, rowData(sce.pbmc)$Symbol
)

library(beachmat)

mat <- counts(sce.pbmc)
dim(mat)
#> [1]  33694 737280
nonzero <- whichNonZero(mat)
length(unique(nonzero$j))
#> [1] 272442
max(nonzero$j)
#> [1] 737280

m <- DropletUtils:::.rounded_to_integer(mat, 0)
nonzero <- whichNonZero(m)

i <- nonzero$i
j <- nonzero$j
x <- nonzero$x

set.seed(100)
p.n0 <- runif(length(j))

by.col <- aggregate(p.n0, list(Col=j), sum)
obs.P <- numeric(ncol(mat))
obs.P[by.col$Col] <- by.col$x
summary(obs.P)
#>      Min.   1st Qu.    Median      Mean   3rd Qu.      Max. 
#>    0.0000    0.0000    0.0000    5.7792    0.7371 2622.1900

j_new <- factor(j, levels=seq_len(ncol(m)))
obs.P_new <- tapply(p.n0, INDEX=j_new, FUN=sum)
obs.P_new <- as.numeric(obs.P_new)
summary(obs.P_new)
#>    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
#>     0.0     0.6     1.3    15.6     9.1  2622.2  464838

testthat::expect_equal(obs.P, obs.P_new)
#> Error: `obs.P` not equal to `obs.P_new`.
#> 464838/737280 mismatches (average diff: NaN)
#> [2]  0 - NA == NA
#> [4]  0 - NA == NA
#> [7]  0 - NA == NA
#> [8]  0 - NA == NA
#> [9]  0 - NA == NA
#> [11] 0 - NA == NA
#> [16] 0 - NA == NA
#> [17] 0 - NA == NA
#> [18] 0 - NA == NA
#> ...
testthat::expect_equivalent(obs.P, obs.P_new)
#> Error: `obs.P` not equivalent to `obs.P_new`.
#> 464838/737280 mismatches (average diff: NaN)
#> [2]  0 - NA == NA
#> [4]  0 - NA == NA
#> [7]  0 - NA == NA
#> [8]  0 - NA == NA
#> [9]  0 - NA == NA
#> [11] 0 - NA == NA
#> [16] 0 - NA == NA
#> [17] 0 - NA == NA
#> [18] 0 - NA == NA
#> ...

^{Created on 2021-08-11 by the reprex package (v2.0.1)}

lcolladotor commented 3 years ago

Adding obs.P[is.na(obs.P)] <- 0 or obs.P[!seq_len(ncol(block)) %in% unique(j)] <- 0 resolves the issue (with object names from https://github.com/MarioniLab/DropletUtils/blob/master/R/emptyDrops.R#L311-L327) in the reprex.

## 100 random p-values for 100 random j columns
## in a sparse matrix
set.seed(100)
p.n0 <- runif(100)

## Let's say that we have empty colums 6 through 10
j <- sample(c(1:5, 11:15), 100, TRUE)
mat <- matrix(ncol = max(j))
dim(mat)
#> [1]  1 15

by.col <- aggregate(p.n0, list(Col=j), sum)
obs.P <- numeric(ncol(mat))
obs.P[by.col$Col] <- by.col$x

j_new <- factor(j, levels=seq_len(ncol(mat)))
obs.P_new <- tapply(p.n0, INDEX=j_new, FUN=sum)
obs.P_new <- as.numeric(obs.P_new)

## This works
# obs.P_new[is.na(obs.P_new)] <- 0
## But well, maybe there are NAs in the original p.n0 already

## Here's a more robust fix to simply add 0
## to the columns with no data
obs.P_new[!seq_len(ncol(mat)) %in% unique(j)] <- 0

data.frame(obs.P, obs.P_new)
#>        obs.P obs.P_new
#> 1   3.998302  3.998302
#> 2   4.518524  4.518524
#> 3   2.379189  2.379189
#> 4   5.237262  5.237262
#> 5   6.077561  6.077561
#> 6   0.000000  0.000000
#> 7   0.000000  0.000000
#> 8   0.000000  0.000000
#> 9   0.000000  0.000000
#> 10  0.000000  0.000000
#> 11  4.126877  4.126877
#> 12  4.634267  4.634267
#> 13  4.671363  4.671363
#> 14  4.275538  4.275538
#> 15 12.068117 12.068117

testthat::expect_equal(obs.P, obs.P_new)
testthat::expect_equivalent(obs.P, obs.P_new)

^{Created on 2021-08-11 by the reprex package (v2.0.1)}

This is true with the full data too.

``` r # Usemos datos de pbmc4k library(BiocFileCache) #> Loading required package: dbplyr bfc <- BiocFileCache() raw.path <- bfcrpath(bfc, file.path( "http://cf.10xgenomics.com/samples", "cell-exp/2.1.0/pbmc4k/pbmc4k_raw_gene_bc_matrices.tar.gz" )) untar(raw.path, exdir = file.path(tempdir(), "pbmc4k")) library(DropletUtils) #> Loading required package: SingleCellExperiment #> Loading required package: SummarizedExperiment #> Loading required package: MatrixGenerics #> Loading required package: matrixStats #> #> Attaching package: 'MatrixGenerics' #> The following objects are masked from 'package:matrixStats': #> #> colAlls, colAnyNAs, colAnys, colAvgsPerRowSet, colCollapse, #> colCounts, colCummaxs, colCummins, colCumprods, colCumsums, #> colDiffs, colIQRDiffs, colIQRs, colLogSumExps, colMadDiffs, #> colMads, colMaxs, colMeans2, colMedians, colMins, colOrderStats, #> colProds, colQuantiles, colRanges, colRanks, colSdDiffs, colSds, #> colSums2, colTabulates, colVarDiffs, colVars, colWeightedMads, #> colWeightedMeans, colWeightedMedians, colWeightedSds, #> colWeightedVars, rowAlls, rowAnyNAs, rowAnys, rowAvgsPerColSet, #> rowCollapse, rowCounts, rowCummaxs, rowCummins, rowCumprods, #> rowCumsums, rowDiffs, rowIQRDiffs, rowIQRs, rowLogSumExps, #> rowMadDiffs, rowMads, rowMaxs, rowMeans2, rowMedians, rowMins, #> rowOrderStats, rowProds, rowQuantiles, rowRanges, rowRanks, #> rowSdDiffs, rowSds, rowSums2, rowTabulates, rowVarDiffs, rowVars, #> rowWeightedMads, rowWeightedMeans, rowWeightedMedians, #> rowWeightedSds, rowWeightedVars #> Loading required package: GenomicRanges #> Loading required package: stats4 #> Loading required package: BiocGenerics #> Loading required package: parallel #> #> Attaching package: 'BiocGenerics' #> The following objects are masked from 'package:parallel': #> #> clusterApply, clusterApplyLB, clusterCall, clusterEvalQ, #> clusterExport, clusterMap, parApply, parCapply, parLapply, #> parLapplyLB, parRapply, parSapply, parSapplyLB #> The following objects are masked from 'package:stats': #> #> IQR, mad, sd, var, xtabs #> The following objects are masked from 'package:base': #> #> anyDuplicated, append, as.data.frame, basename, cbind, colnames, #> dirname, do.call, duplicated, eval, evalq, Filter, Find, get, grep, #> grepl, intersect, is.unsorted, lapply, Map, mapply, match, mget, #> order, paste, pmax, pmax.int, pmin, pmin.int, Position, rank, #> rbind, Reduce, rownames, sapply, setdiff, sort, table, tapply, #> union, unique, unsplit, which.max, which.min #> Loading required package: S4Vectors #> #> Attaching package: 'S4Vectors' #> The following objects are masked from 'package:base': #> #> expand.grid, I, unname #> Loading required package: IRanges #> #> Attaching package: 'IRanges' #> The following object is masked from 'package:grDevices': #> #> windows #> Loading required package: GenomeInfoDb #> Loading required package: Biobase #> Welcome to Bioconductor #> #> Vignettes contain introductory material; view with #> 'browseVignettes()'. To cite Bioconductor, see #> 'citation("Biobase")', and for packages 'citation("pkgname")'. #> #> Attaching package: 'Biobase' #> The following object is masked from 'package:MatrixGenerics': #> #> rowMedians #> The following objects are masked from 'package:matrixStats': #> #> anyMissing, rowMedians library(Matrix) #> #> Attaching package: 'Matrix' #> The following object is masked from 'package:S4Vectors': #> #> expand fname <- file.path(tempdir(), "pbmc4k/raw_gene_bc_matrices/GRCh38") sce.pbmc <- read10xCounts(fname, col.names = TRUE) # Anotación de los genes library(scater) #> Loading required package: scuttle #> Loading required package: ggplot2 rownames(sce.pbmc) <- uniquifyFeatureNames( rowData(sce.pbmc)$ID, rowData(sce.pbmc)$Symbol ) library(beachmat) mat <- counts(sce.pbmc) dim(mat) #> [1] 33694 737280 nonzero <- whichNonZero(mat) length(unique(nonzero$j)) #> [1] 272442 max(nonzero$j) #> [1] 737280 m <- DropletUtils:::.rounded_to_integer(mat, 0) nonzero <- whichNonZero(m) i <- nonzero$i j <- nonzero$j x <- nonzero$x set.seed(100) p.n0 <- runif(length(j)) by.col <- aggregate(p.n0, list(Col=j), sum) obs.P <- numeric(ncol(mat)) obs.P[by.col$Col] <- by.col$x summary(obs.P) #> Min. 1st Qu. Median Mean 3rd Qu. Max. #> 0.0000 0.0000 0.0000 5.7792 0.7371 2622.1900 j_new <- factor(j, levels=seq_len(ncol(m))) obs.P_new <- tapply(p.n0, INDEX=j_new, FUN=sum) obs.P_new <- as.numeric(obs.P_new) obs.P_new[!seq_len(ncol(m)) %in% unique(j)] <- 0 summary(obs.P_new) #> Min. 1st Qu. Median Mean 3rd Qu. Max. #> 0.0000 0.0000 0.0000 5.7792 0.7371 2622.1900 testthat::expect_equal(obs.P, obs.P_new) testthat::expect_equivalent(obs.P, obs.P_new) ``` ^{Created on 2021-08-11 by the [reprex package](https://reprex.tidyverse.org) (v2.0.1)}

Next, I'll try on a fresh clone of DropletUtils with the above fix and see if it works.

lcolladotor commented 3 years ago

Bah, I still can't get the BioC 3.11 results even with these changes :/

> set.seed(100)
> e.out <- emptyDrops(counts(sce.pbmc), barcode.args=list(exclude.from=0))
> table(e.out$FDR < 0.001, useNA = "ifany")

 FALSE   TRUE   <NA>
   989   4300 731991

It didn't work with https://github.com/lcolladotor/DropletUtils/commit/790d099508197fec4163a6909d42b4f5b6e7b715

```R > table(e.out$FDR < 0.001, useNA = "ifany") FALSE TRUE 989 4300 731991 fix_emptyDrops_bioc3.11_vs_3.12 > fix_emptyDrops_bioc3.11_vs_3.12 > options(width = 120) fix_emptyDrops_bioc3.11_vs_3.12 > sessioninfo::session_info() ─ Session info ─────────────────────────────────────────────────────────────────────────────────────────────────────── setting value version R version 4.1.0 (2021-05-18) os macOS Big Sur 10.16 system x86_64, darwin17.0 ui X11 language (EN) collate en_US.UTF-8 ctype en_US.UTF-8 tz America/New_York date 2021-08-11 ─ Packages ─────────────────────────────────────────────────────────────────────────────────────────────────────────── package * version date lib source assertthat 0.2.1 2019-03-21 [1] CRAN (R 4.1.0) beachmat 2.9.0 2021-05-19 [1] Bioconductor beeswarm 0.4.0 2021-06-01 [1] CRAN (R 4.1.0) Biobase * 2.53.0 2021-05-19 [1] Bioconductor BiocFileCache * 2.1.1 2021-06-23 [1] Bioconductor BiocGenerics * 0.39.1 2021-06-08 [1] Bioconductor BiocNeighbors 1.11.0 2021-05-19 [1] Bioconductor BiocParallel 1.27.2 2021-07-12 [1] Bioconductor BiocSingular 1.9.1 2021-06-08 [1] Bioconductor bit 4.0.4 2020-08-04 [1] CRAN (R 4.1.0) bit64 4.0.5 2020-08-30 [1] CRAN (R 4.1.0) bitops 1.0-7 2021-04-24 [1] CRAN (R 4.1.0) blob 1.2.2 2021-07-23 [1] CRAN (R 4.1.0) cachem 1.0.5 2021-05-15 [1] CRAN (R 4.1.0) callr 3.7.0 2021-04-20 [1] CRAN (R 4.1.0) cli 3.0.1 2021-07-17 [1] CRAN (R 4.1.0) colorout 1.2-2 2021-05-24 [1] Github (jalvesaq/colorout@79931fd) colorspace 2.0-2 2021-06-24 [1] CRAN (R 4.1.0) crayon 1.4.1 2021-02-08 [1] CRAN (R 4.1.0) curl 4.3.2 2021-06-23 [1] CRAN (R 4.1.0) DBI 1.1.1 2021-01-15 [1] CRAN (R 4.1.0) dbplyr * 2.1.1 2021-04-06 [1] CRAN (R 4.1.0) DelayedArray 0.19.1 2021-06-25 [1] Bioconductor DelayedMatrixStats 1.15.2 2021-08-05 [1] Bioconductor desc 1.3.0 2021-03-05 [1] CRAN (R 4.1.0) devtools * 2.4.2 2021-06-07 [1] CRAN (R 4.1.0) dplyr 1.0.7 2021-06-18 [1] CRAN (R 4.1.0) dqrng 0.3.0 2021-05-01 [1] CRAN (R 4.1.0) DropletUtils * 1.13.3 2021-08-11 [1] Github (lcolladotor/DropletUtils@790d099) edgeR 3.35.0 2021-05-19 [1] Bioconductor ellipsis 0.3.2 2021-04-29 [1] CRAN (R 4.1.0) fansi 0.5.0 2021-05-25 [1] CRAN (R 4.1.0) fastmap 1.1.0 2021-01-25 [1] CRAN (R 4.1.0) filelock 1.0.2 2018-10-05 [1] CRAN (R 4.1.0) fs 1.5.0 2020-07-31 [1] CRAN (R 4.1.0) generics 0.1.0 2020-10-31 [1] CRAN (R 4.1.0) GenomeInfoDb * 1.29.3 2021-07-01 [1] Bioconductor GenomeInfoDbData 1.2.6 2021-05-24 [1] Bioconductor GenomicRanges * 1.45.0 2021-05-19 [1] Bioconductor ggbeeswarm 0.6.0 2017-08-07 [1] CRAN (R 4.1.0) ggplot2 * 3.3.5 2021-06-25 [1] CRAN (R 4.1.0) ggrepel 0.9.1 2021-01-15 [1] CRAN (R 4.1.0) glue 1.4.2 2020-08-27 [1] CRAN (R 4.1.0) gridExtra 2.3 2017-09-09 [1] CRAN (R 4.1.0) gtable 0.3.0 2019-03-25 [1] CRAN (R 4.1.0) HDF5Array 1.21.0 2021-05-19 [1] Bioconductor httr 1.4.2 2020-07-20 [1] CRAN (R 4.1.0) IRanges * 2.27.0 2021-05-19 [1] Bioconductor irlba 2.3.3 2019-02-05 [1] CRAN (R 4.1.0) lattice 0.20-44 2021-05-02 [1] CRAN (R 4.1.0) lifecycle 1.0.0 2021-02-15 [1] CRAN (R 4.1.0) limma 3.49.4 2021-08-08 [1] Bioconductor locfit 1.5-9.4 2020-03-25 [1] CRAN (R 4.1.0) magrittr 2.0.1 2020-11-17 [1] CRAN (R 4.1.0) Matrix * 1.3-4 2021-06-01 [1] CRAN (R 4.1.0) MatrixGenerics * 1.5.3 2021-08-05 [1] Bioconductor matrixStats * 0.60.0 2021-07-26 [1] CRAN (R 4.1.0) memoise 2.0.0 2021-01-26 [1] CRAN (R 4.1.0) munsell 0.5.0 2018-06-12 [1] CRAN (R 4.1.0) pillar 1.6.2 2021-07-29 [1] CRAN (R 4.1.0) pkgbuild 1.2.0 2020-12-15 [1] CRAN (R 4.1.0) pkgconfig 2.0.3 2019-09-22 [1] CRAN (R 4.1.0) pkgload 1.2.1 2021-04-06 [1] CRAN (R 4.1.0) prettyunits 1.1.1 2020-01-24 [1] CRAN (R 4.1.0) processx 3.5.2 2021-04-30 [1] CRAN (R 4.1.0) prompt 1.0.1 2021-08-03 [1] Github (gaborcsardi/prompt@fc2ac94) ps 1.6.0 2021-02-28 [1] CRAN (R 4.1.0) purrr 0.3.4 2020-04-17 [1] CRAN (R 4.1.0) R.methodsS3 1.8.1 2020-08-26 [1] CRAN (R 4.1.0) R.oo 1.24.0 2020-08-26 [1] CRAN (R 4.1.0) R.utils 2.10.1 2020-08-26 [1] CRAN (R 4.1.0) R6 2.5.0 2020-10-28 [1] CRAN (R 4.1.0) rappdirs 0.3.3 2021-01-31 [1] CRAN (R 4.1.0) Rcpp 1.0.7 2021-07-07 [1] CRAN (R 4.1.0) RCurl 1.98-1.3 2021-03-16 [1] CRAN (R 4.1.0) remotes 2.4.0 2021-06-02 [1] CRAN (R 4.1.0) rhdf5 2.37.0 2021-05-19 [1] Bioconductor rhdf5filters 1.5.0 2021-05-19 [1] Bioconductor Rhdf5lib 1.15.2 2021-07-01 [1] Bioconductor rlang 0.4.11 2021-04-30 [1] CRAN (R 4.1.0) rprojroot 2.0.2 2020-11-15 [1] CRAN (R 4.1.0) RSQLite 2.2.7 2021-04-22 [1] CRAN (R 4.1.0) rsthemes 0.2.1.9000 2021-05-24 [1] Github (gadenbuie/rsthemes@19299e5) rstudioapi 0.13 2020-11-12 [1] CRAN (R 4.1.0) rsvd 1.0.5 2021-04-16 [1] CRAN (R 4.1.0) S4Vectors * 0.31.0 2021-05-19 [1] Bioconductor ScaledMatrix 1.1.0 2021-05-19 [1] Bioconductor scales 1.1.1 2020-05-11 [1] CRAN (R 4.1.0) scater * 1.21.3 2021-08-01 [1] Bioconductor scuttle * 1.3.1 2021-08-05 [1] Bioconductor sessioninfo 1.1.1 2018-11-05 [1] CRAN (R 4.1.0) SingleCellExperiment * 1.15.1 2021-05-21 [1] Bioconductor sparseMatrixStats 1.5.2 2021-08-05 [1] Bioconductor SummarizedExperiment * 1.23.1 2021-06-24 [1] Bioconductor testthat * 3.0.4 2021-07-01 [1] CRAN (R 4.1.0) tibble 3.1.3 2021-07-23 [1] CRAN (R 4.1.0) tidyselect 1.1.1 2021-04-30 [1] CRAN (R 4.1.0) usethis * 2.0.1 2021-02-10 [1] CRAN (R 4.1.0) utf8 1.2.2 2021-07-24 [1] CRAN (R 4.1.0) vctrs 0.3.8 2021-04-29 [1] CRAN (R 4.1.0) vipor 0.4.5 2017-03-22 [1] CRAN (R 4.1.0) viridis 0.6.1 2021-05-11 [1] CRAN (R 4.1.0) viridisLite 0.4.0 2021-04-13 [1] CRAN (R 4.1.0) withr 2.4.2 2021-04-18 [1] CRAN (R 4.1.0) XVector 0.33.0 2021-05-19 [1] Bioconductor zlibbioc 1.39.0 2021-05-19 [1] Bioconductor [1] /Library/Frameworks/R.framework/Versions/4.1devel/Resources/library ```

nor https://github.com/lcolladotor/DropletUtils/commit/2b0bb4bdfed87d538ef314f5a916f7d539936177, both with BioC 3.14 (bioc-devel).

```R > library(BiocFileCache) Loading required package: dbplyr fix_emptyDrops_bioc3.11_vs_3.12 > bfc <- BiocFileCache() fix_emptyDrops_bioc3.11_vs_3.12 > raw.path <- bfcrpath(bfc, file.path( + "http://cf.10xgenomics.com/samples", + "cell-exp/2.1.0/pbmc4k/pbmc4k_raw_gene_bc_matrices.tar.gz" + )) fix_emptyDrops_bioc3.11_vs_3.12 > untar(raw.path, exdir = file.path(tempdir(), "pbmc4k")) fix_emptyDrops_bioc3.11_vs_3.12 > fix_emptyDrops_bioc3.11_vs_3.12 > library(DropletUtils) Loading required package: SingleCellExperiment Loading required package: SummarizedExperiment Loading required package: MatrixGenerics Loading required package: matrixStats Attaching package: ‘MatrixGenerics’ The following objects are masked from ‘package:matrixStats’: colAlls, colAnyNAs, colAnys, colAvgsPerRowSet, colCollapse, colCounts, colCummaxs, colCummins, colCumprods, colCumsums, colDiffs, colIQRDiffs, colIQRs, colLogSumExps, colMadDiffs, colMads, colMaxs, colMeans2, colMedians, colMins, colOrderStats, colProds, colQuantiles, colRanges, colRanks, colSdDiffs, colSds, colSums2, colTabulates, colVarDiffs, colVars, colWeightedMads, colWeightedMeans, colWeightedMedians, colWeightedSds, colWeightedVars, rowAlls, rowAnyNAs, rowAnys, rowAvgsPerColSet, rowCollapse, rowCounts, rowCummaxs, rowCummins, rowCumprods, rowCumsums, rowDiffs, rowIQRDiffs, rowIQRs, rowLogSumExps, rowMadDiffs, rowMads, rowMaxs, rowMeans2, rowMedians, rowMins, rowOrderStats, rowProds, rowQuantiles, rowRanges, rowRanks, rowSdDiffs, rowSds, rowSums2, rowTabulates, rowVarDiffs, rowVars, rowWeightedMads, rowWeightedMeans, rowWeightedMedians, rowWeightedSds, rowWeightedVars Loading required package: GenomicRanges Loading required package: stats4 Loading required package: BiocGenerics Attaching package: ‘BiocGenerics’ The following objects are masked from ‘package:stats’: IQR, mad, sd, var, xtabs The following objects are masked from ‘package:base’: anyDuplicated, append, as.data.frame, basename, cbind, colnames, dirname, do.call, duplicated, eval, evalq, Filter, Find, get, grep, grepl, intersect, is.unsorted, lapply, Map, mapply, match, mget, order, paste, pmax, pmax.int, pmin, pmin.int, Position, rank, rbind, Reduce, rownames, sapply, setdiff, sort, table, tapply, union, unique, unsplit, which.max, which.min Loading required package: S4Vectors Attaching package: ‘S4Vectors’ The following objects are masked from ‘package:base’: expand.grid, I, unname Loading required package: IRanges Loading required package: GenomeInfoDb Loading required package: Biobase Welcome to Bioconductor Vignettes contain introductory material; view with 'browseVignettes()'. To cite Bioconductor, see 'citation("Biobase")', and for packages 'citation("pkgname")'. Attaching package: ‘Biobase’ The following object is masked from ‘package:MatrixGenerics’: rowMedians The following objects are masked from ‘package:matrixStats’: anyMissing, rowMedians fix_emptyDrops_bioc3.11_vs_3.12 > library(Matrix) Attaching package: ‘Matrix’ The following object is masked from ‘package:S4Vectors’: expand fix_emptyDrops_bioc3.11_vs_3.12 > fname <- file.path(tempdir(), "pbmc4k/raw_gene_bc_matrices/GRCh38") fix_emptyDrops_bioc3.11_vs_3.12 > sce.pbmc <- read10xCounts(fname, col.names = TRUE) fix_emptyDrops_bioc3.11_vs_3.12 > fix_emptyDrops_bioc3.11_vs_3.12 > library(scater) Loading required package: scuttle Loading required package: ggplot2 rownames(sce.pbmc) <- uniquifyFeatureNames( fix_emptyDrops_bioc3.11_vs_3.12 > rownames(sce.pbmc) <- uniquifyFeatureNames( + rowData(sce.pbmc)$ID, rowData(sce.pbmc)$Symbol + ) fix_emptyDrops_bioc3.11_vs_3.12 > fix_emptyDrops_bioc3.11_vs_3.12 > # Detección de _droplets_ con células fix_emptyDrops_bioc3.11_vs_3.12 > set.seed(100) fix_emptyDrops_bioc3.11_vs_3.12 > #e.out <- readRDS("e.out_BioC3.11.rds") fix_emptyDrops_bioc3.11_vs_3.12 > e.out <- emptyDrops(counts(sce.pbmc), barcode.args=list(exclude.from=0)) fix_emptyDrops_bioc3.11_vs_3.12 > table(e.out$FDR < 0.001, useNA = "ifany") FALSE TRUE 989 4300 731991 fix_emptyDrops_bioc3.11_vs_3.12 > options(width = 120) fix_emptyDrops_bioc3.11_vs_3.12 > sessioninfo::session_info() ─ Session info ─────────────────────────────────────────────────────────────────────────────────────────────────────── setting value version R version 4.1.0 (2021-05-18) os macOS Big Sur 10.16 system x86_64, darwin17.0 ui X11 language (EN) collate en_US.UTF-8 ctype en_US.UTF-8 tz America/New_York date 2021-08-11 ─ Packages ─────────────────────────────────────────────────────────────────────────────────────────────────────────── package * version date lib source assertthat 0.2.1 2019-03-21 [1] CRAN (R 4.1.0) beachmat 2.9.0 2021-05-19 [1] Bioconductor beeswarm 0.4.0 2021-06-01 [1] CRAN (R 4.1.0) Biobase * 2.53.0 2021-05-19 [1] Bioconductor BiocFileCache * 2.1.1 2021-06-23 [1] Bioconductor BiocGenerics * 0.39.1 2021-06-08 [1] Bioconductor BiocNeighbors 1.11.0 2021-05-19 [1] Bioconductor BiocParallel 1.27.2 2021-07-12 [1] Bioconductor BiocSingular 1.9.1 2021-06-08 [1] Bioconductor bit 4.0.4 2020-08-04 [1] CRAN (R 4.1.0) bit64 4.0.5 2020-08-30 [1] CRAN (R 4.1.0) bitops 1.0-7 2021-04-24 [1] CRAN (R 4.1.0) blob 1.2.2 2021-07-23 [1] CRAN (R 4.1.0) cachem 1.0.5 2021-05-15 [1] CRAN (R 4.1.0) callr 3.7.0 2021-04-20 [1] CRAN (R 4.1.0) cli 3.0.1 2021-07-17 [1] CRAN (R 4.1.0) colorout 1.2-2 2021-05-24 [1] Github (jalvesaq/colorout@79931fd) colorspace 2.0-2 2021-06-24 [1] CRAN (R 4.1.0) crayon 1.4.1 2021-02-08 [1] CRAN (R 4.1.0) curl 4.3.2 2021-06-23 [1] CRAN (R 4.1.0) DBI 1.1.1 2021-01-15 [1] CRAN (R 4.1.0) dbplyr * 2.1.1 2021-04-06 [1] CRAN (R 4.1.0) DelayedArray 0.19.1 2021-06-25 [1] Bioconductor DelayedMatrixStats 1.15.2 2021-08-05 [1] Bioconductor desc 1.3.0 2021-03-05 [1] CRAN (R 4.1.0) devtools * 2.4.2 2021-06-07 [1] CRAN (R 4.1.0) dplyr 1.0.7 2021-06-18 [1] CRAN (R 4.1.0) dqrng 0.3.0 2021-05-01 [1] CRAN (R 4.1.0) DropletUtils * 1.13.3 2021-08-11 [1] Github (lcolladotor/DropletUtils@2b0bb4b) edgeR 3.35.0 2021-05-19 [1] Bioconductor ellipsis 0.3.2 2021-04-29 [1] CRAN (R 4.1.0) fansi 0.5.0 2021-05-25 [1] CRAN (R 4.1.0) fastmap 1.1.0 2021-01-25 [1] CRAN (R 4.1.0) filelock 1.0.2 2018-10-05 [1] CRAN (R 4.1.0) fs 1.5.0 2020-07-31 [1] CRAN (R 4.1.0) generics 0.1.0 2020-10-31 [1] CRAN (R 4.1.0) GenomeInfoDb * 1.29.3 2021-07-01 [1] Bioconductor GenomeInfoDbData 1.2.6 2021-05-24 [1] Bioconductor GenomicRanges * 1.45.0 2021-05-19 [1] Bioconductor ggbeeswarm 0.6.0 2017-08-07 [1] CRAN (R 4.1.0) ggplot2 * 3.3.5 2021-06-25 [1] CRAN (R 4.1.0) ggrepel 0.9.1 2021-01-15 [1] CRAN (R 4.1.0) glue 1.4.2 2020-08-27 [1] CRAN (R 4.1.0) gridExtra 2.3 2017-09-09 [1] CRAN (R 4.1.0) gtable 0.3.0 2019-03-25 [1] CRAN (R 4.1.0) HDF5Array 1.21.0 2021-05-19 [1] Bioconductor httr 1.4.2 2020-07-20 [1] CRAN (R 4.1.0) IRanges * 2.27.0 2021-05-19 [1] Bioconductor irlba 2.3.3 2019-02-05 [1] CRAN (R 4.1.0) lattice 0.20-44 2021-05-02 [1] CRAN (R 4.1.0) lifecycle 1.0.0 2021-02-15 [1] CRAN (R 4.1.0) limma 3.49.4 2021-08-08 [1] Bioconductor locfit 1.5-9.4 2020-03-25 [1] CRAN (R 4.1.0) magrittr 2.0.1 2020-11-17 [1] CRAN (R 4.1.0) Matrix * 1.3-4 2021-06-01 [1] CRAN (R 4.1.0) MatrixGenerics * 1.5.3 2021-08-05 [1] Bioconductor matrixStats * 0.60.0 2021-07-26 [1] CRAN (R 4.1.0) memoise 2.0.0 2021-01-26 [1] CRAN (R 4.1.0) munsell 0.5.0 2018-06-12 [1] CRAN (R 4.1.0) pillar 1.6.2 2021-07-29 [1] CRAN (R 4.1.0) pkgbuild 1.2.0 2020-12-15 [1] CRAN (R 4.1.0) pkgconfig 2.0.3 2019-09-22 [1] CRAN (R 4.1.0) pkgload 1.2.1 2021-04-06 [1] CRAN (R 4.1.0) prettyunits 1.1.1 2020-01-24 [1] CRAN (R 4.1.0) processx 3.5.2 2021-04-30 [1] CRAN (R 4.1.0) prompt 1.0.1 2021-08-03 [1] Github (gaborcsardi/prompt@fc2ac94) ps 1.6.0 2021-02-28 [1] CRAN (R 4.1.0) purrr 0.3.4 2020-04-17 [1] CRAN (R 4.1.0) R.methodsS3 1.8.1 2020-08-26 [1] CRAN (R 4.1.0) R.oo 1.24.0 2020-08-26 [1] CRAN (R 4.1.0) R.utils 2.10.1 2020-08-26 [1] CRAN (R 4.1.0) R6 2.5.0 2020-10-28 [1] CRAN (R 4.1.0) rappdirs 0.3.3 2021-01-31 [1] CRAN (R 4.1.0) Rcpp 1.0.7 2021-07-07 [1] CRAN (R 4.1.0) RCurl 1.98-1.3 2021-03-16 [1] CRAN (R 4.1.0) remotes 2.4.0 2021-06-02 [1] CRAN (R 4.1.0) rhdf5 2.37.0 2021-05-19 [1] Bioconductor rhdf5filters 1.5.0 2021-05-19 [1] Bioconductor Rhdf5lib 1.15.2 2021-07-01 [1] Bioconductor rlang 0.4.11 2021-04-30 [1] CRAN (R 4.1.0) rprojroot 2.0.2 2020-11-15 [1] CRAN (R 4.1.0) RSQLite 2.2.7 2021-04-22 [1] CRAN (R 4.1.0) rsthemes 0.2.1.9000 2021-05-24 [1] Github (gadenbuie/rsthemes@19299e5) rstudioapi 0.13 2020-11-12 [1] CRAN (R 4.1.0) rsvd 1.0.5 2021-04-16 [1] CRAN (R 4.1.0) S4Vectors * 0.31.0 2021-05-19 [1] Bioconductor ScaledMatrix 1.1.0 2021-05-19 [1] Bioconductor scales 1.1.1 2020-05-11 [1] CRAN (R 4.1.0) scater * 1.21.3 2021-08-01 [1] Bioconductor scuttle * 1.3.1 2021-08-05 [1] Bioconductor sessioninfo 1.1.1 2018-11-05 [1] CRAN (R 4.1.0) SingleCellExperiment * 1.15.1 2021-05-21 [1] Bioconductor sparseMatrixStats 1.5.2 2021-08-05 [1] Bioconductor SummarizedExperiment * 1.23.1 2021-06-24 [1] Bioconductor testthat * 3.0.4 2021-07-01 [1] CRAN (R 4.1.0) tibble 3.1.3 2021-07-23 [1] CRAN (R 4.1.0) tidyselect 1.1.1 2021-04-30 [1] CRAN (R 4.1.0) usethis * 2.0.1 2021-02-10 [1] CRAN (R 4.1.0) utf8 1.2.2 2021-07-24 [1] CRAN (R 4.1.0) vctrs 0.3.8 2021-04-29 [1] CRAN (R 4.1.0) vipor 0.4.5 2017-03-22 [1] CRAN (R 4.1.0) viridis 0.6.1 2021-05-11 [1] CRAN (R 4.1.0) viridisLite 0.4.0 2021-04-13 [1] CRAN (R 4.1.0) withr 2.4.2 2021-04-18 [1] CRAN (R 4.1.0) XVector 0.33.0 2021-05-19 [1] Bioconductor zlibbioc 1.39.0 2021-05-19 [1] Bioconductor [1] /Library/Frameworks/R.framework/Versions/4.1devel/Resources/library ```

LTLA commented 3 years ago

I would suggest looking at 584320df70b0c3f8991a1a1485b660225860f780. Reverting that change and running:

> set.seed(100)
> e.out <- emptyDrops(counts(sce.pbmc), barcode.args=list(exclude.from=0))
> table(e.out$FDR < 0.001, useNA = "ifany")

 FALSE   TRUE   <NA> 
  1056   4233 731991

This is mentioned briefly in the NEWS.

lcolladotor commented 3 years ago

Oh interesting, thanks Aaron! I'll do that tomorrow morning. Thanks!

lcolladotor commented 3 years ago

Hi Aaron,

Reverting the bugfix reproduces prior results

Using https://github.com/lcolladotor/DropletUtils/commit/eb8eb45bb2f9571931435ab5f382f56b99e46c71 which reverts https://github.com/MarioniLab/DropletUtils/commit/584320df70b0c3f8991a1a1485b660225860f780 on BioC 3.14 I do indeed get the same results I get with BioC 3.13 using the emptyDrops() output from BioC 3.11. Thanks!

e.out_BioC3.14_lcolladotor_eb8eb45.rds.zip

```R # Usemos datos de pbmc4k library(BiocFileCache) bfc <- BiocFileCache() raw.path <- bfcrpath(bfc, file.path( "http://cf.10xgenomics.com/samples", "cell-exp/2.1.0/pbmc4k/pbmc4k_raw_gene_bc_matrices.tar.gz" )) untar(raw.path, exdir = file.path(tempdir(), "pbmc4k")) library(DropletUtils) library(Matrix) fname <- file.path(tempdir(), "pbmc4k/raw_gene_bc_matrices/GRCh38") sce.pbmc <- read10xCounts(fname, col.names = TRUE) library(scater) rownames(sce.pbmc) <- uniquifyFeatureNames( rowData(sce.pbmc)$ID, rowData(sce.pbmc)$Symbol ) # Detección de _droplets_ con células set.seed(100) e.out <- emptyDrops(counts(sce.pbmc), barcode.args=list(exclude.from=0)) #e.out <- readRDS("~/Desktop/e.out_BioC3.11.rds") table(e.out$FDR < 0.001, useNA = "ifany") # FALSE TRUE # 1056 4233 731991 saveRDS(e.out, file = "~/Desktop/e.out_BioC3.14_lcolladotor_eb8eb45.rds") library(EnsDb.Hsapiens.v86) location <- mapIds(EnsDb.Hsapiens.v86, keys = rowData(sce.pbmc)$ID, column = "SEQNAME", keytype = "GENEID" ) sce.pbmc <- sce.pbmc[, which(e.out$FDR <= 0.001)] # Control de calidad stats <- perCellQCMetrics(sce.pbmc, subsets = list(Mito = which(location == "MT")) ) high.mito <- isOutlier(stats$subsets_Mito_percent, type = "higher" ) sce.pbmc <- sce.pbmc[, !high.mito] # Normalización de los datos library(scran) set.seed(1000) clusters <- quickCluster(sce.pbmc) sce.pbmc <- computeSumFactors(sce.pbmc, cluster = clusters) sce.pbmc <- logNormCounts(sce.pbmc) ## Identificación de genes altamente variables set.seed(1001) dec.pbmc <- modelGeneVarByPoisson(sce.pbmc) top.pbmc <- getTopHVGs(dec.pbmc, prop = 0.1) ## Reducción de dimensiones set.seed(10000) sce.pbmc <- denoisePCA(sce.pbmc, subset.row = top.pbmc, technical = dec.pbmc ) set.seed(100000) sce.pbmc <- runTSNE(sce.pbmc, dimred = "PCA") set.seed(1000000) sce.pbmc <- runUMAP(sce.pbmc, dimred = "PCA") plotTSNE(sce.pbmc) # clustering g <- buildSNNGraph(sce.pbmc, k = 10, use.dimred = "PCA") clust <- igraph::cluster_walktrap(g)$membership sce.pbmc$cluster <- factor(clust) library(celldex) ref <- celldex::BlueprintEncodeData() library(SingleR) pred <- SingleR( test = sce.pbmc, ref = ref, labels = ref$label.main ) sce.pbmc$labels <- pred$labels plotTSNE(sce.pbmc, colour_by = "labels", text_by = "labels") plotTSNE(sce.pbmc, colour_by = "cluster", text_by = "labels") ```

I did notice a few small differences vs Pete's slides, like those few red cells to the right of the monocytes on his version and the shape of the B-cells cluster, but well, that's likely to other changes.

Wrapping up

So indeed, https://github.com/MarioniLab/DropletUtils/commit/584320df70b0c3f8991a1a1485b660225860f780 was the main source of the discrepancy. Since it's a bug fix, like you noted above and at https://github.com/MarioniLab/DropletUtils/commit/24e8073300a77eb75ec7919cef2bee3e10f93463, we can close this issue on the reproducibility of BioC 3.11 results.

`obs.P` calculation

However, regarding the obs.P calculation, while it didn't affect this issue, due to https://github.com/MarioniLab/DropletUtils/issues/67#issuecomment-896956815 and https://github.com/MarioniLab/DropletUtils/issues/67#issuecomment-896968214, should I send a (clean) PR for either https://github.com/lcolladotor/DropletUtils/commit/790d099508197fec4163a6909d42b4f5b6e7b715 or https://github.com/lcolladotor/DropletUtils/commit/2b0bb4bdfed87d538ef314f5a916f7d539936177 (whichever version you prefer). Or is the current obs.P behavior what you expect?

Best, Leo

LTLA commented 3 years ago

What's wrong with the obs.P?

MarioniLab / DropletUtils