Open lcolladotor opened 3 years ago
From my embedded tweet that doesn't show up properly on GitHub
Peter Hickey's slides, likely made with BioC 3.11
BioC 3.13
I haven't tried the whole thing up to annotating the cell clusters with BioC 3.12, but the shape of the t-SNE is the same as in BioC 3.13.
FWIW I'm pretty certain figures in my slides are just copy+pasted from the version of OSCA at the time I made the slides.
Thanks for the info Pete!
Here's a verification that the t-SNE shape issue is due to DropletUtils::emptyDrops()
since saving the e.out <- emptyDrops()
output in BioC 3.11 and using it with 3.13 returns the same shape of the t-SNE.
The annotated clusters and all that are different, likely due to other changes.
Left: BioC 3.13 with e.out
from BioC 3.11
Middle: Pete's slides, likely with BioC 3.11
Right: BioC 3.13
Actually, the colors was due to
plotTSNE(sce.pbmc, colour_by = "labels", text_by = "labels")
vs
plotTSNE(sce.pbmc, colour_by = "cluster", text_by = "labels")
Left: BioC 3.13 with e.out from BioC 3.11 Right: Pete's slides
Show below is the difference between using BioC 3.13 with the old output from DropletUtils::emptyDrops()
, which matches the BioC 3.11 output, and the new BioC 3.13 output. In terms of explaining methods, I don't think that the BioC 3.13 output is bad in any way. So we could potentially close this issue. Though in terms of reproducing DropletUtils::emptyDrops()
output from BioC 3.11 with newer versions, well, that's a different story. My guess is that the answer is that it simply can't, like I said earlier.
Left: BioC 3.13 with e.out from BioC 3.111 Middle: Pete's slides with BioC 3.11 Right: BioC 3.13 only
Ohh thanks!!
El mar., 10 de agosto de 2021 22:25, Leonardo Collado-Torres < @.***> escribió:
Actually, the colors was due to
plotTSNE(sce.pbmc, colour_by = "labels", text_by = "labels")
vs
plotTSNE(sce.pbmc, colour_by = "cluster", text_by = "labels")
[image: Screen Shot 2021-08-10 at 11 10 30 PM] https://user-images.githubusercontent.com/2288213/128963828-314c8272-bc68-488a-877b-1356e2aa7a44.png
Left: BioC 3.13 with e.out from BioC 3.11 Right: Pete's slides
Show below is the difference between using BioC 3.13 with the old output from DropletUtils::emptyDrops(), which matches the BioC 3.11 output, and the new BioC 3.13 output. In terms of explaining methods, I don't think that the BioC 3.13 output is bad in any way. So we could potentially close this issue. Though in terms of reproducing DropletUtils::emptyDrops() output from BioC 3.11 with newer versions, well, that's a different story. My guess is that the answer is that it simply can't, like I said earlier.
[image: Screen Shot 2021-08-10 at 11 20 16 PM] https://user-images.githubusercontent.com/2288213/128964570-6873f9c5-af6e-41bf-b326-23224e5b0302.png
Left: BioC 3.13 with e.out from BioC 3.111 Middle: Pete's slides with BioC 3.11 Right: BioC 3.13 only
— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub https://github.com/MarioniLab/DropletUtils/issues/67#issuecomment-896470899, or unsubscribe https://github.com/notifications/unsubscribe-auth/ABUHQZXGRU6IS2FGRQOTYIDT4HUS5ANCNFSM5B5IWV7A . Triage notifications on the go with GitHub Mobile for iOS https://apps.apple.com/app/apple-store/id1477376905?ct=notification-email&mt=8&pt=524675 or Android https://play.google.com/store/apps/details?id=com.github.android&utm_campaign=notification-email .
Well, if your scope is "somewhere between April and Oct 2020", then the more impactful change is that of #48, which adjusts the spline fit and presumably shifts the retain
threshold a bit. You could attempt reverting this change by setting barcode.args=list(exclude.from=0)
, see the argument description in ?barcodeRanks
.
I see, thanks Aaron. I wasn't aware of that change, though well, with BioC 3.13 I get the same results with and without barcode.args=list(exclude.from=0)
. That is:
> e.out <- emptyDrops(counts(sce.pbmc), barcode.args=list(exclude.from=0))
> table(e.out$FDR < 0.001, useNA = "ifany")
FALSE TRUE <NA>
989 4300 731991
Under a first inspection, it doesn't seem to be that the following commit is the problem https://github.com/MarioniLab/DropletUtils/commit/71b161e44c61efa96226e04adef32a58d4c190db#diff-cf7a6dd4088835e532ce821c61ceb633ccc99ee3ffe8a812ab3c0a7f46dcaa8b either based on the test below and from looking at https://github.com/LTLA/beachmat/blob/d3e4bafbc840df2d106197d7163c3f3aa6fb1fb1/R/whichNonZero.R#L42-L44.
set.seed(100)
p.n0 <- runif(100)
j <- sample(1:10, 100, TRUE)
mat <- matrix(ncol = 10)
dim(mat)
#> [1] 1 10
max(j)
#> [1] 10
by.col <- aggregate(p.n0, list(Col=j), sum)
obs.P <- numeric(ncol(mat))
obs.P[by.col$Col] <- by.col$x
obs.P
#> [1] 3.998302 4.518524 2.379189 5.237262 6.077561 4.126877 4.634267
#> [8] 4.671363 4.275538 12.068117
j_new <- factor(j, levels=seq_len(ncol(mat)))
obs.P_new <- tapply(p.n0, INDEX=j_new, FUN=sum)
obs.P_new <- as.numeric(obs.P_new)
obs.P_new
#> [1] 3.998302 4.518524 2.379189 5.237262 6.077561 4.126877 4.634267
#> [8] 4.671363 4.275538 12.068117
testthat::expect_equal(obs.P, obs.P_new)
testthat::expect_equivalent(obs.P, obs.P_new)
Created on 2021-08-11 by the reprex package (v2.0.1)
However, in that case, the full matrix mat
has an equal number of columns as the max value of j
and all columns of mat
have j
values.
However, let's say that the full matrix has 15 columns with columns 6 to 10 being all zeros.
## 100 random p-values for 100 random j columns
## in a sparse matrix
set.seed(100)
p.n0 <- runif(100)
## Let's say that we have empty colums 6 through 10
j <- sample(c(1:5, 11:15), 100, TRUE)
mat <- matrix(ncol = max(j))
dim(mat)
#> [1] 1 15
by.col <- aggregate(p.n0, list(Col=j), sum)
obs.P <- numeric(ncol(mat))
obs.P[by.col$Col] <- by.col$x
obs.P
#> [1] 3.998302 4.518524 2.379189 5.237262 6.077561 0.000000 0.000000
#> [8] 0.000000 0.000000 0.000000 4.126877 4.634267 4.671363 4.275538
#> [15] 12.068117
j_new <- factor(j, levels=seq_len(ncol(mat)))
obs.P_new <- tapply(p.n0, INDEX=j_new, FUN=sum)
obs.P_new <- as.numeric(obs.P_new)
obs.P_new
#> [1] 3.998302 4.518524 2.379189 5.237262 6.077561 NA NA
#> [8] NA NA NA 4.126877 4.634267 4.671363 4.275538
#> [15] 12.068117
testthat::expect_equal(obs.P, obs.P_new)
#> Error: `obs.P` not equal to `obs.P_new`.
#> 5/15 mismatches (average diff: NaN)
#> [6] 0 - NA == NA
#> [7] 0 - NA == NA
#> [8] 0 - NA == NA
#> [9] 0 - NA == NA
#> [10] 0 - NA == NA
testthat::expect_equivalent(obs.P, obs.P_new)
#> Error: `obs.P` not equivalent to `obs.P_new`.
#> 5/15 mismatches (average diff: NaN)
#> [6] 0 - NA == NA
#> [7] 0 - NA == NA
#> [8] 0 - NA == NA
#> [9] 0 - NA == NA
#> [10] 0 - NA == NA
Created on 2021-08-11 by the reprex package (v2.0.1)
In that case, we end up having NA
s in the obs.P_new
object. The current version of the code would be https://github.com/MarioniLab/DropletUtils/blob/master/R/emptyDrops.R#L311-L327.
With the data used for this issue, I see that well, we do have lots of non-zero columns as expected.
where max(j)
corresponds to ncol(mat)
.
> library(beachmat)
>
> mat <- counts(sce.pbmc)
> dim(mat)
[1] 33694 737280
> nonzero <- whichNonZero(mat)
> length(unique(nonzero$j))
[1] 272442
> max(nonzero$j)
[1] 737280
pressed the wrong button while drafting my message....
Ok, this does seem to be the issue with the actual data as shown in the reprex
below with BioC 3.13.
I imagine that it doesn't matter for most droplets since well, they already had NA
p-values in the past.
# Usemos datos de pbmc4k
library(BiocFileCache)
#> Loading required package: dbplyr
bfc <- BiocFileCache()
raw.path <- bfcrpath(bfc, file.path(
"http://cf.10xgenomics.com/samples",
"cell-exp/2.1.0/pbmc4k/pbmc4k_raw_gene_bc_matrices.tar.gz"
))
untar(raw.path, exdir = file.path(tempdir(), "pbmc4k"))
library(DropletUtils)
#> Loading required package: SingleCellExperiment
#> Loading required package: SummarizedExperiment
#> Loading required package: MatrixGenerics
#> Loading required package: matrixStats
#>
#> Attaching package: 'MatrixGenerics'
#> The following objects are masked from 'package:matrixStats':
#>
#> colAlls, colAnyNAs, colAnys, colAvgsPerRowSet, colCollapse,
#> colCounts, colCummaxs, colCummins, colCumprods, colCumsums,
#> colDiffs, colIQRDiffs, colIQRs, colLogSumExps, colMadDiffs,
#> colMads, colMaxs, colMeans2, colMedians, colMins, colOrderStats,
#> colProds, colQuantiles, colRanges, colRanks, colSdDiffs, colSds,
#> colSums2, colTabulates, colVarDiffs, colVars, colWeightedMads,
#> colWeightedMeans, colWeightedMedians, colWeightedSds,
#> colWeightedVars, rowAlls, rowAnyNAs, rowAnys, rowAvgsPerColSet,
#> rowCollapse, rowCounts, rowCummaxs, rowCummins, rowCumprods,
#> rowCumsums, rowDiffs, rowIQRDiffs, rowIQRs, rowLogSumExps,
#> rowMadDiffs, rowMads, rowMaxs, rowMeans2, rowMedians, rowMins,
#> rowOrderStats, rowProds, rowQuantiles, rowRanges, rowRanks,
#> rowSdDiffs, rowSds, rowSums2, rowTabulates, rowVarDiffs, rowVars,
#> rowWeightedMads, rowWeightedMeans, rowWeightedMedians,
#> rowWeightedSds, rowWeightedVars
#> Loading required package: GenomicRanges
#> Loading required package: stats4
#> Loading required package: BiocGenerics
#> Loading required package: parallel
#>
#> Attaching package: 'BiocGenerics'
#> The following objects are masked from 'package:parallel':
#>
#> clusterApply, clusterApplyLB, clusterCall, clusterEvalQ,
#> clusterExport, clusterMap, parApply, parCapply, parLapply,
#> parLapplyLB, parRapply, parSapply, parSapplyLB
#> The following objects are masked from 'package:stats':
#>
#> IQR, mad, sd, var, xtabs
#> The following objects are masked from 'package:base':
#>
#> anyDuplicated, append, as.data.frame, basename, cbind, colnames,
#> dirname, do.call, duplicated, eval, evalq, Filter, Find, get, grep,
#> grepl, intersect, is.unsorted, lapply, Map, mapply, match, mget,
#> order, paste, pmax, pmax.int, pmin, pmin.int, Position, rank,
#> rbind, Reduce, rownames, sapply, setdiff, sort, table, tapply,
#> union, unique, unsplit, which.max, which.min
#> Loading required package: S4Vectors
#>
#> Attaching package: 'S4Vectors'
#> The following objects are masked from 'package:base':
#>
#> expand.grid, I, unname
#> Loading required package: IRanges
#>
#> Attaching package: 'IRanges'
#> The following object is masked from 'package:grDevices':
#>
#> windows
#> Loading required package: GenomeInfoDb
#> Loading required package: Biobase
#> Welcome to Bioconductor
#>
#> Vignettes contain introductory material; view with
#> 'browseVignettes()'. To cite Bioconductor, see
#> 'citation("Biobase")', and for packages 'citation("pkgname")'.
#>
#> Attaching package: 'Biobase'
#> The following object is masked from 'package:MatrixGenerics':
#>
#> rowMedians
#> The following objects are masked from 'package:matrixStats':
#>
#> anyMissing, rowMedians
library(Matrix)
#>
#> Attaching package: 'Matrix'
#> The following object is masked from 'package:S4Vectors':
#>
#> expand
fname <- file.path(tempdir(), "pbmc4k/raw_gene_bc_matrices/GRCh38")
sce.pbmc <- read10xCounts(fname, col.names = TRUE)
# Anotación de los genes
library(scater)
#> Loading required package: scuttle
#> Loading required package: ggplot2
rownames(sce.pbmc) <- uniquifyFeatureNames(
rowData(sce.pbmc)$ID, rowData(sce.pbmc)$Symbol
)
library(beachmat)
mat <- counts(sce.pbmc)
dim(mat)
#> [1] 33694 737280
nonzero <- whichNonZero(mat)
length(unique(nonzero$j))
#> [1] 272442
max(nonzero$j)
#> [1] 737280
m <- DropletUtils:::.rounded_to_integer(mat, 0)
nonzero <- whichNonZero(m)
i <- nonzero$i
j <- nonzero$j
x <- nonzero$x
set.seed(100)
p.n0 <- runif(length(j))
by.col <- aggregate(p.n0, list(Col=j), sum)
obs.P <- numeric(ncol(mat))
obs.P[by.col$Col] <- by.col$x
summary(obs.P)
#> Min. 1st Qu. Median Mean 3rd Qu. Max.
#> 0.0000 0.0000 0.0000 5.7792 0.7371 2622.1900
j_new <- factor(j, levels=seq_len(ncol(m)))
obs.P_new <- tapply(p.n0, INDEX=j_new, FUN=sum)
obs.P_new <- as.numeric(obs.P_new)
summary(obs.P_new)
#> Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
#> 0.0 0.6 1.3 15.6 9.1 2622.2 464838
testthat::expect_equal(obs.P, obs.P_new)
#> Error: `obs.P` not equal to `obs.P_new`.
#> 464838/737280 mismatches (average diff: NaN)
#> [2] 0 - NA == NA
#> [4] 0 - NA == NA
#> [7] 0 - NA == NA
#> [8] 0 - NA == NA
#> [9] 0 - NA == NA
#> [11] 0 - NA == NA
#> [16] 0 - NA == NA
#> [17] 0 - NA == NA
#> [18] 0 - NA == NA
#> ...
testthat::expect_equivalent(obs.P, obs.P_new)
#> Error: `obs.P` not equivalent to `obs.P_new`.
#> 464838/737280 mismatches (average diff: NaN)
#> [2] 0 - NA == NA
#> [4] 0 - NA == NA
#> [7] 0 - NA == NA
#> [8] 0 - NA == NA
#> [9] 0 - NA == NA
#> [11] 0 - NA == NA
#> [16] 0 - NA == NA
#> [17] 0 - NA == NA
#> [18] 0 - NA == NA
#> ...
Created on 2021-08-11 by the reprex package (v2.0.1)
Adding obs.P[is.na(obs.P)] <- 0
or obs.P[!seq_len(ncol(block)) %in% unique(j)] <- 0
resolves the issue (with object names from https://github.com/MarioniLab/DropletUtils/blob/master/R/emptyDrops.R#L311-L327) in the reprex.
## 100 random p-values for 100 random j columns
## in a sparse matrix
set.seed(100)
p.n0 <- runif(100)
## Let's say that we have empty colums 6 through 10
j <- sample(c(1:5, 11:15), 100, TRUE)
mat <- matrix(ncol = max(j))
dim(mat)
#> [1] 1 15
by.col <- aggregate(p.n0, list(Col=j), sum)
obs.P <- numeric(ncol(mat))
obs.P[by.col$Col] <- by.col$x
j_new <- factor(j, levels=seq_len(ncol(mat)))
obs.P_new <- tapply(p.n0, INDEX=j_new, FUN=sum)
obs.P_new <- as.numeric(obs.P_new)
## This works
# obs.P_new[is.na(obs.P_new)] <- 0
## But well, maybe there are NAs in the original p.n0 already
## Here's a more robust fix to simply add 0
## to the columns with no data
obs.P_new[!seq_len(ncol(mat)) %in% unique(j)] <- 0
data.frame(obs.P, obs.P_new)
#> obs.P obs.P_new
#> 1 3.998302 3.998302
#> 2 4.518524 4.518524
#> 3 2.379189 2.379189
#> 4 5.237262 5.237262
#> 5 6.077561 6.077561
#> 6 0.000000 0.000000
#> 7 0.000000 0.000000
#> 8 0.000000 0.000000
#> 9 0.000000 0.000000
#> 10 0.000000 0.000000
#> 11 4.126877 4.126877
#> 12 4.634267 4.634267
#> 13 4.671363 4.671363
#> 14 4.275538 4.275538
#> 15 12.068117 12.068117
testthat::expect_equal(obs.P, obs.P_new)
testthat::expect_equivalent(obs.P, obs.P_new)
Created on 2021-08-11 by the reprex package (v2.0.1)
This is true with the full data too.
Next, I'll try on a fresh clone of DropletUtils
with the above fix and see if it works.
Bah, I still can't get the BioC 3.11 results even with these changes :/
> set.seed(100)
> e.out <- emptyDrops(counts(sce.pbmc), barcode.args=list(exclude.from=0))
> table(e.out$FDR < 0.001, useNA = "ifany")
FALSE TRUE <NA>
989 4300 731991
It didn't work with https://github.com/lcolladotor/DropletUtils/commit/790d099508197fec4163a6909d42b4f5b6e7b715
nor https://github.com/lcolladotor/DropletUtils/commit/2b0bb4bdfed87d538ef314f5a916f7d539936177, both with BioC 3.14 (bioc-devel).
I would suggest looking at 584320df70b0c3f8991a1a1485b660225860f780. Reverting that change and running:
> set.seed(100)
> e.out <- emptyDrops(counts(sce.pbmc), barcode.args=list(exclude.from=0))
> table(e.out$FDR < 0.001, useNA = "ifany")
FALSE TRUE <NA>
1056 4233 731991
This is mentioned briefly in the NEWS
.
Oh interesting, thanks Aaron! I'll do that tomorrow morning. Thanks!
Hi Aaron,
Using https://github.com/lcolladotor/DropletUtils/commit/eb8eb45bb2f9571931435ab5f382f56b99e46c71 which reverts https://github.com/MarioniLab/DropletUtils/commit/584320df70b0c3f8991a1a1485b660225860f780 on BioC 3.14 I do indeed get the same results I get with BioC 3.13 using the emptyDrops()
output from BioC 3.11. Thanks!
e.out_BioC3.14_lcolladotor_eb8eb45.rds.zip
I did notice a few small differences vs Pete's slides, like those few red cells to the right of the monocytes on his version and the shape of the B-cells cluster, but well, that's likely to other changes.
So indeed, https://github.com/MarioniLab/DropletUtils/commit/584320df70b0c3f8991a1a1485b660225860f780 was the main source of the discrepancy. Since it's a bug fix, like you noted above and at https://github.com/MarioniLab/DropletUtils/commit/24e8073300a77eb75ec7919cef2bee3e10f93463, we can close this issue on the reproducibility of BioC 3.11 results.
obs.P
calculationHowever, regarding the obs.P
calculation, while it didn't affect this issue, due to https://github.com/MarioniLab/DropletUtils/issues/67#issuecomment-896956815 and https://github.com/MarioniLab/DropletUtils/issues/67#issuecomment-896968214, should I send a (clean) PR for either https://github.com/lcolladotor/DropletUtils/commit/790d099508197fec4163a6909d42b4f5b6e7b715 or https://github.com/lcolladotor/DropletUtils/commit/2b0bb4bdfed87d538ef314f5a916f7d539936177 (whichever version you prefer). Or is the current obs.P
behavior what you expect?
Best, Leo
What's wrong with the obs.P
?
Hi Aaron et al,
We (@Yalbibalderas @AnaBVA) noticed that the following code from OSCA returns different t-SNE plots. Eventually we narrowed down the issue to BioC 3.11 vs 3.12 (3.13 is the same as 3.12). In particular, the issue is with changes between those versions for
DropletUtils
.The path to
DropletUtils
is documented at https://twitter.com/lcolladotor/status/1425242252872409092?s=20.You can reproduce this with:
where BioC 3.11 returns
and later versions return
The full code is below for getting to the tSNE (adapted from https://comunidadbioinfo.github.io/cdsb2021_scRNAseq/anotaci%C3%B3n-de-clusters-de-c%C3%A9lulas.html and ultimately from OSCA at https://bioconductor.org/books/release/OSCA/unfiltered-human-pbmcs-10x-genomics.html) is below.
We think that it comes down to
DropletUtils::emptyDrops()
and between April 27 and October 27 2020 (when BioC 3.12 was bioc-devel), there's 2 commits that altered it, in particular we think that https://github.com/MarioniLab/DropletUtils/commit/71b161e44c61efa96226e04adef32a58d4c190db might be the root of the difference.Based on the history, we see that commit was early in the BioC 3.12 devel cycle. With that in mind, I did a chimera installation with BioC 3.12 and the BioC 3.11 version of
DropletUtils
, specifically https://github.com/MarioniLab/DropletUtils/commit/51d00b01dc8d65266bd00232b0420d893d52f6ea (remotes::install_github("MarioniLab/DropletUtils@51d00b0")
). That version reproduced the t-SNE andDropletUtils::emptyDrops()
we see with BioC 3.11.I tried installing this version of DropletUtils https://github.com/MarioniLab/DropletUtils/commit/71b161e44c61efa96226e04adef32a58d4c190db but I can't due some compilation errors related to
scuttle
, which makes sense since in BioC 3.12 you changed many packages. I also tried a chimera of BioC 3.11 and upgradingDropletUtils
to BioC 3.12 versions (withremotes::install_github()
) around the time of the suspected root commit and related packages, but well, I couldn't get it to work tonight.In any case, do you think that there's a way to get the same
emptyDrops()
results with current versions ofDropletUtils
as those we were getting back in BioC 3.11? My sense is that the answer is no since the internals of the function changed. Aka, it's not like a parameter changed and we can just change it from sayFALSE
toTRUE
, etc.All of this arose due to how in BioC 3.11, the clusters of CD4 and CD8 T-cells looked nicely separated in the t-SNE and don't with current versions as shown below.
And well, we wanted to be able that question when we teach that part of OSCA on Friday morning. Best, Leo