MarioniLab / DropletUtils

Clone of the Bioconductor repository for the DropletUtils package.
https://bioconductor.org/packages/devel/bioc/html/DropletUtils.html
56 stars 27 forks source link

emptyDrops does not terminate #14

Closed s2hui closed 5 years ago

s2hui commented 5 years ago

Hello,

I just upgraded to Bioconductor version 3.9. When I run emptyDrops() on my count data it does not terminate (i.e. endless loop? goes on for hours). I have tried changing niters to be lower (i.e 100) but nothing seems to work.

When I revert back to Bioconductor version 3.8, emptyDrops works and terminates fine.

There are few emptyDrops detected overall (i.e. 5) but I would still like to know why it doesn't terminate when using the latest version. My count data is ~33K genes x ~1500 cells

Thanks shui

LTLA commented 5 years ago

You don't provide the call to emptyDrops(), or even the sessionInfo(). There's not enough information for me to make an educated diagnosis. So instead I'll just ~use my psychic powers~ guess as to what the problem might be:

s2hui commented 5 years ago

The output for sessionInfo() currently is:

R version 3.6.0 (2019-04-26) Platform: x86_64-apple-darwin15.6.0 (64-bit) Running under: macOS Mojave 10.14.4

Matrix products: default BLAS: /System/Library/Frameworks/Accelerate.framework/Versions/A/Frameworks/vecLib.framework/Versions/A/libBLAS.dylib LAPACK: /Library/Frameworks/R.framework/Versions/Current/Resources/lib/libRlapack.dylib

Random number generation: RNG: Mersenne-Twister Normal: Inversion Sample: Rounding

locale: [1] en_CA.UTF-8/en_CA.UTF-8/en_CA.UTF-8/C/en_CA.UTF-8/en_CA.UTF-8

attached base packages: [1] parallel stats4 stats graphics grDevices utils datasets methods base

other attached packages: [1] DropletUtils_1.4.0 Matrix_1.2-17 dplyr_0.8.1
[4] Seurat_3.0.1 scran_1.12.0 EnsDb.Hsapiens.v86_2.99.0
[7] ensembldb_2.8.0 AnnotationFilter_1.8.0 GenomicFeatures_1.36.1
[10] AnnotationDbi_1.46.0 scater_1.12.2 ggplot2_3.1.1
[13] SingleCellExperiment_1.6.0 SummarizedExperiment_1.14.0 DelayedArray_0.10.0
[16] BiocParallel_1.18.0 matrixStats_0.54.0 Biobase_2.44.0
[19] GenomicRanges_1.36.0 GenomeInfoDb_1.20.0 IRanges_2.18.0
[22] S4Vectors_0.22.0 BiocGenerics_0.30.0

loaded via a namespace (and not attached): [1] plyr_1.8.4 igraph_1.2.4.1 lazyeval_0.2.2
[4] splines_3.6.0 listenv_0.7.0 digest_0.6.19
[7] htmltools_0.3.6 viridis_0.5.1 gdata_2.18.0
[10] magrittr_1.5 memoise_1.1.0 cluster_2.0.9
[13] ROCR_1.0-7 limma_3.40.2 globals_0.12.4
[16] Biostrings_2.52.0 R.utils_2.8.0 prettyunits_1.0.2
[19] colorspace_1.4-1 blob_1.1.1 ggrepel_0.8.1
[22] crayon_1.3.4 RCurl_1.95-4.12 jsonlite_1.6
[25] survival_2.44-1.1 zoo_1.8-5 ape_5.3
[28] glue_1.3.1 gtable_0.3.0 zlibbioc_1.30.0
[31] XVector_0.24.0 BiocSingular_1.0.0 Rhdf5lib_1.6.0
[34] future.apply_1.2.0 HDF5Array_1.12.1 scales_1.0.0
[37] DBI_1.0.0 edgeR_3.26.3 bibtex_0.4.2
[40] Rcpp_1.0.1 metap_1.1 viridisLite_0.3.0
[43] progress_1.2.2 reticulate_1.12 dqrng_0.2.1
[46] bit_1.1-14 rsvd_1.0.0 SDMTools_1.1-221.1
[49] tsne_0.1-3 htmlwidgets_1.3 httr_1.4.0
[52] gplots_3.0.1.1 RColorBrewer_1.1-2 ica_1.0-2
[55] pkgconfig_2.0.2 XML_3.98-1.19 R.methodsS3_1.7.1
[58] locfit_1.5-9.1 dynamicTreeCut_1.63-1 labeling_0.3
[61] reshape2_1.4.3 tidyselect_0.2.5 rlang_0.3.4
[64] munsell_0.5.0 tools_3.6.0 RSQLite_2.1.1
[67] ggridges_0.5.1 stringr_1.4.0 npsurv_0.4-0
[70] bit64_0.9-7 fitdistrplus_1.0-14 caTools_1.17.1.2
[73] purrr_0.3.2 RANN_2.6.1 pbapply_1.4-0
[76] future_1.13.0 nlme_3.1-140 R.oo_1.22.0
[79] biomaRt_2.40.0 compiler_3.6.0 beeswarm_0.2.3
[82] plotly_4.9.0 curl_3.3 png_0.1-7
[85] lsei_1.2-0 tibble_2.1.1 statmod_1.4.30
[88] stringi_1.4.3 lattice_0.20-38 ProtGenerics_1.16.0
[91] pillar_1.4.0 Rdpack_0.11-0 lmtest_0.9-37
[94] BiocNeighbors_1.2.0 data.table_1.12.2 cowplot_0.9.4
[97] bitops_1.0-6 irlba_2.3.3 gbRd_0.4-11
[100] rtracklayer_1.44.0 R6_2.4.0 KernSmooth_2.23-15
[103] gridExtra_2.3 vipor_0.4.5 codetools_0.2-16
[106] MASS_7.3-51.4 gtools_3.8.1 assertthat_0.2.1
[109] rhdf5_2.28.0 withr_2.1.2 sctransform_0.2.0
[112] GenomicAlignments_1.20.0 Rsamtools_2.0.0 GenomeInfoDbData_1.2.1
[115] hms_0.4.2 grid_3.6.0 tidyr_0.8.3
[118] DelayedMatrixStats_1.6.0 Rtsne_0.15 ggbeeswarm_0.6.0

s2hui commented 5 years ago

And the call to emptyDrops is:

emptyDrops(counts(sce))

LTLA commented 5 years ago

I don't have much insight to give here. Your call and session information look fine to me. I have no problems running emptyDrops on a Mac, and it works fine on the Bioconductor build system.

s2hui commented 5 years ago

Could I send you the SCE RData object that reproduces this issue? thanks shui

On Mon, May 27, 2019 at 2:53 PM Aaron Lun notifications@github.com wrote:

I don't have much insight to give here. Your call and session information look fine to me. I have no problems running emptyDrops on a Mac, and it works fine on the Bioconductor build system.

— You are receiving this because you authored the thread. Reply to this email directly, view it on GitHub https://github.com/MarioniLab/DropletUtils/issues/14?email_source=notifications&email_token=ACTP5BUXXCAE5XJL5Y3FFFTPXQU3DA5CNFSM4HP37BDKYY3PNVWWK3TUL52HS4DFVREXG43VMVBW63LNMVXHJKTDN5WW2ZLOORPWSZGODWKLM4Y#issuecomment-496285299, or mute the thread https://github.com/notifications/unsubscribe-auth/ACTP5BTNP56JJ5PTD2HF4I3PXQU3DANCNFSM4HP37BDA .

LTLA commented 5 years ago

I just noticed that your count matrix only has 1500 cells. You should really be running it on the raw count matrix prior to any cell calling; for a 10X experiment, this should have ~750,000 barcodes. emptyDrops() will not yield sensible results if you're calling it on barcodes that have already been filtered!

So the real solution is to use the raw count matrix as input into emptyDrops(). Nonetheless, the lack of termination may be a symptom of another problem, so if you can send me the offending count matrix, I can have a look at it. (Best to put it on dropbox or somewhere, rather than trying to email a large file.)

s2hui commented 5 years ago

Hello,

Yes it has a small number of cells as I am QCing per batch and batch 5 only has 1588 cells?? The call to emptyDrops is the first call I make in QCing so prior to the call it is simply the raw count data loaded in.

The RData object is here: https://drive.google.com/open?id=1w3LVafFU8Bw3NYSLVhHsGuKD_VBx0ktl

When I run the following with Bioconductor 3.8 (in 3.9 the call to emptyDrops hangs)

set.seed(100) fdrCutoff = 0.01 e.out <- emptyDrops(counts(sce.i)) sum(e.out$FDR <= fdrCutoff, na.rm=TRUE) print(paste0("Number empty drops: ",length(which(e.out$FDR > fdrCutoff)), " out of " , dim(sce.i)[2]))

[1] "Number empty drops: 1212 out of 1588"

counts(sce.i) 33694 x 1588 sparse Matrix of class "dgCMatrix"

sessionInfo() R version 3.5.0 (2018-04-23) Platform: x86_64-apple-darwin15.6.0 (64-bit) Running under: macOS 10.14.4

Matrix products: default BLAS: /System/Library/Frameworks/Accelerate.framework/Versions/A/Frameworks/vecLib.framework/Versions/A/libBLAS.dylib LAPACK: /Library/Frameworks/R.framework/Versions/3.5/Resources/lib/libRlapack.dylib

locale: [1] en_CA.UTF-8/en_CA.UTF-8/en_CA.UTF-8/C/en_CA.UTF-8/en_CA.UTF-8

attached base packages: [1] parallel stats4 stats graphics grDevices utils datasets methods
[9] base

other attached packages: [1] DropletUtils_1.2.2 dplyr_0.8.1
[3] scran_1.10.2 EnsDb.Hsapiens.v86_2.99.0
[5] ensembldb_2.6.8 AnnotationFilter_1.6.0
[7] GenomicFeatures_1.34.8 AnnotationDbi_1.44.0
[9] scater_1.10.1 Seurat_2.3.4
[11] Matrix_1.2-17 cowplot_0.9.4
[13] ggplot2_3.1.1 SingleCellExperiment_1.4.1 [15] SummarizedExperiment_1.12.0 DelayedArray_0.8.0
[17] BiocParallel_1.16.6 matrixStats_0.54.0
[19] Biobase_2.42.0 GenomicRanges_1.34.0
[21] GenomeInfoDb_1.18.2 IRanges_2.16.0
[23] S4Vectors_0.20.1 BiocGenerics_0.28.0

loaded via a namespace (and not attached): [1] snow_0.4-3 backports_1.1.4 Hmisc_4.2-0
[4] plyr_1.8.4 igraph_1.2.4.1 lazyeval_0.2.2
[7] splines_3.5.0 digest_0.6.19 foreach_1.4.4
[10] htmltools_0.3.6 viridis_0.5.1 lars_1.2
[13] gdata_2.18.0 memoise_1.1.0 magrittr_1.5
[16] checkmate_1.9.3 cluster_2.0.9 mixtools_1.1.0
[19] ROCR_1.0-7 limma_3.38.3 Biostrings_2.50.2
[22] R.utils_2.8.0 prettyunits_1.0.2 colorspace_1.4-1
[25] blob_1.1.1 xfun_0.7 crayon_1.3.4
[28] RCurl_1.95-4.12 jsonlite_1.6 survival_2.44-1.1
[31] zoo_1.8-6 iterators_1.0.10 ape_5.3
[34] glue_1.3.1 gtable_0.3.0 zlibbioc_1.28.0
[37] XVector_0.22.0 kernlab_0.9-27 Rhdf5lib_1.4.3
[40] prabclus_2.2-7.1 DEoptimR_1.0-8 HDF5Array_1.10.1
[43] scales_1.0.0 mvtnorm_1.0-10 edgeR_3.24.3
[46] DBI_1.0.0 bibtex_0.4.2 Rcpp_1.0.1
[49] metap_1.1 dtw_1.20-1 progress_1.2.2
[52] viridisLite_0.3.0 htmlTable_1.13.1 reticulate_1.12
[55] foreign_0.8-71 bit_1.1-14 proxy_0.4-23
[58] mclust_5.4.3 SDMTools_1.1-221.1 Formula_1.2-3
[61] tsne_0.1-3 htmlwidgets_1.3 httr_1.4.0
[64] gplots_3.0.1.1 RColorBrewer_1.1-2 fpc_2.2-1
[67] acepack_1.4.1 modeltools_0.2-22 ica_1.0-2
[70] XML_3.98-1.19 pkgconfig_2.0.2 R.methodsS3_1.7.1
[73] flexmix_2.3-15 nnet_7.3-12 locfit_1.5-9.1
[76] dynamicTreeCut_1.63-1 tidyselect_0.2.5 rlang_0.3.4
[79] reshape2_1.4.3 munsell_0.5.0 tools_3.5.0
[82] RSQLite_2.1.1 ggridges_0.5.1 stringr_1.4.0
[85] yaml_2.2.0 npsurv_0.4-0 knitr_1.23
[88] bit64_0.9-7 fitdistrplus_1.0-14 robustbase_0.93-5
[91] caTools_1.17.1.2 purrr_0.3.2 RANN_2.6.1
[94] pbapply_1.4-0 nlme_3.1-140 R.oo_1.22.0
[97] biomaRt_2.38.0 hdf5r_1.2.0 compiler_3.5.0
[100] rstudioapi_0.10 curl_3.3 beeswarm_0.2.3
[103] png_0.1-7 lsei_1.2-0 statmod_1.4.32
[106] tibble_2.1.2 stringi_1.4.3 lattice_0.20-38
[109] ProtGenerics_1.14.0 pillar_1.4.1 BiocManager_1.30.4
[112] Rdpack_0.11-0 lmtest_0.9-37 BiocNeighbors_1.0.0
[115] data.table_1.12.2 bitops_1.0-6 irlba_2.3.3
[118] gbRd_0.4-11 rtracklayer_1.42.2 R6_2.4.0
[121] latticeExtra_0.6-28 KernSmooth_2.23-15 gridExtra_2.3
[124] vipor_0.4.5 codetools_0.2-16 MASS_7.3-51.4
[127] gtools_3.8.1 assertthat_0.2.1 rhdf5_2.26.2
[130] withr_2.1.2 GenomicAlignments_1.18.1 Rsamtools_1.34.1
[133] GenomeInfoDbData_1.2.0 hms_0.4.2 diptest_0.75-7
[136] doSNOW_1.0.16 grid_3.5.0 rpart_4.1-15
[139] tidyr_0.8.3 class_7.3-15 DelayedMatrixStats_1.4.0 [142] segmented_0.5-4.0 Rtsne_0.15 base64enc_0.1-3
[145] ggbeeswarm_0.6.0

LTLA commented 5 years ago

I just want the count matrix that you're putting into emptyDrops. Surely the count matrix is not 2.3 GB?

LTLA commented 5 years ago

Well, your Rdata file doesn't even have an sce.i object in it. It's just got a giant whopping Seurat object named so. I'll assume you're trying to run emptyDrops on so@raw_data, and I can more-or-less reproduce the lack of termination with:

load("so.qc.normsize.lognorm.scale.pca.tsne.res1.demarkers.12746.RCCTumour.RData")
library(DropletUtils)
system.time(out <- emptyDrops(so@raw.data)) 

The cause of this phenomenon is technically complex. For those who are interested, it seems that Boost's uniform_distribution class really doesn't like it when the maximum and minimum values are the same. This causes an infinite loop, presumably somewhere inside the class. I don't know whether this is a known behaviour or not, certainly the docs suggest that max can be equal to min.

However, the exact cause is irrelevant. Regardless of whether it terminates or not, what you are doing is almost certainly wrong. Assuming this is 10X data, you need to be giving emptyDrops the RAW, UNFILTERED count matrix that comes out of CellRanger. This should have about 750000 columns, not just 12000. You are currently trying to detect empty droplets in a count matrix that has already been filtered to remove empty droplets; the results will be nonsensical.

In fact, the real question is why the function ever worked in the first place - because it shouldn't have. (The problem with Boost only arises because the sum of ambient counts is zero.) So, I will add extra error checks to provide some defense against this, whereby the function will simply quit with an error if there are no low-count libraries available to estimate the ambient profile.

s2hui commented 5 years ago

My apologies, the RData is actually here (I sent the wrong link): https://drive.google.com/open?id=15fSAvOA4rPGY8WOj-Ibvx1yz8aZb7Sms

It is 7-8MB. Sorry about that. Please find a single cell experiment object (sce) that is ~33K x 1588. It has not been QCd or normalized and contains raw counts.

shui

LTLA commented 5 years ago

Perhaps I was less than clear.

emptyDrops() needs the unfiltered matrix. If you only have 1588 cells, your matrix has already been filtered by CellRanger. You need the UNFILTERED matrix. Please, read the description HERE.

If you are trying to run emptyDrops() on this 1588-cell matrix, what you are doing is WRONG. Even if the function were to return something, the results would be NONSENSE.

I have now modified the function so that you CANNOT do what you are trying to do. The function will simply fail with an error message. This is the only way to protect you from yourself.

s2hui commented 5 years ago

Yes you were less than clear to me but thanks for your help anyways.

shui