hashedDrops not outputting the correct Best/Second

zebasilio commented 5 months ago

Dear DropletUtils team,

Thank you for your package. I was using hashedDrops() for doublet detection in a 10x Genomics multiplexed single-cell RNA-seq experiment when I noticed that the function was not correctly identifying either "Best" or "Second". I simulated a data matrix to reproduce this, and the result is below. Thank you for your help.

All the best, José Basílio

x <- matrix(sample(1:500, 100), nrow = 10)
x
      [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10]
 [1,]   93  431   23  183  273  130  397  414  307   140
 [2,]  171  270  448  286  254  150    3  269  491    53
 [3,]  350  311  374  429  225  106  373  464  245   136
 [4,]  428  471  354  316  182  191   73   49  485    39
 [5,]  195   31  241  378  122  142   84  249   77    18
 [6,]  119  336  337  100   91  359  416  465  413   139
 [7,]  403  153   83   88   65  204   22   36   21   252
 [8,]   44  380  390   94  422  156  194  181  256   372
 [9,]  430  268  367   71   74  302   96  327  382   257
[10,]   61  263  426  253  353  383   51   59  109   349

hash_stats <- DropletUtils::hashedDrops(x)
hash_stats
DataFrame with 10 rows and 7 columns
   Total Best Second     LogFC   LogFC2 Doublet Confident
1   2294    7      4 0.1302685 0.935836   FALSE     FALSE
2   2914    1      2 0.2138469 1.027222   FALSE     FALSE
3   3043    2      4 0.8723392 0.465449   FALSE     FALSE
4   2198    2      5 0.2710261 0.891623   FALSE     FALSE
5   2061    2      8 0.0720561 1.181088   FALSE     FALSE
6   2123    2     10 0.0487423 0.640153   FALSE     FALSE
7   1709    1      6 0.4124690 1.895075   FALSE     FALSE
8   2513    1      2 0.1779031 1.034232   FALSE     FALSE
9   2786    2      4 0.4587688 1.170009   FALSE     FALSE
10  1755    7      8 0.0618384 0.435935   FALSE     FALSE

sessionInfo()
R version 4.4.0 (2024-04-24)
Platform: x86_64-pc-linux-gnu
Running under: Debian GNU/Linux 12 (bookworm)

Matrix products: default
BLAS: /usr/lib/x86_64-linux-gnu/blas/libblas.so.3.11.0
LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.11.0

locale: [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8
[5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8 LC_PAPER=en_US.UTF-8 LC_NAME=C
[9] LC_ADDRESS=C LC_TELEPHONE=C LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C

time zone: Europe/Vienna
tzcode source: system (glibc)

attached base packages:
[1] stats4 stats graphics grDevices utils datasets methods base

other attached packages:
[1] scater_1.32.0 scran_1.32.0 scuttle_1.14.0 DropletUtils_1.24.0
[5] SingleCellExperiment_1.26.0 SummarizedExperiment_1.34.0 Biobase_2.64.0 GenomicRanges_1.56.0
[9] GenomeInfoDb_1.40.0 IRanges_2.38.0 S4Vectors_0.42.0 BiocGenerics_0.50.0
[13] MatrixGenerics_1.16.0 matrixStats_1.3.0 lubridate_1.9.3 forcats_1.0.0
[17] stringr_1.5.1 dplyr_1.1.4 purrr_1.0.2 readr_2.1.5
[21] tidyr_1.3.1 tibble_3.2.1 ggplot2_3.5.1 tidyverse_2.0.0

loaded via a namespace (and not attached):
[1] tidyselect_1.2.1 viridisLite_0.4.2 vipor_0.4.7 viridis_0.6.5 R.utils_2.12.3
[6] bluster_1.14.0 pacman_0.5.1 rsvd_1.0.5 timechange_0.3.0 lifecycle_1.0.4
[11] cluster_2.1.6 statmod_1.5.0 magrittr_2.0.3 compiler_4.4.0 rlang_1.1.3
[16] tools_4.4.0 igraph_2.0.3 utf8_1.2.4 S4Arrays_1.4.1 dqrng_0.4.0
[21] DelayedArray_0.30.1 abind_1.4-5 BiocParallel_1.38.0 HDF5Array_1.32.0 withr_3.0.0
[26] R.oo_1.26.0 grid_4.4.0 fansi_1.0.6 beachmat_2.20.0 colorspace_2.1-0
[31] Rhdf5lib_1.26.0 edgeR_4.2.0 scales_1.3.0 cli_3.6.2 crayon_1.5.2
[36] generics_0.1.3 metapod_1.12.0 rstudioapi_0.16.0 httr_1.4.7 tzdb_0.4.0
[41] DelayedMatrixStats_1.26.0 ggbeeswarm_0.7.2 rhdf5_2.48.0 zlibbioc_1.50.0 parallel_4.4.0
[46] BiocManager_1.30.23 XVector_0.44.0 vctrs_0.6.5 Matrix_1.7-0 jsonlite_1.8.8
[51] BiocSingular_1.20.0 BiocNeighbors_1.22.0 hms_1.1.3 ggrepel_0.9.5 beeswarm_0.4.0
[56] irlba_2.3.5.1 locfit_1.5-9.9 limma_3.60.2 glue_1.7.0 codetools_0.2-20
[61] stringi_1.8.4 gtable_0.3.5 UCSC.utils_1.0.0 ScaledMatrix_1.12.0 munsell_0.5.1
[66] pillar_1.9.0 rhdf5filters_1.16.0 GenomeInfoDbData_1.2.12 R6_2.5.1 sparseMatrixStats_1.16.0
[71] lattice_0.22-6 R.methodsS3_1.8.2 Rcpp_1.0.12 gridExtra_2.3 SparseArray_1.4.5
[76] pkgconfig_2.0.3

LTLA commented 5 months ago

The algorithm is not as simple as just doing max.col. It tries to find the ambient concentration of each HTO so that differences in HTO molarity (e.g., due to incorrect mixing) do not bias the calls. The ambient concentration is then used to normalize each HTO's count within each cell before choosing Best and Second.
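To make the effect of that normalization concrete, here is a minimal base-R sketch. It is a simplified illustration only, not the DropletUtils implementation: the toy counts and the `ambient` vector are invented for the example, and the real function does more than a plain division (for instance, it also has to estimate the ambient profile when none is supplied).

```r
# Simplified illustration only; NOT the actual hashedDrops() internals.
# Rows are HTOs and columns are cells, matching the layout hashedDrops() expects.
counts <- matrix(
    c(900, 100,   # cell1
      400, 120,   # cell2
      300, 600),  # cell3
    nrow = 2,
    dimnames = list(c("HTO_A", "HTO_B"), paste0("cell", 1:3))
)

# Hypothetical ambient profile: HTO_A is 5x more concentrated than HTO_B.
ambient <- c(HTO_A = 5, HTO_B = 1)

# Naive call: the HTO with the largest raw count in each cell.
naive_best <- apply(counts, 2, which.max)

# Divide each HTO (row) by its ambient concentration. R recycles the vector
# down the columns, so row i is divided by ambient[i].
normalized <- counts / ambient

# Rank HTOs within each cell on the normalized scale.
ranked <- apply(normalized, 2, order, decreasing = TRUE)
best <- ranked[1, ]
second <- ranked[2, ]

print(rbind(naive_best, best, second))
```

In this toy data, cell2 has more raw counts for HTO_A than for HTO_B, but after accounting for the assumed 5-fold excess of HTO_A in the ambient solution, HTO_B becomes the top-ranked HTO. This is the kind of discrepancy from a per-cell maximum that the normalization can produce.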

We attempt to infer the ambient concentration from the data, but in your toy example, you don't have enough cells to do that. If you have a better estimate of the ambient concentration (e.g., from the empty droplets, or if you are confident that the concentrations are equal), you can set ambient= to force the function to respect your wishes.
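As a rough sketch of the second option, the toy example could be rerun with an explicit ambient profile. Using equal concentrations here is purely an assumption for illustration; in a real experiment the vector would come from a better estimate, such as the empty droplets. The comparison against the per-cell maximum is included so the two sets of calls can be inspected side by side, without any claim about what the output will be.

```r
# Sketch only: assumes all HTOs have equal ambient concentration,
# which is an illustrative choice and not a recommendation.
set.seed(100)
x <- matrix(sample(1:500, 100), nrow = 10)  # rows = HTOs, columns = cells

# One ambient value per HTO (i.e., per row of x).
equal_ambient <- rep(1, nrow(x))

# Per-cell maximum on the raw counts, for comparison.
naive_best <- apply(x, 2, which.max)

if (requireNamespace("DropletUtils", quietly = TRUE)) {
    hash_stats <- DropletUtils::hashedDrops(x, ambient = equal_ambient)
    # Inspect the calls next to the naive per-cell maximum.
    print(data.frame(naive_best = naive_best, Best = hash_stats$Best))
}
```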

zebasilio commented 3 months ago

Thank you so much for your explanation.

MarioniLab / DropletUtils

hashedDrops not outputting the correct Best/Second #108