NathanSkene / EWCE

Expression Weighted Celltype Enrichment. See the package website for up-to-date instructions on usage.
https://nathanskene.github.io/EWCE/index.html
53 stars 25 forks source link

generate_celltype_data struggling with a large matrix: "too many elements (>= 2^31) selected along dimension 1 of array" #91

Open ainefairbrother opened 8 months ago

ainefairbrother commented 8 months ago

1. Bug description

When running generate_celltype_data() on a gene-by-cell matrix of 34807 x 786896, I get a large array error. My input object (exp) is sparse DelayedMatrix object of type "double".

Console output

100 core(s) assigned as workers (156 reserved).
Converting to sparse matrix.
Error in h(simpleError(msg, call)) :
  error in evaluating the argument 'x' in selecting a method for function 'drop': too many elements (>= 2^31) selected along dimension 1 of array

Session info

``` R version 4.3.0 (2023-04-21) Platform: x86_64-conda-linux-gnu (64-bit) Running under: Ubuntu 22.04.2 LTS Matrix products: default BLAS/LAPACK: /home/drihome/MRAineFairbrotherBrowne/miniconda3/envs/R/lib/libopenblasp-r0.3.23.so; LAPACK version 3.11.0 locale: [1] LC_CTYPE=en_GB.UTF-8 LC_NUMERIC=C [3] LC_TIME=en_GB.UTF-8 LC_COLLATE=en_GB.UTF-8 [5] LC_MONETARY=en_GB.UTF-8 LC_MESSAGES=en_GB.UTF-8 [7] LC_PAPER=en_GB.UTF-8 LC_NAME=C [9] LC_ADDRESS=C LC_TELEPHONE=C [11] LC_MEASUREMENT=en_GB.UTF-8 LC_IDENTIFICATION=C time zone: Etc/UTC tzcode source: system (glibc) attached base packages: [1] parallel stats4 stats graphics grDevices utils datasets [8] methods base other attached packages: [1] scuttle_1.10.1 DelayedMatrixStats_1.22.1 [3] HDF5Array_1.28.1 rhdf5_2.44.0 [5] DelayedArray_0.26.6 S4Arrays_1.0.4 [7] Matrix_1.6-1 DESeq2_1.40.2 [9] SingleCellExperiment_1.22.0 SummarizedExperiment_1.30.2 [11] Biobase_2.60.0 GenomicRanges_1.52.0 [13] GenomeInfoDb_1.36.1 IRanges_2.34.1 [15] S4Vectors_0.38.1 MatrixGenerics_1.12.2 [17] matrixStats_1.0.0 lubridate_1.9.2 [19] forcats_1.0.0 stringr_1.5.0 [21] dplyr_1.1.3 purrr_1.0.1 [23] readr_2.1.4 tidyr_1.3.0 [25] tibble_3.2.1 ggplot2_3.4.3 [27] tidyverse_2.0.0 ewceData_1.8.0 [29] ExperimentHub_2.8.1 AnnotationHub_3.8.0 [31] BiocFileCache_2.8.0 dbplyr_2.3.3 [33] BiocGenerics_0.46.0 EWCE_1.8.2 [35] RNOmni_1.0.1.2 loaded via a namespace (and not attached): [1] jsonlite_1.8.7 magrittr_2.0.3 [3] fs_1.6.3 zlibbioc_1.46.0 [5] vctrs_0.6.3 memoise_2.0.1 [7] RCurl_1.98-1.12 ggtree_3.8.2 [9] rstatix_0.7.2 htmltools_0.5.5 [11] curl_5.0.1 broom_1.0.5 [13] Rhdf5lib_1.22.0 gridGraphics_0.5-1 [15] basilisk_1.12.1 htmlwidgets_1.6.2 [17] HGNChelper_0.8.1 plyr_1.8.8 [19] plotly_4.10.3 cachem_1.0.8 [21] mime_0.12 lifecycle_1.0.3 [23] pkgconfig_2.0.3 R6_2.5.1 [25] fastmap_1.1.1 GenomeInfoDbData_1.2.10 [27] shiny_1.7.4.1 digest_0.6.33 [29] aplot_0.2.2 colorspace_2.1-0 [31] patchwork_1.1.2 AnnotationDbi_1.62.2 [33] rprojroot_2.0.3 grr_0.9.5 [35] RSQLite_2.3.1 ggpubr_0.6.0 [37] beachmat_2.16.0 filelock_1.0.2 [39] timechange_0.2.0 fansi_1.0.4 [41] httr_1.4.6 abind_1.4-5 [43] compiler_4.3.0 here_1.0.1 [45] withr_2.5.0 bit64_4.0.5 [47] backports_1.4.1 BiocParallel_1.34.2 [49] orthogene_1.6.1 carData_3.0-5 [51] DBI_1.1.3 homologene_1.4.68.19.3.27 [53] ggsignif_0.6.4 rappdirs_0.3.3 [55] tools_4.3.0 ape_5.7-1 [57] interactiveDisplayBase_1.38.0 httpuv_1.6.11 [59] glue_1.6.2 rhdf5filters_1.12.1 [61] nlme_3.1-163 promises_1.2.0.1 [63] grid_4.3.0 anndata_0.7.5.6 [65] reshape2_1.4.4 generics_0.1.3 [67] gtable_0.3.4 tzdb_0.4.0 [69] hms_1.1.3 data.table_1.14.8 [71] car_3.1-2 utf8_1.2.3 [73] XVector_0.40.0 BiocVersion_3.17.1 [75] pillar_1.9.0 yulab.utils_0.1.0 [77] babelgene_22.9 limma_3.56.2 [79] later_1.3.1 treeio_1.24.3 [81] lattice_0.21-8 bit_4.0.5 [83] tidyselect_1.2.0 locfit_1.5-9.8 [85] Biostrings_2.68.1 stringi_1.7.12 [87] lazyeval_0.2.2 ggfun_0.1.3 [89] yaml_2.3.7 codetools_0.2-19 [91] BiocManager_1.30.21.1 ggplotify_0.1.2 [93] cli_3.6.1 xtable_1.8-4 [95] reticulate_1.30 munsell_0.5.0 [97] zellkonverter_1.10.1 Rcpp_1.0.11 [99] dir.expiry_1.8.0 gprofiler2_0.2.2 [101] png_0.1-8 ellipsis_0.3.2 [103] assertthat_0.2.1 blob_1.2.4 [105] basilisk.utils_1.12.1 sparseMatrixStats_1.12.2 [107] bitops_1.0-7 viridisLite_0.4.2 [109] tidytree_0.4.5 scales_1.2.1 [111] crayon_1.5.2 rlang_1.1.1 [113] KEGGREST_1.40.0 ```
Al-Murphy commented 7 months ago

Hey! Yep definitely a size issue, are you running the function with the as_DelayedArray = TRUE? By default it is FALSE, this might help. It's hard to debug exactly where you are running into the size issue without having access to the data.

Alan.

bschilder commented 7 months ago

@ainefairbrother it would also be very helpful to know what kind of machine you're running this on (storage, memory, is it an computing cluster? etc.). Could you give us more details on this?

Something that can help is to reduce the number of cores being used. Due to how R often deals with parallelisation (copying the entire environment) more cores isn't always better as it can lead to an explosion of memory usage. This will take some troubleshooting on your end to figure out what your machine can handle.

ainefairbrother commented 7 months ago

@ainefairbrother it would also be very helpful to know what kind of machine you're running this on (storage, memory, is it an computing cluster? etc.). Could you give us more details on this?

Something that can help is to reduce the number of cores being used. Due to how R often deals with parallelisation (copying the entire environment) more cores isn't always better as it can lead to an explosion of memory usage. This will take some troubleshooting on your end to figure out what your machine can handle.

This is a server with ~133TB storage, 1TB RAM

Architecture and CPU info:

Architecture:            x86_64
  CPU op-mode(s):        32-bit, 64-bit
  Address sizes:         43 bits physical, 48 bits virtual
  Byte Order:            Little Endian
CPU(s):                  256
  On-line CPU(s) list:   0-255

Linux kernal version: 5.15.0-73-generic

ainefairbrother commented 7 months ago

@ainefairbrother it would also be very helpful to know what kind of machine you're running this on (storage, memory, is it an computing cluster? etc.). Could you give us more details on this?

Something that can help is to reduce the number of cores being used. Due to how R often deals with parallelisation (copying the entire environment) more cores isn't always better as it can lead to an explosion of memory usage. This will take some troubleshooting on your end to figure out what your machine can handle.

Noted, thanks. I have tried with setting as_DelayedArray to TRUE and reducing the number of cores. I get a different error:

5 core(s) assigned as workers (251 reserved).
Error in .from_Array_to_matrix(x, ...) : unused argument ("matrix")
Al-Murphy commented 7 months ago

You will need to create and share a subset of your data which replicates the issue if you want us to debug this, the function works with the as_DelayedArray = TRUE with other datasets:

> # Load the single cell data
> cortex_mrna <- ewceData::cortex_mrna()
see ?ewceData and browseVignettes('ewceData') for documentation
loading from cache
> # Use only a subset to keep the example quick
> expData <- cortex_mrna$exp#[1:100, ]
> l1 <- cortex_mrna$annot$level1class
> l2 <- cortex_mrna$annot$level2class
> annotLevels <- list(l1 = l1, l2 = l2)
> fNames_ALLCELLS <- EWCE::generate_celltype_data(
+   exp = expData,
+   annotLevels = annotLevels,
+   groupName = "allKImouse",
+   as_DelayedArray = TRUE
+ )
1 core(s) assigned as workers (10 reserved).
Converting to sparse matrix.
Converting to DelayedArray.
+ Calculating normalized mean expression.
Converting to sparse matrix.
Converting to sparse matrix.
+ Calculating normalized specificity.
Converting to sparse matrix.
Converting to sparse matrix.
Converting to sparse matrix.
Converting to sparse matrix.
+ Saving results ==>  /tmp/Rtmpc3QW1s/ctd_allKImouse.rda

Subsetting on the number of genes and cells should be relatively quick to do.

bschilder commented 7 months ago

Noted, thanks. I have tried with setting as_DelayedArray to TRUE and reducing the number of cores. I get a different error:

5 core(s) assigned as workers (251 reserved).
Error in .from_Array_to_matrix(x, ...) : unused argument ("matrix")

It looks like this is coming from somewhere within DelayedArray, but can't quite pinpoint where or why yet.

Also, @ainefairbrother , can you confirm that your exp DelayedArray object is backed at time you're running the function? ie saved to disk and only reading in the each chunk into memory as-needed. That will ensure you're fully benefitting from the DelayedArray capabilities.

This can be done using this set of functions: https://rdrr.io/bioc/HDF5Array/man/saveHDF5SummarizedExperiment.html

See here for more info: https://www.bioconductor.org/packages/release/bioc/vignettes/DelayedArray/inst/doc/02-Implementing_a_backend.html

ainefairbrother commented 7 months ago

Noted, thanks. I have tried with setting as_DelayedArray to TRUE and reducing the number of cores. I get a different error:

5 core(s) assigned as workers (251 reserved).
Error in .from_Array_to_matrix(x, ...) : unused argument ("matrix")

It looks like this is coming from somewhere within DelayedArray, but can't quite pinpoint where or why yet.

Also, @ainefairbrother , can you confirm that your exp DelayedArray object is backed at time you're running the function? ie saved to disk and only reading in the each chunk into memory as-needed. That will ensure you're fully benefitting from the DelayedArray capabilities.

This can be done using this set of functions: https://rdrr.io/bioc/HDF5Array/man/saveHDF5SummarizedExperiment.html

See here for more info: https://www.bioconductor.org/packages/release/bioc/vignettes/DelayedArray/inst/doc/02-Implementing_a_backend.html

Hi Brian - yes, can confirm that it's backed at the time of running the function.