bnprks / BPCells

Scaling Single Cell Analysis to Millions of Cells
https://bnprks.github.io/BPCells

as.matrix() error when transferring big matrix #129

littlemingone commented 1 day ago

I tried to use as.matrix() to convert an IterableMatrix into a normal matrix or dgCMatrix. The matrix is huge but smaller than 2^31 entries, so it shouldn't exceed the dgCMatrix limit, yet I still got an error. The matrix is Seurat scale data, so there are no zeros in it.

>seurat_obj[['RNA']]$scale.data[1:32768,1:65536] |> as.matrix() # this is a 2^31 matrix test [1:2^15,1:2^16]
Error in (function (cond)  :
  error in evaluating the argument 'x' in selecting a method for function 'as.matrix': Error converting IterableMatrix to dgCMatrix  
* dgCMatrix objects cannot hold more than 2^31 non-zero entries
* Input matrix has 2147483648 entries

>seurat_obj[['RNA']]$scale.data[1:32768,1:65535] |> as.matrix() # 2^16-1=65535
Error in (function (cond)  :
  error in evaluating the argument 'x' in selecting a method for function 'as.matrix': std::bad_alloc

>seurat_obj[['RNA']]$scale.data[1:32768,1:65535] |> as('dgCMatrix')
Error: std::bad_alloc

> traceback()
5: stop(structure(list(message = "std::bad_alloc", call = NULL, 
       cppstack = NULL), class = c("std::bad_alloc", "C++Error",
   "error", "condition")))
4: write_function(it, mat@transpose)
3: write_matrix_memory(convert_matrix_type(from, "double"), compress = FALSE)
2: asMethod(object)
1: as(seurat_obj[["RNA"]]$scale.data[1:32768, 1:65535], "dgCMatrix")
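
To spell out the sizes (my arithmetic, matching the error message):

```r
# The first slice is exactly 2^31 entries; scale.data is dense, so every
# entry counts against the dgCMatrix limit. The second slice is one
# column narrower, which is why only the first hits the 2^31 error.
32768 * 65536 == 2^31   # TRUE
2^15 * 2^16             # 2147483648, the count in the error message
```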
session info
> sessionInfo()
R version 4.4.1 (2024-06-14 ucrt)
Platform: x86_64-w64-mingw32/x64
Running under: Windows 10 x64 (build 19045)

Matrix products: default

locale:
[1] LC_COLLATE=Chinese (Simplified)_China.936  LC_CTYPE=Chinese (Simplified)_China.936    LC_MONETARY=Chinese (Simplified)_China.936 
[4] LC_NUMERIC=C                               LC_TIME=Chinese (Simplified)_China.936

time zone: Asia/Shanghai
tzcode source: internal

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base

other attached packages:
 [1] lubridate_1.9.3    forcats_1.0.0      stringr_1.5.1      purrr_1.0.2        readr_2.1.5        tidyr_1.3.1
 [7] tibble_3.2.1       tidyverse_2.0.0    patchwork_1.2.0    data.table_1.15.4  BPCells_0.2.0      Seurat_5.1.0
[13] SeuratObject_5.0.2 sp_2.1-4           dplyr_1.1.4        glue_1.7.0         ggplot2_3.5.1

loaded via a namespace (and not attached):
  [1] RColorBrewer_1.1-3      jsonlite_1.8.8          magrittr_2.0.3          ggbeeswarm_0.7.2        spatstat.utils_3.1-0
  [6] farver_2.1.2            zlibbioc_1.50.0         vctrs_0.6.5             ROCR_1.0-11             Cairo_1.6-2
 [11] spatstat.explore_3.2-6  htmltools_0.5.8.1       sctransform_0.4.1       parallelly_1.38.0       KernSmooth_2.23-24
 [16] htmlwidgets_1.6.4       ica_1.0-3               plyr_1.8.9              plotly_4.10.4           zoo_1.8-12
 [21] igraph_2.0.3            mime_0.12               lifecycle_1.0.4         pkgconfig_2.0.3         Matrix_1.6-5
 [26] R6_2.5.1                fastmap_1.2.0           GenomeInfoDbData_1.2.12 MatrixGenerics_1.16.0   fitdistrplus_1.2-1
 [31] future_1.34.0           shiny_1.9.1             digest_0.6.37           colorspace_2.1-1        S4Vectors_0.42.1
 [36] tensor_1.5              RSpectra_0.16-2         irlba_2.3.5.1           GenomicRanges_1.56.1    labeling_0.4.3
 [41] progressr_0.14.0        timechange_0.3.0        fansi_1.0.6             spatstat.sparse_3.1-0   httr_1.4.7
 [46] polyclip_1.10-7         abind_1.4-5             compiler_4.4.1          withr_3.0.1             fastDummies_1.7.4
 [51] MASS_7.3-60.0.1         tools_4.4.1             vipor_0.4.7             lmtest_0.9-40           beeswarm_0.4.0
 [56] httpuv_1.6.15           future.apply_1.11.2     goftest_1.2-3           nlme_3.1-165            promises_1.3.0
 [61] grid_4.4.1              Rtsne_0.17              cluster_2.1.6           reshape2_1.4.4          generics_0.1.3
 [66] gtable_0.3.5            spatstat.data_3.1-2     tzdb_0.4.0              hms_1.1.3               utf8_1.2.4
 [71] XVector_0.44.0          BiocGenerics_0.50.0     spatstat.geom_3.2-9     RcppAnnoy_0.0.22        ggrepel_0.9.5
 [76] RANN_2.6.1              pillar_1.9.0            spam_2.10-0             RcppHNSW_0.6.0          later_1.3.2
 [81] splines_4.4.1           lattice_0.22-6          survival_3.7-0          deldir_2.0-4            httpgd_2.0.2
 [86] tidyselect_1.2.1        miniUI_0.1.1.1          pbapply_1.7-2           gridExtra_2.3           IRanges_2.38.1
 [91] scattermore_1.2         stats4_4.4.1            matrixStats_1.3.0       stringi_1.8.4           UCSC.utils_1.0.0
 [96] unigd_0.1.2             lazyeval_0.2.2          codetools_0.2-20        cli_3.6.3               uwot_0.1.16
[101] systemfonts_1.1.0       xtable_1.8-4            reticulate_1.38.0       munsell_0.5.1           Rcpp_1.0.13
[106] GenomeInfoDb_1.40.1     globals_0.16.3          spatstat.random_3.2-3   png_0.1-8               ggrastr_1.0.2
[111] parallel_4.4.1          dotCall64_1.1-1         listenv_0.9.1           viridisLite_0.4.2       scales_1.3.0
[116] ggridges_0.5.6          leiden_0.4.3.1          rlang_1.1.4             cowplot_1.1.3
  
matrix info
 
>seurat_obj[['RNA']]$scale.data
36601 x 122043 IterableMatrix object with class RenameDims

Row names: MIR1302-2HG, FAM138A ... AC007325.2
Col names: AAACCTGAGACTGGGT-1_1, AAACCTGAGTATCTCG-1_1 ... TTTGTCATCTTGCATT-1_14

Data type: double
Storage order: column major

Queued Operations:
1. Load compressed matrix from directory E:\test_dir\on_disk_data\count
2. Select rows: 1, 2 ... 36601 and cols: 1, 2 ... 124901
3. Reset dimnames
4. Convert type from uint32_t to double
5. Scale by 1e+04
6. Scale columns by 9.55e-05, 0.000114 ... 0.000233
7. Transform log1p
8. Reset dimnames
9. Transform min by row: 0.015, 0.1 ... 0.124
10. Scale rows by 668, 100 ... 80.7
11. Shift rows by -0.00286, 0 ... -0.016
12. Reset dimnames
  

But I asked someone else to try the same thing, and it worked fine. I also tried a fresh install of R 4.4.1 (CRAN installer) with install.packages("BPCells", repos = c("https://bnprks.r-universe.dev", "https://cloud.r-project.org")), read the on-disk data created before, and got the same error.

And, by the way, I think we might need a function that can read csv or tsv files as a dgCMatrix or some other low-memory object like the BPCells on-disk format, so we could read a huge csv without first loading it as a data.frame and then converting it to matrix and dgCMatrix. Some datasets are only distributed as csv files; for example this one, containing more than 120K cells in a single csv. To analyze it, I had to read the whole file as a data.frame and then convert its class with as.matrix() and as('dgCMatrix'). It was slow and a torment.
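
A sketch of that workflow (file name hypothetical), with every copy held in memory at once:

```r
library(Matrix)

# Current route: whole csv -> data.frame -> dense matrix -> dgCMatrix
df <- read.csv("counts.csv", row.names = 1)  # huge data.frame of counts
m  <- as.matrix(df)                          # full dense copy
sm <- as(m, "CsparseMatrix")                 # dgCMatrix copy
```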

bnprks commented 17 hours ago

Hi @littlemingone, do you think it's possible your computer was running out of memory for the tests that resulted in std::bad_alloc errors? Each matrix you're requesting would take around 16GB of RAM for the final object, but BPCells will use several times that amount in intermediate steps -- on my own laptop I observed ~50GB of RAM usage before I had to force-quit. If it worked on someone else's computer, I wonder if it's just a matter of them having a lot more RAM available.
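
For reference, the ~16GB figure is just the entry count times 8-byte doubles:

```r
# Back-of-envelope for the requested dense slice (doubles, 8 bytes each):
32768 * 65535 * 8 / 2^30   # ~16 GiB for the final matrix alone
```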

Do you have a reason you need to make such large dense matrices from BPCells objects? This isn't really a use-case I've optimized for.

As for functions to read large csvs, that's not currently on the roadmap for BPCells (though we're always open to looking at pull requests if others have written new features they want to contribute). I might suggest trying the data.table or vroom libraries for fast reading of CSVs. Those libraries often also have an option to read a subset of the rows of a csv (e.g. via skip and n_max in vroom), which you might be able to use to read the file in several chunks to reduce your memory usage. It looks like the dataset you linked would take 38GB of RAM as a dense matrix, so you'd probably benefit from splitting it into 5-10 chunks.
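
A rough sketch of that chunked approach (the function name, chunk size, and genes-by-cells layout are my assumptions, not BPCells API):

```r
library(vroom)
library(Matrix)

# Read a genes-x-cells csv in row chunks and build one sparse matrix,
# so the full data.frame never exists in memory at once.
read_csv_sparse <- function(path, chunk_size = 5000) {
  cells <- names(vroom(path, n_max = 1, show_col_types = FALSE))[-1]
  chunks <- list()
  skip <- 1                                   # skip the header row
  repeat {
    chunk <- vroom(path, skip = skip, n_max = chunk_size,
                   col_names = FALSE, show_col_types = FALSE)
    if (nrow(chunk) == 0) break
    m <- as.matrix(chunk[, -1])               # this chunk only, dense
    rownames(m) <- chunk[[1]]                 # gene names in column 1
    chunks[[length(chunks) + 1]] <- as(m, "CsparseMatrix")
    skip <- skip + nrow(chunk)
    if (nrow(chunk) < chunk_size) break       # final partial chunk
  }
  out <- do.call(rbind, chunks)               # rbind keeps it sparse
  colnames(out) <- cells
  out
}
```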

I'd also double-check your free RAM with e.g. Task Manager on Windows, since spilling out of RAM into swap space is also a potential cause of extreme slowness.

-Ben

littlemingone commented 7 hours ago

In fact, I noticed this problem through what I believe is a bug in Seurat::RunPCA(). As I said, I was handling a huge count matrix with more than 120K cells, and of course I used BPCells to accelerate the analysis. But at the RunPCA() step, I got the error about dgCMatrix exceeding 2^31 values.

* dgCMatrix objects cannot hold more than 2^31 non-zero entries
* Input matrix has 3365020071 entries 

The RunPCA() step uses the scale.data layer, so most of the values are non-zero. I used all genes to generate scale.data in case I need some of them to draw a heatmap. I ran RunPCA() with no features parameter, so it should use only the 2000 HVGs, and 2000*120000 shouldn't exceed the 2^31 limit. But the error message says the matrix has 3365020071 entries, which is precisely the size of the whole scale.data matrix. So Seurat::RunPCA() converts the whole scale.data matrix instead of only the part it needs, and with data at the 100K-cell level the complete scale.data will always exceed the dgCMatrix limit.
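
A quick check of the arithmetic:

```r
2000 * 120000        # 2.4e8 entries for the HVG slice, well under the limit
2^31                 # 2147483648, the dgCMatrix cap
3365020071 > 2^31    # TRUE: the full scale.data can never fit
```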

After the 2^31 error, I tried to convert scale.data into a dense matrix to work around the problem, and that's when I got the std::bad_alloc error. In the end, I fixed it by running ScaleData() with only the 2000 HVGs first, then RunPCA(). After generating the PCA data, I ran ScaleData() again with all genes.
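
Roughly like this (object name assumed; ScaleData() defaults to the variable features when features is not given):

```r
# Workaround sketch: scale only the HVGs first so RunPCA()'s conversion
# stays under the dgCMatrix limit, then re-scale all genes afterwards.
seurat_obj <- ScaleData(seurat_obj)       # default: 2000 variable features
seurat_obj <- RunPCA(seurat_obj)          # converts only the HVG rows now
seurat_obj <- ScaleData(seurat_obj, features = rownames(seurat_obj))  # all genes
```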

Maybe handling a huge dense matrix is not the use-case you expect, but with the current version of Seurat, BPCells will run into this situation.

And about the RAM cost: the [R 4.4.1 docs](https://search.r-project.org/R/refmans/base/html/Memory-limits.html) say all objects are held in virtual memory. I had raised my virtual memory to 150GB before the test, and my physical RAM is 64GB, so I think it should be OK. But there were some other tasks running on my computer at that time, so I'm not sure. When I converted a small subset of the data, I didn't get the std::bad_alloc error. I will try again later.