bnprks / BPCells

Scaling Single Cell Analysis to Millions of Cells
https://bnprks.github.io/BPCells
Other
167 stars 17 forks source link

class S4 is not subsettablle #78

Open Flu09 opened 8 months ago

Flu09 commented 8 months ago

I submitted the same question at https://github.com/satijalab/seurat/issues/8604

I just do not know the source of the issue if its the BPCells library or Seurat.

but this issue seems to happen only when I use conda.

h5ad file wget -c https://datasets.cellxgene.cziscience.com/a5463d8f-07df-4870-8cae-bc504de762c8.h5ad

that is how i installed the two packages:

conda create -n Rtools r-base=4.3.2 conda activate Rtools
conda install conda-forge::r-seurat conda install conda-forge::r-matrix=1.6.3 conda install rschauner::r-bpcells the I updated seurat from r using install.packages("Seurat")

bnprks commented 8 months ago

Hi @Flu09, thanks for your question. Would you be able to provide an end-to-end example that shows code to reproduce the bug you're getting? (You can skip the installation steps, just starting from the code to create a Seurat project from the h5ad link you list)

It would be pretty quick for me to find where the issue is happening using R's debugging tools, but only once I'm able to run and reproduce the issue on my end.

Two related tips in case they're useful to you:

  1. If you can use backticks to get block code formatting nicely in markdown as follows: (putting the "r" at the end of the first line makes it use R syntax highlighting)
    ```r
    # R code
    x <- 1
  2. If you put the code that causes a crash into a function, then put a call to the built-in browser() function before the line that causes the crash, it's possible to use R's builtin debugger to find where and how crashes are happening. (More info here).
Flu09 commented 8 months ago

Hello,

I run a new test on the file. This time I did not add the metadata nor imported the dim reduction. and I got this time a 'memory not mapped' when running RunPCA(). I run the scripts using slurm, so probably browser() would not work for me.

caught segfault address 0x149ccf3533f0, cause 'memory not mapped'

Traceback: 1: build_csparse_matrix_double_cpp(iter) 2: asMethod(object) 3: as(from, "dgCMatrix") 4: as.matrix(.) 5: as(from, "dgCMatrix") %>% as.matrix() 6: asMethod(object) 7: as(x, "matrix") 8: as.matrix.IterableMatrix(X) 9: as.matrix(X) 10: apply(X = data.use, MARGIN = 1L, FUN = var) 11: PrepDR5(object = object, features = features, layer = layer, verbose = verbose) 12: RunPCA.StdAssay(object = object[[assay]], assay = assay, features = features, npcs = npcs, rev.pca = rev.pca, weight.by.var = weight.by.var, verbose = verbose, ndims.print = ndims.print, nfeatures.print = nfeatures.print, reduction.key = reduction.key, seed.use = seed.use, ...) 13: RunPCA(object = object[[assay]], assay = assay, features = features, npcs = npcs, rev.pca = rev.pca, weight.by.var = weight.by.var, verbose = verbose, ndims.print = ndims.print, nfeatures.print = nfeatures.print, reduction.key = reduction.key, seed.use = seed.use, ...) 14: RunPCA.Seurat(seurat_object) 15: RunPCA(seurat_object) 16: eval(expr, envir, enclos) 17: eval(expr, envir, enclos) 18: eval_with_user_handlers(expr, envir, enclos, user_handlers) 19: withVisible(eval_with_user_handlers(expr, envir, enclos, user_handlers)) 20: withCallingHandlers(withVisible(eval_with_user_handlers(expr, envir, enclos, user_handlers)), warning = wHandler, error = eHandler, message = mHandler) 21: handle(ev <- withCallingHandlers(withVisible(eval_with_user_handlers(expr, envir, enclos, user_handlers)), warning = wHandler, error = eHandler, message = mHandler)) 22: timing_fn(handle(ev <- withCallingHandlers(withVisible(eval_with_user_handlers(expr, envir, enclos, user_handlers)), warning = wHandler, error = eHandler, message = mHandler))) 23: evaluate_call(expr, parsed$src[[i]], envir = envir, enclos = enclos, debug = debug, last = i == length(out), use_try = stop_on_error != 2L, keep_warning = keep_warning, keep_message = keep_message, log_echo = log_echo, log_warning = log_warning, output_handler = output_handler, include_timing = include_timing) 24: evaluate::evaluate(...) 25: evaluate(code, envir = env, new_device = FALSE, keep_warning = if (is.numeric(options$warning)) TRUE else options$warning, keep_message = if (is.numeric(options$message)) TRUE else options$message, stop_on_error = if (is.numeric(options$error)) options$error else { if (options$error && options$include) 0L else 2L }, output_handler = knit_handlers(options$render, options)) 26: in_dir(input_dir(), expr) 27: in_input_dir(evaluate(code, envir = env, new_device = FALSE, keep_warning = if (is.numeric(options$warning)) TRUE else options$warning, keep_message = if (is.numeric(options$message)) TRUE else options$message, stop_on_error = if (is.numeric(options$error)) options$error else { if (options$error && options$include) 0L else 2L }, output_handler = knit_handlers(options$render, options))) 28: eng_r(options) 29: block_exec(params) 30: call_block(x) 31: process_group.block(group) 32: process_group(group) 33: withCallingHandlers(if (tangle) process_tangle(group) else process_group(group), error = function(e) if (xfun::pkg_available("rlang", "1.0.0")) rlang::entrace(e)) 34: withCallingHandlers(expr, error = function(e) { loc = paste0(current_lines(), label, sprintf(" (%s)", knit_concord$get("infile"))) message(one_string(handler(e, loc)))}) 35: handle_error(withCallingHandlers(if (tangle) process_tangle(group) else process_group(group), error = function(e) if (xfun::pkg_available("rlang", "1.0.0")) rlang::entrace(e)), function(e, loc) { setwd(wd) write_utf8(res, output %n% stdout()) paste0("\nQuitting from lines ", loc) }, if (labels[i] != "") sprintf(" [%s]", labels[i])) 36: process_file(text, output) 37: knitr::knit(knit_input, knit_output, envir = envir, quiet = quiet) 38: rmarkdown::render("/user/brainref/hope.Rmd") An irrecoverable exception occurred. R is aborting now ... /var/spool/slurm/job32765541/slurm_script: line 25: 3267803 Segmentation fault Rscript -e "rmarkdown::render('/user/brainref/test.Rmd')"

code:

library(Seurat)
library(BPCells)
file_path <- "/user/brainref/a5463d8f-07df-4870-8cae-bc504de762c8.h5ad"
data <- open_matrix_anndata_hdf5(file_path)
write_matrix_dir(mat = data, dir = gsub(".h5ad", "_BP", file_path))
mat <- open_matrix_dir(dir = gsub(".h5ad", "_BP", file_path))
seurat_object  <- CreateSeuratObject(counts = mat)
seurat_object <- NormalizeData(seurat_object)
seurat_object  <- FindVariableFeatures(seurat_object)
seurat_object  <- ScaleData(seurat_object)
seurat_object
#saveRDS(seurat_object,  "/user/brainref/new_ref.rds")
options(error = function() {
  dump.frames(to.file = TRUE, to.file.raise = FALSE)
})
seurat_object <- RunPCA(seurat_object)
bnprks commented 5 months ago

Hi @Flu09, sorry for the long delay on this. I think I have isolated this particular error to a problem in Seurat which I have just opened a pull request for.

In short, the PrepDR5 function from Seurat ends up converting the whole scale.data layer to an in-memory object accidentally, and a relatively simple change can help avoid that.

I'm pretty sure that the crash you experienced was related to memory issues, either directly running out of memory or problems with R dgCMatrix objects not supporting more than 2.1 billion entries without crashing.

I'm trying to do a run on your example dataset now to confirm that my fix will work for you -- I'll update here once I've got that run to completion with and without my fix for Seurat.

Flu09 commented 3 months ago

Hello, how are you?

Thank you. I have another question which might be related. I have compiled some samples into a BPCells object its about 1.5 million cells and i faced this issue on the RunPCA step. I am not sure to open a new thread or it might be related so I just posted it here.

mergedf[["RNA"]] <- split(mergedf[["RNA"]], f = mergedf$dataset)
mergedf <- NormalizeData(mergedf)
mergedf <- FindVariableFeatures(mergedf)
mergedf <- ScaleData(mergedf, features = rownames(mergedf))
mergedf <- RunPCA(mergedf)

mergedf <- RunPCA(mergedf) Error in h(simpleError(msg, call)) : error in evaluating the argument 'x' in selecting a method for function 'as.matrix': invalid format '%d'; use format %f, %e, %g or %a for numeric objects

mergedf An object of class Seurat 40090 features across 1530043 samples within 1 assay Active assay: RNA (40090 features, 2000 variable features) 29 layers present: counts.dataset1, counts.dataset2, counts.dataset3, counts.dataset4, counts.dataset5, counts.dataset6, counts.dataset7, counts.dataset8, counts.dataset9, counts.dataset10, counts.dataset11, counts.dataset12, counts.dataset13, counts.dataset14, data.dataset1, data.dataset2, data.dataset3, data.dataset4, data.dataset5, data.dataset6, data.dataset7, data.dataset8, data.dataset9, data.dataset10, data.dataset11, data.dataset12, data.dataset13, datadataset14, scale.data

> sessionInfo()
R version 4.4.0 (2024-04-24)
Platform: x86_64-pc-linux-gnu
Running under: Rocky Linux 9.3 (Blue Onyx)

Matrix products: default
BLAS/LAPACK: /usr/lib64/libopenblas-r0.3.21.so;  LAPACK version 3.9.0

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C               LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8    
 [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8    LC_PAPER=en_US.UTF-8       LC_NAME=C                 
 [9] LC_ADDRESS=C               LC_TELEPHONE=C             LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       

time zone: UTC
tzcode source: system (glibc)

attached base packages:
[1] stats4    stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
 [1] batchelor_1.20.0            SingleCellExperiment_1.26.0 SummarizedExperiment_1.34.0 Biobase_2.64.0             
 [5] GenomicRanges_1.56.0        GenomeInfoDb_1.40.1         IRanges_2.38.0              S4Vectors_0.42.0           
 [9] BiocGenerics_0.50.0         MatrixGenerics_1.16.0       matrixStats_1.3.0           SeuratWrappers_0.3.2       
[13] patchwork_1.2.0             ggrepel_0.9.5               ggplot2_3.5.1               dplyr_1.1.4                
[17] SeuratDisk_0.0.0.9021       Seurat_5.0.0                BPCells_0.2.0               SeuratObject_5.0.2         
[21] sp_2.1-4                   

loaded via a namespace (and not attached):
  [1] RcppAnnoy_0.0.22          splines_4.4.0             later_1.3.2               tibble_3.2.1              R.oo_1.26.0              
  [6] polyclip_1.10-7           fastDummies_1.7.3         lifecycle_1.0.4           globals_0.16.3            lattice_0.22-6           
 [11] hdf5r_1.3.11              MASS_7.3-60.2             magrittr_2.0.3            plotly_4.10.4             rmarkdown_2.27           
 [16] yaml_2.3.10               remotes_2.5.0             httpuv_1.6.15             sctransform_0.4.1         spam_2.10-0              
 [21] sessioninfo_1.2.2         pkgbuild_1.4.4            spatstat.sparse_3.1-0     reticulate_1.37.0         cowplot_1.1.3            
 [26] pbapply_1.7-2             RColorBrewer_1.1-3        ResidualMatrix_1.14.1     abind_1.4-5               pkgload_1.3.4            
 [31] zlibbioc_1.50.0           Rtsne_0.17                purrr_1.0.2               R.utils_2.12.3            GenomeInfoDbData_1.2.12  
 [36] irlba_2.3.5.1             listenv_0.9.1             spatstat.utils_3.0-5      goftest_1.2-3             RSpectra_0.16-2          
 [41] spatstat.random_3.3-1     fitdistrplus_1.2-1        parallelly_1.38.0         DelayedMatrixStats_1.26.0 leiden_0.4.3.1           
 [46] codetools_0.2-20          DelayedArray_0.30.1       scuttle_1.14.0            DT_0.33                   tidyselect_1.2.1         
 [51] UCSC.utils_1.0.0          ScaledMatrix_1.12.0       spatstat.explore_3.3-1    jsonlite_1.8.8            BiocNeighbors_1.22.0     
 [56] ellipsis_0.3.2            progressr_0.14.0          ggridges_0.5.6            survival_3.7-0            tools_4.4.0              
 [61] ica_1.0-3                 Rcpp_1.0.13               glue_1.7.0                gridExtra_2.3             SparseArray_1.4.8        
 [66] xfun_0.46                 usethis_2.2.3             withr_3.0.1               BiocManager_1.30.23       fastmap_1.2.0            
 [71] fansi_1.0.6               digest_0.6.36             rsvd_1.0.5                R6_2.5.1                  mime_0.12                
 [76] colorspace_2.1-1          scattermore_1.2           tensor_1.5                spatstat.data_3.1-2       R.methodsS3_1.8.2        
 [81] utf8_1.2.4                tidyr_1.3.1               generics_0.1.3            data.table_1.15.4         httr_1.4.7               
 [86] htmlwidgets_1.6.4         S4Arrays_1.4.1            uwot_0.2.2                pkgconfig_2.0.3           gtable_0.3.5             
 [91] lmtest_0.9-40             XVector_0.44.0            htmltools_0.5.8.1         profvis_0.3.8             dotCall64_1.1-1          
 [96] scales_1.3.0              png_0.1-8                 spatstat.univar_3.0-0     knitr_1.48                rstudioapi_0.16.0        
[101] reshape2_1.4.4            nlme_3.1-165              cachem_1.1.0              zoo_1.8-12                stringr_1.5.1            
[106] KernSmooth_2.23-24        parallel_4.4.0            miniUI_0.1.1.1            pillar_1.9.0              grid_4.4.0               
[111] vctrs_0.6.5               RANN_2.6.1                urlchecker_1.0.1          promises_1.3.0            BiocSingular_1.20.0      
[116] beachmat_2.20.0           xtable_1.8-4              cluster_2.1.6             evaluate_0.24.0           cli_3.6.3                
[121] compiler_4.4.0            rlang_1.1.4               crayon_1.5.3              future.apply_1.11.2       plyr_1.8.9               
[126] fs_1.6.4                  stringi_1.8.4             BiocParallel_1.38.0       viridisLite_0.4.2         deldir_2.0-4             
[131] munsell_0.5.1             lazyeval_0.2.2            devtools_2.4.5            spatstat.geom_3.3-2       Matrix_1.7-0             
[136] RcppHNSW_0.6.0            sparseMatrixStats_1.16.0  bit64_4.0.5               future_1.34.0             shiny_1.9.1              
[141] ROCR_1.0-11               igraph_2.0.3              memoise_2.0.1             bit_4.0.5  
bnprks commented 3 months ago

Hi @Flu09, I think the error message you're seeing might be the same as in issue #95. That specific problem in creating an error message should be fixed if you re-install the latest BPCells, but unfortunately the underlying issue with RunPCA will probably still be there as it is due to an issue in Seurat's internal PrepDR5 function as mentioned above.

The fix I submitted at the end of May has not made it into the Seurat main branch yet, but it is in the develop branch. If you re-install Seurat from the github develop branch with remotes::install_github("satijalab/seurat", ref="develop") I think it will make your issue go away. You might also consider updating to the latest BPCells with remotes::install_github("bnprks/BPCells/r"), but without changing Seurat that will just result in a nicer error message being printed for this specific issue.

Hope that helps!

-Ben

Flu09 commented 3 months ago

Thank you. I got the develop version then I noticed it gave me alot of warning and errors so I did not give the RunPCA a test. then I tried to install back Seurat and now I get the 2 ^31 error when Joining the layers of objects ( a step i did numerously before without issues) The object together make 1.3 million. I am not sure what went wrong.

bnprks commented 3 months ago

Hmm, I'm not personally very familiar with Seurat workflows that involve splitting/joining layers (usually I just use rbind/cbind to concatenate all my inputs and use a single layer during analysis). So it's hard for me to guess what might be going wrong without being able to test out an example with code + data on my end

Are you still working off of the h5ad file you linked initially in this issue? If you're able to share with me some example code + data that starts from an h5ad file and hits the errors you're talking about here then I can do a better job of tracking down where the crash is happening and what might be possible to fix it.

Flu09 commented 3 months ago

no I am working on a different one. I can prepare some data to send you. it is just this error now keep showing up after the bunch of install uninstall I did yesterday. Now even after sketching the BPCells object I cannot do RunPCA on the sketched assay.

`Error in h(simpleError(msg, call)) : error in evaluating the argument 'x' in selecting a method for function 'as.matrix': invalid format '%d'; use format %f, %e, %g or %a for numeric objects

18. h(simpleError(msg, call)) 17. .handleSimpleError(function (cond) .Internal(C_tryCatchHelper(addr, 1L, cond)), "invalid format '%d'; use format %f, %e, %g or %a for numeric objects", base::quote(sprintf("Input matrix has %d entries", length(res@index)))) 16. sprintf("Input matrix has %d entries", length(res@index)) 15. is_formula(message, scoped = TRUE, lhs = FALSE) 14. rlang::abort(c("Error converting IterableMatrix to dgCMatrix", "dgCMatrix objects cannot hold more than 2^31 non-zero entries", sprintf("Input matrix has %d entries", length(res@index)))) 13. asMethod(object) 12. as(from, "dgCMatrix") 11. as.matrix(as(from, "dgCMatrix")) 10. asMethod(object) 9. as(x, "matrix") 8. as.matrix.IterableMatrix(X) 7. as.matrix(X) 6. apply(X = data.use, MARGIN = 1L, FUN = var) 5. PrepDR5(object = object, features = features, layer = layer, verbose = verbose) 4. RunPCA.StdAssay(object = object[[assay]], assay = assay, features = features, npcs = npcs, rev.pca = rev.pca, weight.by.var = weight.by.var, verbose = verbose, ndims.print = ndims.print, nfeatures.print = nfeatures.print, reduction.key = reduction.key, seed.use = seed.use, ...) 3. RunPCA(object = object[[assay]], assay = assay, features = features, npcs = npcs, rev.pca = rev.pca, weight.by.var = weight.by.var, verbose = verbose, ndims.print = ndims.print, nfeatures.print = nfeatures.print, reduction.key = reduction.key, seed.use = seed.use, ...) 2. RunPCA.Seurat(filtered_obj_sketched) 1. RunPCA(filtered_obj_sketched)

`

bnprks commented 3 months ago

Please do let me know once you get an example set up, though please also make sure you have the latest BPCells and Seurat versions installed as I described above.

This particular error message indicates to me that you don't have the most recent version of BPCells installed and your version of Seurat also doesn't match the latest from the develop branch. The key problem here is the Seurat version. I've included some instructions below to check if you have the necessary fixes applied in your current running Seurat version:

How to check for Seurat version Run: `Seurat:::PrepDR5`, which should print out its source code. In the correct version, you should see the lines: ```r if (is(data.use, "IterableMatrix")) { features.var <- BPCells::matrix_stats(matrix=data.use, row_stats="variance")$row_stats["variance",] } else { features.var <- apply(X = data.use, MARGIN = 1L, FUN = var) } features.keep <- features[features.var > 0] ``` If instead you see these versions then you still have the buggy Seurat version ```r features.var <- apply(X = data.use, MARGIN = 1L, FUN = var) features.keep <- features[features.var > 0] ```
How to check for BPCells version Run: ```r library(BPCells) getMethod("coerce", signature(from="IterableMatrix", to="dgCMatrix")) ``` If you see this line then you have a recent-enough version: ```r sprintf("Input matrix has %0.f entries", length(res@index)))) ``` If instead you see this line then your version is too old: ```r sprintf("Input matrix has %d entries", length(res@index)))) ```

Short of installing a full new Seurat version, there's also a quick script that you can run to temporarily work around the relevant Seurat bug:

Script to source/run to fix RunPCA without re-installing Seurat ```r fixed_PrepDR5 <- function(object, features = NULL, layer = 'scale.data', verbose = TRUE) { layer <- layer[1L] olayer <- layer layer <- SeuratObject::Layers(object = object, search = layer) if (is.null(layer)) { abort(paste0("No layer matching pattern '", olayer, "' not found. Please run ScaleData and retry")) } data.use <- SeuratObject::LayerData(object = object, layer = layer) features <- features %||% VariableFeatures(object = object) if (!length(x = features)) { stop("No variable features, run FindVariableFeatures() or provide a vector of features", call. = FALSE) } if (is(data.use, "IterableMatrix")) { features.var <- BPCells::matrix_stats(matrix=data.use, row_stats="variance")$row_stats["variance",] } else { features.var <- apply(X = data.use, MARGIN = 1L, FUN = var) } features.keep <- features[features.var > 0] if (!length(x = features.keep)) { stop("None of the requested features have any variance", call. = FALSE) } else if (length(x = features.keep) < length(x = features)) { exclude <- setdiff(x = features, y = features.keep) if (isTRUE(x = verbose)) { warning( "The following ", length(x = exclude), " features requested have zero variance; running reduction without them: ", paste(exclude, collapse = ', '), call. = FALSE, immediate. = TRUE ) } } features <- features.keep features <- features[!is.na(x = features)] features.use <- features[features %in% rownames(data.use)] if(!isTRUE(all.equal(features, features.use))) { missing_features <- setdiff(features, features.use) if(length(missing_features) > 0) { warning_message <- paste("The following features were not available: ", paste(missing_features, collapse = ", "), ".", sep = "") warning(warning_message, immediate. = TRUE) } } data.use <- data.use[features.use, ] return(data.use) } assignInNamespace('PrepDR5', fixed_PrepDR5, 'Seurat') ```

And in case you're curious to know a bit more of what's going wrong behind the scenes, here's some information:

A few more technical details about this problem The core problem that caused your original bug report back in March is that Seurat's `RunPCA` function calls `PrepDR5`, which as one step tries to filter out all genes that have zero variance (i.e. have the same value measured in every cell). The way it did this required converting a full BPCells matrix into memory, which effectively negated all of the benefits of the BPCells disk-backed approach. I submitted a fix to Seurat that instead used BPCells existing disk-backed functions to do the zero-variance check. As for what specifically goes wrong: BPCells forms dense matrix objects in memory by first converting to a dgCMatrix, then converting that to a base R matrix. Unfortunately, dgCMatrix objects cannot handle more than 2^31 non-zero values and cause the segfault and abrupt crashes you originally experienced. I added in a check in BPCells to prevent forming a dgCMatrix object with more data than it can handle. This means BPCells will now throw an error but it will prevent crashing your whole R session. It is worth noting that Seurat without BPCells suffers the same crashes on datasets over ~1M cells due to the same dgCMatrix problem. With the fix I made for Seurat, RunPCA will never do the erroneous conversion to an in-memory matrix, thus preventing the problem entirely when using BPCells. Unfortunately, Seurat has been a little slow getting the fix into a public CRAN release, so it requires a bit of work to use the fixed version still.