bnprks / BPCells

Scaling Single Cell Analysis to Millions of Cells
https://bnprks.github.io/BPCells
Other
167 stars 17 forks source link

`write_matrix_hdf5` crashes when wrapped in other function #52

Closed ycli1995 closed 1 year ago

ycli1995 commented 1 year ago

Hi, @bnprks ,

When I'm trying to wrap write_matrix_hdf5() into my package, it sometimes crashes the R session.

I design a function writeH5BPCE() to write a IterableMatrix, along with the metadata or something else related to the project, into an HDF5 file. Everything seems fine when I try only single omics.

library(BPCells.Experiment)

## Load the example RNA + ATAC dataset in BPCells.Experiment
scme <- load_example_scme()
scme
Default experiment:  RNA 
reducedDimNames(0):
colPairNames(0):
A SingleCellMultiExperiment object of 2 listed
 experiments with user-defined names and respective classes.
 Containing an ExperimentList class object of length 2:
 [1] RNA: BPCExperiment with 217 rows and 90 columns
 [2] ATAC: ChromBPCExperiment with 1308 rows and 90 columns
Functionality:
 experiments() - obtain the ExperimentList instance
 colData() - the primary/phenotype DataFrame
 sampleMap() - the sample coordination DataFrame
 `$`, `[`, `[[` - extract colData columns, subset, or experiment
 *Format() - convert into a long or wide DataFrame
 assays() - convert ExperimentList to a SimpleList of matrices
 exportClass() - save data to flat files

Write the RNA experiment into H5, and it worked. In this scope, the function to write a 10xMatrixH5 into an H5 group mainly just wraps write_matrix_hdf5() with some verbose messages.

bpce <- writeH5BPCE(scme[[1]], "bpce.h5", overwrite = TRUE)
Writing assay 'counts'
Writing a IterableMatrix:
 Class: 10xMatrixH5
 File: /develop/ycli1995/BPCells.testwrapper/bpce.h5
 Group: /assays/counts
Writing rowRanges
Writing rowData
Writing colData
Writing metadata
Writing altExps
Writing reducedDims
Writing colPairs
Writing rowPairs
Add H5 attribute indicating S4 class: BPCExperiment
Updating altExps
Updating assays
Reading assay 'counts'

Similarly, writing the ATAC experiment into H5 also worked.

cbpce <- writeH5BPCE(scme[[2]], "cbpce.h5", overwrite = TRUE)
cbpce
Writing assay 'counts'
Writing a IterableMatrix:
 Class: RenameDims
 File: /develop/ycli1995/BPCells.testwrapper/cbpce.h5
 Group: /assays/counts
Writing rowRanges
Writing rowData
Writing colData
Writing metadata
Writing altExps
Writing reducedDims
Writing colPairs
Writing rowPairs
Writing fragments
Writing a IterableFragments:
 Class: CellSelectName
 File: /develop/ycli1995/BPCells.testwrapper/cbpce.h5
 Group: /fragments
Writing geneAnnot
Writing seqinfo
Add H5 attribute indicating S4 class: ChromBPCExperiment
Updating altExps
Updating assays
Reading assay 'counts'
Updating fragments
Reading fragments

However, for the multi-omics data it failed.

scme2 <- writeH5SCME(scme, "scme.h5", overwrite = TRUE)
# Writing experiment: RNA
Writing assay 'counts'
Writing a IterableMatrix:
 Class: 10xMatrixH5
 File: /develop/ycli1995/BPCells.testwrapper/scme.h5
 Group: /experiments/RNA/assays/counts

Basically, writeH5SCME() just extracts the RNA and ATAC experiments, and calls writeH5BPCE() as above in a for loop. It should have worked. However, writeH5SCME() stucked while writing the 10xMatrixH5 into a H5 group, according to the progress messages. No output followed anymore, neither error nor warning. The R session just somehow crashes, and ctrl+c cannot interupt the program. In the meantime, if I check htop, the R session was just hanging out there and doing nothing. 屏幕截图 2023-10-26 200547

If I manually extract those two experiments on the console, and call writeH5BPCE() in a for loop, exactly like what writeH5SCME() does, it worked without any interuption!

file <- "scme2.h5"
name <- "/"
gzip_level <- 0L
verbose <- TRUE
object <- h5Prep(scme)
exp.names <- names(x = object$experiments)
for (i in exp.names) {
  if (verbose) {
    message("# Writing experiment: ", i)
  }
  writeH5BPCE(
    object = object$experiments[[i]],
    file = file,
    name = file.path(name, "experiments", i),
    gzip_level = gzip_level,
    verbose = verbose
  )
  BPCells.Experiment:::.set_h5attr_bpce(
    object = exp,
    file = file,
    name = file.path(name, "experiments", i),
    verbose = verbose
  )
  if (verbose) {
    message("Done\n")
  }
}
# Writing experiment: RNA
Writing assay 'counts'
Writing a IterableMatrix:
 Class: 10xMatrixH5
 File: /develop/ycli1995/BPCells.testwrapper/scme2.h5
 Group: /experiments/RNA/assays/counts
Writing rowRanges
Writing rowData
Writing colData
Writing metadata
Writing altExps
Writing reducedDims
Writing colPairs
Writing rowPairs
Add H5 attribute indicating S4 class: BPCExperiment
Updating altExps
Updating assays
Reading assay 'counts'
Done

# Writing experiment: ATAC
Writing assay 'counts'
Writing a IterableMatrix:
 Class: RenameDims
 File: /develop/ycli1995/BPCells.testwrapper/scme2.h5
 Group: /experiments/ATAC/assays/counts
Writing rowRanges
Writing rowData
Writing colData
Writing metadata
Writing altExps
Writing reducedDims
Writing colPairs
Writing rowPairs
Writing fragments
Writing a IterableFragments:
 Class: CellSelectName
 File: /develop/ycli1995/BPCells.testwrapper/scme2.h5
 Group: /experiments/ATAC/fragments
Writing geneAnnot
Writing seqinfo
Add H5 attribute indicating S4 class: ChromBPCExperiment
Updating altExps
Updating assays
Reading assay 'counts'
Updating fragments
Reading fragments
Done

Considering all of above, I suspect there must be something wrong when I call write_matrix_hdf5() in the function calling stack. However, I cannot find any clue why the writing didn't work inside writeH5BPCE(), but worked outside it. This has been trapping me for a few days. Is there any advice from you? I push a repository in case that you may need to reproduce the results: https://github.com/ycli1995/BPCells.testwrapper

bnprks commented 1 year ago

Hi @ycli1995, interesting find, and definitely not intended behavior. I can look at this some on my end to see if I can reproduce and locate the issue. I've described the strategy I'll use below, since I personally struggled a lot with debugging R/C++ code until I figured out how to get a proper debugging setup.

For general advice on diagnosing + debugging tricky errors like this, I think debuggers are the tool to reach for:

Personally, I've made a setup for convenient C++ debugging from VS Code. I write an R script that triggers the bug, then run it under a custom debugger profile. This allows me to set breakpoints and step line-by-line through C++ code that is called from R, and it's how I work through all the trickiest bugs that come up when writing BPCells C++ code. Unfortunately I've only ever gotten it to work reliably on Linux or in WSL on Windows. Despite many attempts I've never gotten it to work reliably on Macs.

Instructions for my VS Code C++ debugging setup **Note:** you'll need to edit the path to `Rscript` in the configuration below to match your system's installation path ### Basic workflow 0. Configure debugging - Make sure C/C++ VS Code extension is installed - Add following config to launch.json ``` { // Use IntelliSense to learn about possible attributes. // Hover to view descriptions of existing attributes. // For more information, visit: https://go.microsoft.com/fwlink/?linkid=830387 "version": "0.2.0", "configurations": [ { "name": "Rscript gdb", "type": "cppdbg", "request": "launch", "program": "/share/software/user/open/R/4.0.2/bin/Rscript", "args": [ "--vanilla", "${file}" ], "cwd": "${workspaceFolder}", } ] } ``` 1. Open VSCode remote connection to the target machine 2. Open R script that has the test code I want to run - Tip: Use `devtools::with_debug(devtools::load_all())` In order to get the script to compile the current code in debug mode for easier debugging experience 3. Run debugger with Rscript gdb!! Notes: - I got an illegal instruction exception once, I think due to compilation being out-of-sync
ycli1995 commented 1 year ago

Hi @bnprks, Thanks for the reply. I've tried debugonce(write_matrix_hdf5) and run writeH5SCME. It did stuck once I clicked Enter as it came to the internal write_function:

function (matrix, file, group, buffer_size, chunk_size, allow_overwrite, 
  row_major, gzip_level) 
{
  invisible(.Call(`_BPCells_write_packed_matrix_hdf5_uint32_t_cpp`, 
    matrix, file, group, buffer_size, chunk_size, allow_overwrite, 
    row_major, gzip_level))
}

I will try your advices about the c++ debuger things, although it may take some time and efforts for me to configure my vscode.

ycli1995 commented 1 year ago

Hi, @bnprks . After trying Rscript gdb as you recommanded, I probably located the c++ code that stucked. The details what I found has been described below:

When I stepped into run_with_R_interrupt_check, it can be excuted normally. I can even go into the while loop. So I further set the breaking point at if (interrupt) line and tried to see what would happen. As I debugged it again, it just stucked and never reached my breaking point.

So I guess maybe the run_with_R_interrupt_check is where the issue raises from? Unfortunately I'm not a computer science major, and the async related programming is quite complex for me. If there is any extra information you need for further debugging, please let me know. Thanks.

bnprks commented 1 year ago

Hi @ycli1995, thanks so much for taking a look with the debugger and props for getting all the debugger setup to work!

A bit of background on run_with_R_interrupt_check: that function is a bit tricky for step-by-step execution. What it does is spawn the real function to be called in a background thread, then it loops the main thread checking if the user has pressed Ctrl-C. So if you step through in gdb you'll get the waiting thread, not the worker thread and it's a bit of a red herring.

From your investigation, it's clear that the problem has to do with HDF5 file access operations. I also loaded up your example and got the freezing behavior as well. One tactic I tried out was running in VS Code, then pausing once it got stuck so I could see what was running. I've copied a bit of the observed stack trace below, where we can see the HDF5 library is getting stuck on "H5TS_mutex_lock", being called from within "BPCells::H5NumReader::load".

Stack trace at pause point ``` libhdf5_serial.so.103!H5TS_mutex_lock (Unknown Source:0) libhdf5_serial.so.103!H5Screate_simple (Unknown Source:0) BPCells.so!HighFive::DataSpace::DataSpace<__gnu_cxx::__normal_iterator > >, __gnu_cxx::__normal_iterator > > >(HighFive::DataSpace * const this) (\usr\include\c++\11\bits\stl_vector.h:918) BPCells.so!HighFive::DataSpace::DataSpace(const std::vector > & dims, HighFive::DataSpace * const this) (\tmp\RtmpQmYYnU\R.INSTALL710355f80b9b\bnprks-BPCells-63986ce\src\lib\highfive\bits\H5Dataspace_misc.hpp:25) BPCells.so!HighFive::SliceTraits::select(const HighFive::SliceTraits * const this, const std::vector > & offset, const std::vector > & count, const std::vector > & stride, const std::vector > & block) (\tmp\RtmpQmYYnU\R.INSTALL710355f80b9b\bnprks-BPCells-63986ce\src\lib\highfive\bits\H5Slice_traits_misc.hpp:101) BPCells.so!BPCells::H5NumReader::load(BPCells::H5NumReader * const this, unsigned int * out, uint64_t count) (\tmp\RtmpQmYYnU\R.INSTALL710355f80b9b\bnprks-BPCells-63986ce\src\arrayIO\hdf5.h:83) ```

I'll look into this a bit more, though it seems like the bug might be hard to figure out if the symptom is found only in HDF5 locking internals. Remaining hypotheses that I have:

bnprks commented 1 year ago

I've narrowed down what code makes the difference, though I'm still not sure exactly what's going wrong. It seems that if the function .h5ovewrite_bpce is called first, then a deadlock will happen. If that code is left out as in your working example then everything is fine.

Example code ```R library(BPCells.Experiment) ## Load the example RNA + ATAC dataset in BPCells.Experiment scme <- load_example_scme() file <- "scme2.h5" unlink(file) name <- "/" gzip_level <- 0L verbose <- TRUE # This code causes the problem, whether we call .h5overwrite_bpce or the underlying hdf5r functions, # even if we try very hard to make sure all the hdf5r stuff is cleaned up before getting to BPCells-related code #BPCells.Experiment:::.h5overwrite_bpce(file = file, name = name, overwrite = FALSE) file_handle <- hdf5r::H5File$new(file, "w") file_handle$close_all() rm(file_handle) gc() object <- h5Prep(scme) exp.names <- names(x = object$experiments) for (i in exp.names) { if (verbose) { message("# Writing experiment: ", i) } writeH5BPCE( object = object$experiments[[i]], file = file, name = file.path(name, "experiments", i), gzip_level = gzip_level, verbose = verbose ) BPCells.Experiment:::.set_h5attr_bpce( object = exp, file = file, name = file.path(name, "experiments", i), verbose = verbose ) if (verbose) { message("Done\n") } } ```

If you're able to avoid creating empty hdf5 files with hdf5r and instead letting BPCells initialize them automatically, then that might be a workaround.

I still don't fully understand what's going wrong, so I can't say if it's the fault of BPCells, hdf5r, or the hdf5 library itself.

ycli1995 commented 1 year ago

Thank you so much for all the help! I'll try modifying codes in my package to let BPCells initialize HDF5 files as your suggestion, and see if the original issue can be fixed.

ycli1995 commented 1 year ago

I've narrowed down what code makes the difference, though I'm still not sure exactly what's going wrong. It seems that if the function .h5ovewrite_bpce is called first, then a deadlock will happen. If that code is left out as in your working example then everything is fine.

It seems that you are right. When I remove the .h5ovewrite_bpce in writeH5SCME, it finally works!

I still don't fully understand what's going wrong, so I can't say if it's the fault of BPCells, hdf5r, or the hdf5 library itself.

The weirdest thing is that .h5ovewrite_bpce is called in both writeH5BPCE and writeH5SCME, but it works normally for writeH5BPCE. Well, maybe I need to spend some efforts to see if I can get around the .h5ovewrite_bpce call in writeH5SCME. Thanks again for digging the codes and those useful suggestions for Rcpp debug!

bnprks commented 1 year ago

Good luck, hope you get it working! By the way, BPCells.Experiment seems like a great project idea. If you'd like to chat about anything BPCells-related I'd me more than happy to -- feel free to email me at bparks @ stanford.edu, and definitely keep letting me know of github issues whenever something comes up for you