Closed ycli1995 closed 1 year ago
Hi @ycli1995, interesting find, and definitely not intended behavior. I can look at this some on my end to see if I can reproduce and locate the issue. I've described the strategy I'll use below, since I personally struggled a lot with debugging R/C++ code until I figured out how to get a proper debugging setup.
For general advice on diagnosing and debugging tricky errors like this, I think debuggers are the tool to reach for:
The first question is which call inside write_matrix_hdf5() is getting stuck. For this, I usually use the debugonce() builtin function (some instructions in the Advanced R book). This looks nicest in RStudio, but works fine from the command line.
Personally, I've made a setup for convenient C++ debugging from VS Code. I write an R script that triggers the bug, then run it under a custom debugger profile. This allows me to set breakpoints and step line-by-line through C++ code that is called from R, and it's how I work through all the trickiest bugs that come up when writing BPCells C++ code. Unfortunately I've only ever gotten it to work reliably on Linux or in WSL on Windows. Despite many attempts I've never gotten it to work reliably on Macs.
Hi @bnprks, thanks for the reply. I tried debugonce(write_matrix_hdf5) and then ran writeH5SCME. It got stuck as soon as I pressed Enter at the internal write_function:
function (matrix, file, group, buffer_size, chunk_size, allow_overwrite,
    row_major, gzip_level)
{
    invisible(.Call(`_BPCells_write_packed_matrix_hdf5_uint32_t_cpp`,
        matrix, file, group, buffer_size, chunk_size, allow_overwrite,
        row_major, gzip_level))
}
I will try your advice about C++ debugging, although it may take me some time and effort to configure VS Code.
Hi @bnprks. After trying Rscript gdb as you recommended, I think I have located the C++ code that gets stuck. The details of what I found are described below:
When executing the C++ code line-by-line with the debugger, I found that it got stuck at this line:
https://github.com/bnprks/BPCells/blob/63986cec4d362be6aaecef62625b0aa4111f48f6/src/arrayIO/hdf5.cpp#L81
I guessed that the cause might be an attempt to open a non-existent H5 group, since the matrix was meant to be written into a new file. So I added a step to create the target H5 group before calling write_matrix_hdf5. When I ran Rscript gdb again, the getGroup call worked.
However, it still got stuck after that. This time it stopped here: https://github.com/bnprks/BPCells/blob/63986cec4d362be6aaecef62625b0aa4111f48f6/src/matrix_io.cpp#L243-L251
With Rscript gdb I figured out that the code that got stuck was the run_with_R_interrupt_check call. So I tried stepping through it here:
https://github.com/bnprks/BPCells/blob/63986cec4d362be6aaecef62625b0aa4111f48f6/src/R_interrupts.h#L15-L37
When I stepped into run_with_R_interrupt_check, it executed normally. I could even get into the while loop. So I set a breakpoint at the if (interrupt) line to see what would happen. When I ran the debugger again, it just got stuck and never reached my breakpoint.
So I guess run_with_R_interrupt_check may be where the issue arises? Unfortunately I'm not a computer science major, and async-related programming is quite complex for me. If you need any extra information for further debugging, please let me know. Thanks.
Hi @ycli1995, thanks so much for taking a look with the debugger and props for getting all the debugger setup to work!
A bit of background on run_with_R_interrupt_check: that function is a bit tricky for step-by-step execution. It spawns the real function in a background thread, then loops on the main thread checking whether the user has pressed Ctrl-C. So if you step through in gdb you'll follow the waiting thread, not the worker thread, which makes it a bit of a red herring.
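To make the worker/waiter split concrete, here is a minimal sketch of that general pattern in plain C++. It is not the actual BPCells implementation: the function name run_with_interrupt_check_sketch, the InterruptCheck callback, and the 10 ms polling interval are all assumptions made for illustration, and the callback stands in for whatever R-specific check the real code uses.

```cpp
#include <atomic>
#include <chrono>
#include <functional>
#include <future>
#include <stdexcept>

// Hypothetical stand-in for an R-level "did the user press Ctrl-C?" check.
using InterruptCheck = std::function<bool()>;

// Illustrative sketch only: run `work` on a background thread while the
// calling thread polls for interrupts. `work` receives a stop flag that it
// is expected to check periodically so it can exit early.
template <typename Work>
void run_with_interrupt_check_sketch(Work work, InterruptCheck user_interrupted) {
    std::atomic<bool> interrupt(false);

    // Worker thread: this is where the real work (e.g. file I/O) happens.
    std::future<void> task =
        std::async(std::launch::async, [&]() { work(interrupt); });

    // Waiting thread: only polls. Stepping through this function in a
    // debugger follows this loop, not the worker.
    while (task.wait_for(std::chrono::milliseconds(10)) !=
           std::future_status::ready) {
        if (user_interrupted()) {
            interrupt = true; // ask the worker to stop
        }
    }

    task.get(); // rethrows any exception raised inside the worker

    if (interrupt) {
        throw std::runtime_error("interrupted by user");
    }
}
```

If the worker blocks forever (for example on a lock it can never acquire), the polling loop never sees the future become ready, so a breakpoint placed after the loop is never reached. With a setup like this, pausing the debugger and inspecting all threads is usually more informative than single-stepping the waiting thread.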
From your investigation, it's clear that the problem has to do with HDF5 file access operations. I also loaded up your example and got the freezing behavior as well. One tactic I tried out was running in VS Code, then pausing once it got stuck so I could see what was running. I've copied a bit of the observed stack trace below, where we can see the HDF5 library is getting stuck on "H5TS_mutex_lock", being called from within "BPCells::H5NumReader::load".
I'll look into this a bit more, though it seems like the bug might be hard to figure out if the symptom is found only in HDF5 locking internals. Remaining hypotheses that I have:
An interaction between hdf5r operations and BPCells operations. I know hdf5r might keep HDF5 files open, while BPCells tries to open and close files quickly so they don't stay open. Perhaps hdf5r keeping files open is causing problems for BPCells.
I've narrowed down what code makes the difference, though I'm still not sure exactly what's going wrong. It seems that if the function .h5ovewrite_bpce is called first, then a deadlock will happen. If that code is left out, as in your working example, then everything is fine.
If you're able to avoid creating empty HDF5 files with hdf5r and instead let BPCells initialize them automatically, then that might be a workaround.
I still don't fully understand what's going wrong, so I can't say if it's the fault of BPCells, hdf5r, or the hdf5 library itself.
Thank you so much for all the help! I'll try modifying the code in my package to let BPCells initialize HDF5 files as you suggested, and see if that fixes the original issue.
> I've narrowed down what code makes the difference, though I'm still not sure exactly what's going wrong. It seems that if the function .h5ovewrite_bpce is called first, then a deadlock will happen. If that code is left out as in your working example then everything is fine.
It seems that you are right. When I remove the .h5ovewrite_bpce call in writeH5SCME, it finally works!
> I still don't fully understand what's going wrong, so I can't say if it's the fault of BPCells, hdf5r, or the hdf5 library itself.
The weirdest thing is that .h5ovewrite_bpce is called in both writeH5BPCE and writeH5SCME, but it works normally for writeH5BPCE. Well, maybe I need to spend some effort to see if I can get around the .h5ovewrite_bpce call in writeH5SCME. Thanks again for digging into the code and for the useful suggestions on Rcpp debugging!
Good luck, hope you get it working! By the way, BPCells.Experiment seems like a great project idea. If you'd like to chat about anything BPCells-related I'd be more than happy to. Feel free to email me at bparks @ stanford.edu, and definitely keep letting me know via GitHub issues whenever something comes up for you.
Hi @bnprks,
When I try to wrap write_matrix_hdf5() in my package, it sometimes crashes the R session.
I designed a function writeH5BPCE() to write an IterableMatrix, along with the metadata and other project-related data, into an HDF5 file. Everything seems fine when I try only a single omics layer.
Writing the RNA experiment into H5 worked. In this scope, the function that writes a 10xMatrixH5 into an H5 group mainly just wraps write_matrix_hdf5() with some verbose messages.
Similarly, writing the ATAC experiment into H5 also worked.
However, for the multi-omics data it failed.
Basically, writeH5SCME() just extracts the RNA and ATAC experiments and calls writeH5BPCE() as above in a for loop. It should have worked. However, according to the progress messages, writeH5SCME() got stuck while writing the 10xMatrixH5 into an H5 group. No further output followed, neither error nor warning. The R session seems to crash, and Ctrl+C cannot interrupt the program. Meanwhile, htop shows the R session just sitting there and doing nothing.
If I manually extract those two experiments on the console and call writeH5BPCE() in a for loop, exactly as writeH5SCME() does, it works without any interruption!
Considering all of the above, I suspect something goes wrong when I call write_matrix_hdf5() within the function call stack. However, I cannot find any clue why the writing didn't work inside writeH5BPCE(), but worked outside it. I have been stuck on this for a few days. Do you have any advice? I pushed a repository in case you need to reproduce the results: https://github.com/ycli1995/BPCells.testwrapper