Closed — hpages closed this issue 1 month ago.
I'm not sure this is a problem with the matrix multiplication code. If I had to guess, `SnowParam` creates a new child R process with a new temporary directory. The HDF5 realization sink dumps content into this different `tempdir()`, but that location is destroyed when the child process finishes its work, leading to an error in the parent process. It seems like the realization sink should record the `tempdir()` of the parent process so that these temporary files are generated in the right place.
Hmm, that's tricky. My understanding is that, with some cluster configurations, the nodes don't have access to the `tempdir()` of the head node. So the broader question is whether there is a reliable place where the workers can write files that are guaranteed to be accessible from the head node. And if there is no such place, then we should no longer implement parallel algorithms that pass data back to the main process via files under `tempdir()`. This would be a game changer.
batchtools handles this by assuming that all workers have access to the working directory, which I think is pretty reasonable. So one approach would be to have the realization sinks dump their files into a subdirectory of the working directory. In fact, I might even say this is preferable, as the files don't get deleted when the session closes, so any `HDF5Matrix` objects that were saved in the workspace or as separate RDS files remain valid. Users can then decide whether or not to delete the files afterwards.
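For what it's worth, that working-directory approach could be sketched along these lines (a sketch only, not an implementation; the `.realization_sink_dumps` directory name is made up):

```r
## Sketch: point the HDF5 dump dir at a subdirectory of the working
## directory, which batchtools-style setups assume is shared between
## the head node and the workers.
library(HDF5Array)
dumpdir <- file.path(getwd(), ".realization_sink_dumps")
dir.create(dumpdir, showWarnings=FALSE, recursive=TRUE)
setHDF5DumpDir(dumpdir)  # subsequent HDF5 realizations write here
```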
I could certainly add something like

```r
## identical() guards against getAutoRealizationBackend() returning NULL
if (identical(getAutoRealizationBackend(), "HDF5Array")) {
    user_data_dir <- tools::R_user_dir(package="DelayedArray", which="data")
    dumpdir <- file.path(user_data_dir, basename(tempdir()))
    dir.create(dumpdir, showWarnings=FALSE, recursive=TRUE)
    ## capture the current dump dir so it can be restored on exit
    old_dumpdir <- HDF5Array::getHDF5DumpDir()
    HDF5Array::setHDF5DumpDir(dumpdir)
    on.exit(HDF5Array::setHDF5DumpDir(old_dumpdir))
}
```

before the call to `realize()` in `DelayedArray:::.super_BLOCK_mult()`. That seems to fix the SnowParam case.
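For reference, the scenario this targets can be reproduced along these lines (a sketch; the matrix dimensions and worker count are arbitrary):

```r
## Sketch: HDF5-backed block matrix multiplication with SnowParam
## workers, each of which gets its own tempdir().
library(DelayedArray)
library(HDF5Array)
library(BiocParallel)
setAutoRealizationBackend("HDF5Array")
setAutoBPPARAM(SnowParam(workers=2))
m <- DelayedArray(matrix(runif(1e4), nrow=100))
res <- m %*% t(m)  # blocks realized on the workers via the HDF5 sink
```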
However:
We obviously need a cleaner mechanism. In particular, the mechanism should be backend agnostic. My understanding is that, by default, TileDBArray will also write the array data to a place under `tempdir()`, e.g. when doing `as(m, "TileDBArray")`. So right before an on-disk realization is about to be performed, we need a way to tell whatever on-disk realization backend is about to be used: "Hey, you must write the data to this persistent location."
I still need to figure out a good mechanism for removing the files created by the workers after they're no longer needed.
I'm hesitant to change the default behavior of `setHDF5DumpDir()`, that is, to change the location where things like `as(m, "HDF5Array")` or `realize(m, BACKEND="HDF5Array")` write the array data by default. Most of the time these files are meant to be temporary. Sure, using a persistent location by default would help support the use case where people serialize HDF5Array objects, but that's no big win because doing so is almost always a bad idea. Note that an exception in the near future will be objects that get created with something like `HDF5Array("HubID:EH1039")`, but these objects will already have their data in a persistent location, so they will be safe for serialization. OTOH, I do worry that having `setHDF5DumpDir()` set the "dump dir" to a persistent location by default would pollute the user's home directory with a bunch of files that were meant to be temporary.
This was addressed last February in HDF5Array 1.31.4 (https://github.com/Bioconductor/HDF5Array/commit/48ba2888112efa57f4adda7b4b849cb4784a1ff1).
@LTLA Do you think you can look into this?
Thanks, H.
`sessionInfo()`: