Bioconductor / DelayedArray

A unified framework for working transparently with on-disk and in-memory array-like datasets
https://bioconductor.org/packages/DelayedArray

Parallel matrix multiplication with HDF5Array realization and BiocParallel::SnowParam() fails #76

Closed hpages closed 1 month ago

hpages commented 3 years ago

@LTLA Do you think you can look into this?

library(DelayedArray)
m <- matrix(1:12, nrow=3)

setAutoRealizationBackend("HDF5Array")
DelayedArray(m) %*% DelayedArray(t(m))
# <3 x 3> matrix of class HDF5Matrix and type "double":
#      [,1] [,2] [,3]
# [1,]  166  188  210
# [2,]  188  214  240
# [3,]  210  240  270

setAutoBPPARAM(BiocParallel::MulticoreParam(workers=2))
DelayedArray(m) %*% DelayedArray(t(m))
# <3 x 3> matrix of class HDF5Matrix and type "double":
#      [,1] [,2] [,3]
# [1,]  166  188  210
# [2,]  188  214  240
# [3,]  210  240  270

setAutoBPPARAM(BiocParallel::SnowParam(workers=2))
DelayedArray(m) %*% DelayedArray(t(m))
# Error in h(simpleError(msg, call)) : 
#   error in evaluating the argument 'x' in selecting a method for function 'type': failed to open file '/tmp/RtmppdNsf6/HDF5Array_dump/auto00001.h5'

Thanks, H.

sessionInfo():

R version 4.0.2 Patched (2020-06-24 r78746)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 16.04.7 LTS

Matrix products: default
BLAS:   /home/hpages/R/R-4.0.r78746/lib/libRblas.so
LAPACK: /home/hpages/R/R-4.0.r78746/lib/libRlapack.so

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
 [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8    
 [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
 [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
 [9] LC_ADDRESS=C               LC_TELEPHONE=C            
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       

attached base packages:
[1] parallel  stats4    stats     graphics  grDevices utils     datasets 
[8] methods   base     

other attached packages:
[1] HDF5Array_1.18.0     rhdf5_2.34.0         DelayedArray_0.16.0 
[4] IRanges_2.24.0       S4Vectors_0.28.0     MatrixGenerics_1.2.0
[7] matrixStats_0.57.0   BiocGenerics_0.36.0  Matrix_1.2-18       

loaded via a namespace (and not attached):
[1] lattice_0.20-41     snow_0.4-3          rhdf5filters_1.2.0 
[4] grid_4.0.2          Rhdf5lib_1.12.0     BiocParallel_1.24.0
[7] tools_4.0.2         compiler_4.0.2     

LTLA commented 3 years ago

I'm not sure this is a problem with the matrix multiplication code. If I had to guess, SnowParam creates new child R processes, each with its own temporary directory. The HDF5 realization sink dumps its content into that different tempdir(), but that location is destroyed when the child process finishes its work, leading to an error in the parent process. Seems like the realization sink should record the tempdir() of the parent process so that these temporary files are generated in the right place.
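
A quick way to check that guess (a sketch; assumes nothing beyond BiocParallel being installed):

library(BiocParallel)

# The parent session's per-session temporary directory:
tempdir()

# SnowParam workers are fresh R processes, so each reports its own tempdir(),
# which goes away when that worker shuts down:
bplapply(1:2, function(i) tempdir(), BPPARAM=SnowParam(workers=2))

# MulticoreParam workers are forks of the parent and inherit its tempdir(),
# which would explain why the MulticoreParam case above works:
bplapply(1:2, function(i) tempdir(), BPPARAM=MulticoreParam(workers=2))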

hpages commented 3 years ago

Hmm, that's tricky. My understanding is that with some cluster configurations the nodes don't have access to the tempdir() of the head node. So the broader question is whether there is a reliable place where the workers can write files that are guaranteed to be accessible from the head node. And if there is no such place, then we should no longer implement parallel algorithms that pass stuff back to the main process via files under tempdir(). This would be a game changer.

LTLA commented 3 years ago

batchtools handles this by assuming that all workers have access to the working directory, which I think is a pretty reasonable assumption. So one approach would be to have the realization sinks dump their files into a subdirectory of the working directory. In fact, I might even say this is preferable, as the files don't get deleted when the session closes, so any HDF5Matrix objects that were saved in the workspace or as separate RDS files remain valid. Users can then decide whether or not to delete the subdirectory afterwards.
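
A rough illustration of the idea, done by hand with the existing HDF5Array dump-dir setting (the subdirectory name is made up here):

# Hypothetical: point the HDF5 dump directory at a subdirectory of the
# working directory, which (as with batchtools) is assumed to be visible to
# both the workers and the parent process.
dumpdir <- file.path(getwd(), "DelayedArray_dump")
dir.create(dumpdir, showWarnings=FALSE)
HDF5Array::setHDF5DumpDir(dumpdir)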

hpages commented 3 years ago

I could certainly add something like

    if (getAutoRealizationBackend() == "HDF5Array") {
        user_data_dir <- tools::R_user_dir(package="DelayedArray", which="data")
        dumpdir <- file.path(user_data_dir, basename(tempdir()))
        dir.create(dumpdir, showWarnings=FALSE, recursive=TRUE)
        old_dumpdir <- HDF5Array::setHDF5DumpDir(dumpdir)
        on.exit(HDF5Array::setHDF5DumpDir(old_dumpdir))
    }

before the call to realize() in DelayedArray:::.super_BLOCK_mult(). That seems to fix the SnowParam case.
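
That is, re-running the SnowParam case from the original report:

    setAutoRealizationBackend("HDF5Array")
    setAutoBPPARAM(BiocParallel::SnowParam(workers=2))
    DelayedArray(m) %*% DelayedArray(t(m))   # no longer fails with the tempdir error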

However:

  1. We obviously need a cleaner mechanism. In particular, the mechanism should be backend-agnostic. My understanding is that by default TileDBArray will also write the array data to a location under tempdir(), e.g. when doing as(m, "TileDBArray"). So right before an on-disk realization is about to be performed, we need a way to tell whatever on-disk realization backend is about to be used: "Hey, you must write the data to this persistent location." (See the sketch after this list.)

  2. I still need to figure out a good mechanism for removing the files created by the workers after they're no longer needed.

  3. I'm hesitant to change the default behavior of setHDF5DumpDir(), that is, to change where things like as(m, "HDF5Array") or realize(m, BACKEND="HDF5Array") write the array data by default. Most of the time these files are meant to be temporary. Sure, using a persistent location by default would help support the use case where people serialize HDF5Array objects, but that's no big win because doing so is almost always a bad idea. Note that an exception in the near future will be objects that get created with something like HDF5Array("HubID:EH1039"), but these objects will already have their data in a persistent location, so they will be safe for serialization. OTOH I do worry that having the default "dump dir" be a persistent location would pollute the user's home directory with a bunch of files that were meant to be temporary.
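
To make point 1 a bit more concrete, here is a purely hypothetical sketch of such a hook. setBackendDumpDir() is made up for illustration and is not DelayedArray or HDF5Array API; only the HDF5Array branch maps to an existing function:

    ## Hypothetical helper, not existing API: block-processing code would call
    ## this right before realize() to redirect the chosen backend's on-disk
    ## output to a persistent, head-node-accessible location.
    setBackendDumpDir <- function(backend, dir) {
        dir.create(dir, showWarnings=FALSE, recursive=TRUE)
        switch(backend,
            HDF5Array=HDF5Array::setHDF5DumpDir(dir),  # exists in HDF5Array
            stop("no known way to set the dump dir for backend ", backend)
        )
        invisible(dir)
    }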

hpages commented 1 month ago

This got addressed last February in HDF5Array 1.31.4 (https://github.com/Bioconductor/HDF5Array/commit/48ba2888112efa57f4adda7b4b849cb4784a1ff1).