Bioconductor / HDF5Array

HDF5 backend for DelayedArray objects
https://bioconductor.org/packages/HDF5Array

How to create a huge on-disk array directly from R #41

Closed kokitsuyuzaki closed 3 years ago

kokitsuyuzaki commented 3 years ago

To analyze a huge multi-dimensional array, I looked at several on-disk implementations such as DelayedArray, HDF5Array, and TileDBArray. All of them seem to assume that the huge array is already stored in HDF5 or TileDB; if we want to create an on-disk array from within R, we can only create one small enough to fit in memory.

For example, in the code below, small_arr can be created but large_arr cannot, because the full array has to be built in memory first (10^10 doubles at 8 bytes each, about 80 GB) before it is written to disk and returned as an HDF5Array.

library("HDF5Array")
small_arr <- HDF5Array::writeHDF5Array(
    array(runif(10*20*30), dim=c(10,20,30)))
large_arr <- HDF5Array::writeHDF5Array(
    array(runif(10000*1000*1000), dim=c(10000,1000,1000)))

Would it be possible to create an HDF5 file and later define the size and values of the arrays to be stored in it as follows?

large_arr <- HDF5ArraySeed(
    filepath = "seed.h5",
    name = "arr"
    )
# size
dim(large_arr) <- c(10000,1000,1000)
# value
for(i in seq(10000)){
    large_arr[i,,] <- sample(1000*1000)
}

However, I found that HDF5Array (HDF5ArraySeed) seems to assume that the input object is an array: https://github.com/Bioconductor/HDF5Array/blob/5289a84f2fbeaa7610a73dcc1b704a25fe8cc1cd/R/HDF5ArraySeed-class.R#L7

vjcitn commented 3 years ago

IMHO this question is more relevant to rhdf5; I will mention TileDB at the end. You can initialize an HDF5 disk store with known or unlimited dimensions and then populate it piece by piece, as your system allows. If it is not clear how to do this with the instructions at, e.g., https://bioconductor.org/packages/release/bioc/vignettes/rhdf5/inst/doc/rhdf5.html#creating-an-hdf5-file-and-group-hierarchy, go as far as you can and then pose further questions at support.bioconductor.org, tagging them with rhdf5. This probably should be an FAQ. Once you have the HDF5 resource on disk, you can apply DelayedArray methods. For TileDB, it might be effective to pose the question at https://forum.tiledb.com/, but support.bioconductor.org is also an option.
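
For readers landing here later, a minimal sketch of the "initialize, then populate piece by piece" approach with rhdf5 might look like the following. The file path, the dataset name "arr", the chunk shape, and the scaled-down dimensions are assumptions chosen for illustration; they are not taken from the vignette or from this thread.

```r
library(rhdf5)
library(HDF5Array)

path <- tempfile(fileext = ".h5")  # hypothetical file location
dims <- c(100L, 50L, 40L)          # scaled down from c(10000, 1000, 1000)

h5createFile(path)

# Allocate the dataset on disk; nothing of this size is built in memory.
# One chunk per slab along the first dimension (an assumed layout).
h5createDataset(path, "arr", dims = dims, storage.mode = "double",
                chunk = c(1L, dims[2], dims[3]))

# Fill the dataset one slab at a time.
for (i in seq_len(dims[1])) {
    slab <- array(runif(dims[2] * dims[3]), dim = c(1L, dims[2], dims[3]))
    h5write(slab, path, "arr", index = list(i, NULL, NULL))
}
h5closeAll()

# Point a DelayedArray-compatible object at the finished dataset.
large_arr <- HDF5Array(path, "arr")
dim(large_arr)
```

Only one slab lives in memory at any time, so the same loop should scale to the original dimensions given enough disk space; the chunk shape is worth tuning to the access pattern you expect later.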

hpages commented 3 years ago

As Vince said, writing your own arbitrary data to an HDF5 file doesn't need to involve DelayedArray objects and can be easily achieved with plain use of the rhdf5 package. However, the DelayedArray/HDF5Array framework provides RealizationSink objects and the write_block() function to make this more convenient, and to abstract away the details of the particular backend being used (e.g. HDF5 file or TileDB). This helps make the code simpler, easier to understand, and portable across backends.

See ?write_block in the DelayedArray package for more information.
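
As a rough illustration of the RealizationSink + write_block() pattern, here is a sketch assuming an HDF5 sink with default file and dataset names, a small demo array, and a hand-picked regular grid of blocks. The dimensions and block spacings are assumptions for the example; ?write_block remains the authoritative reference.

```r
library(HDF5Array)  # also attaches DelayedArray

dims <- c(100L, 50L, 40L)  # scaled down from c(10000, 1000, 1000)

# Create an on-disk sink of the final size; no data is loaded yet.
sink <- HDF5RealizationSink(dim = dims, type = "double")

# Split the sink into blocks of 10 slabs each along the first dimension
# (an assumed block shape, chosen only for the demo).
grid <- RegularArrayGrid(dims, spacings = c(10L, dims[2], dims[3]))

# Generate and write one block at a time.
for (bid in seq_along(grid)) {
    viewport <- grid[[bid]]
    block <- array(runif(prod(dim(viewport))), dim = dim(viewport))
    sink <- write_block(sink, viewport, block)
}
close(sink)

# Turn the filled sink into a DelayedArray backed by the HDF5 file.
A <- as(sink, "DelayedArray")
dim(A)
```

Because only the sink constructor is backend-specific, swapping in another realization sink should leave the write loop unchanged.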

Please note that, whatever you use (the rhdf5 package directly or a RealizationSink object + write_block()), there's no requirement that the array be small enough to fit in memory. You should be able to create on-disk arrays of arbitrary size as long as you have enough space on your hard drive.

H.

kokitsuyuzaki commented 3 years ago

Thank you so much @vjcitn and @hpages

I'll check the documentation for rhdf5, RealizationSink, and write_block.

hpages commented 3 years ago

Great. I'm closing this. If you have further questions, please open a new issue (on the rhdf5 repo if the question is rhdf5 related), or ask on the Bioconductor support site. Thanks!