Bioconductor / DelayedArray

A unified framework for working transparently with on-disk and in-memory array-like datasets
https://bioconductor.org/packages/DelayedArray

Filling large DelayedArray with iid standard normals #89

Closed ekernf01 closed 3 years ago

ekernf01 commented 3 years ago

Hi DelayedArray devs, how would you fill a 1000 by 100,000 DelayedArray or HDF5Array with iid standard Normal draws? Here's what I have tried.

Thanks in advance for considering this request. The package is awesome -- easy to use and valuable.

hpages commented 3 years ago

Hi @ekernf01,

You want to write your own arbitrary data to an HDF5 file. This doesn't need to involve DelayedArray objects and can easily be achieved with plain use of the rhdf5 package. However, the DelayedArray/HDF5Array framework provides RealizationSink objects to make this more convenient, and to abstract away the details of the particular backend being used (e.g. HDF5 file or TileDB). This helps make the code simpler, easier to understand, and portable across backends.

See ?write_block in the DelayedArray package for more information. I think that the first example (USING THE "RealizationSink API": EXAMPLE 1) in the examples section does something close to what you are trying to achieve, so hopefully it will be easy to adapt to your particular use case.

H.
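
For readers who land here later, the RealizationSink approach described above might look roughly like the sketch below. It is not the example from ?write_block itself. `fill_with_normals` is a hypothetical helper name, the small dimensions are chosen only so the snippet runs quickly, and it assumes a DelayedArray version recent enough to provide `defaultSinkAutoGrid()`.

```r
library(HDF5Array)  # also attaches DelayedArray

# Hypothetical helper (not part of any package): fill an HDF5-backed sink
# of the given dimensions with iid N(0, 1) draws, one block at a time.
fill_with_normals <- function(dims) {
    sink <- HDF5RealizationSink(dim = as.integer(dims), type = "double")
    grid <- defaultSinkAutoGrid(sink)  # grid of blocks covering the sink
    for (bid in seq_along(grid)) {
        viewport <- grid[[bid]]
        # Only one block of random draws is held in memory at any time.
        block <- array(rnorm(prod(dim(viewport))), dim = dim(viewport))
        sink <- write_block(sink, viewport, block)
    }
    close(sink)
    as(sink, "DelayedArray")
}

set.seed(1)
# Small size for a quick run; use c(1000L, 100000L) for the size asked about.
A <- fill_with_normals(c(100L, 2000L))
dim(A)
```

Because the data is generated and written block by block, memory use is bounded by the block size rather than by the full size of the array.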

ekernf01 commented 3 years ago

Thanks, I adapted that RealizationSink example and it works really well. If I want to just use rhdf5 in the future, can the DelayedArray package read any hdf5 file, or are there certain expectations that have to be met? I don't know much about hdf5 yet, so please forgive me if it's an ignorant question, but I'm asking because in the past I have had some trouble writing hdf5 files with one scRNA package and then trying to read them with another.

hpages commented 3 years ago

The DelayedArray package implements all the backend-agnostic functionality used by DelayedArray objects in general, so it is not geared specifically towards HDF5 datasets.

The HDF5Array() constructor in the HDF5Array package should be able to read most HDF5 datasets. There are no particular expectations to be met. However, the performance of the HDF5Array object will depend a lot on parameters such as chunk geometry, compression level, and storage type, which control how the dataset is physically stored on disk. All of these parameters need to be decided ahead of time, when the dataset is written to disk. The chunk geometry is probably the most important one, and the best geometry will ultimately depend on the typical access pattern of your downstream analysis.

These parameters are documented in ?writeHDF5Array in the HDF5Array package. The HDF5RealizationSink() constructor has the same arguments as the writeHDF5Array() function.
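
As a rough illustration of the two points above, the sketch below first writes a dataset with plain rhdf5 and wraps it with HDF5Array(), then writes the same data through writeHDF5Array() with an explicit chunk geometry. The dataset names ("raw", "by_col"), the matrix size, and the chunk shape are arbitrary choices for the example, not recommendations.

```r
library(rhdf5)
library(HDF5Array)

path <- tempfile(fileext = ".h5")
m <- matrix(rnorm(200 * 500), nrow = 200)

# Write with plain rhdf5, then wrap the on-disk dataset without loading it.
h5createFile(path)
h5write(m, path, "raw")
A <- HDF5Array(path, "raw")

# Write the same data through HDF5Array, fixing the chunk geometry up front:
# 200 x 10 chunks hold 10 full columns each, which favors column-wise access.
B <- writeHDF5Array(m, filepath = path, name = "by_col",
                    chunkdim = c(200L, 10L), level = 6L)
chunkdim(B)
```

If the downstream analysis mostly reads whole rows instead, a wide and short chunk shape such as `c(10L, 500L)` would be the analogous choice.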

ekernf01 commented 3 years ago

Thanks! After a little trouble with locking, I got that to work too now. I will close the issue.

LTLA commented 3 years ago

FWIW, the original question can also be answered with:

library(DelayedRandomArray) # see https://github.com/LTLA/DelayedRandomArray
randnorm <- RandomNormArray(c(1000, 100000))

library(HDF5Array)
# writeHDF5Array() takes 'filepath' and 'name' for the file and the dataset.
writeHDF5Array(randnorm, filepath="foo.h5", name="bar") # generates a pretty large file; not very compressible.

Takes about 20 seconds for me.