Closed ekernf01 closed 3 years ago
Hi @ekernf01 ,
You want to write your own arbitrary data to an HDF5 file. This doesn't need to involve DelayedArray objects and can easily be done with plain use of the rhdf5 package. However, the DelayedArray/HDF5Array framework provides RealizationSink objects to make this more convenient and to abstract away the details of the particular backend being used (e.g. HDF5 file or TileDB). This helps make the code simpler, easier to understand, and portable across backends.
See ?write_block in the DelayedArray package for more information. I think that the first example (USING THE "RealizationSink API": EXAMPLE 1) in the examples section does something close to what you are trying to achieve, so hopefully it will be easy to adapt to your particular use case.
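For readers following along, here is a minimal sketch of that block-by-block approach applied to the original question. It is not the example from ?write_block itself, only an adaptation: the helper name fill_normal_hdf5 is hypothetical, and it assumes a recent DelayedArray where defaultSinkAutoGrid() is available and write_block() returns the updated sink.

```r
library(HDF5Array)  # also attaches DelayedArray

# Hypothetical helper: fill an HDF5-backed array of the given dimensions
# with iid standard Normal draws, one block at a time.
fill_normal_hdf5 <- function(dims) {
    # The sink writes to an automatically chosen HDF5 dump file by default.
    sink <- HDF5RealizationSink(as.integer(dims), type="double")
    # Grid of viewports covering the sink; each viewport is one block.
    sink_grid <- defaultSinkAutoGrid(sink)
    for (bid in seq_along(sink_grid)) {
        viewport <- sink_grid[[bid]]
        # Only one block of random draws is held in memory at a time.
        block <- array(rnorm(prod(dim(viewport))), dim=dim(viewport))
        sink <- write_block(sink, viewport, block)
    }
    close(sink)
    as(sink, "DelayedArray")
}

# Small demonstration; the question's case would be c(1000, 100000).
A <- fill_normal_hdf5(c(50L, 200L))
```

Swapping HDF5RealizationSink() for another backend's sink constructor should leave the loop unchanged, which is the portability point made above.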
H.
Thanks, I adapted that RealizationSink example and it works really well. If I want to just use rhdf5 in the future, can the DelayedArray package read any hdf5 file, or are there certain expectations that have to be met? I don't know much about hdf5 yet, so please forgive me if it's an ignorant question, but I'm asking because in the past I have had some trouble writing hdf5 files with one scRNA package and then trying to read them with another.
The DelayedArray package implements all the backend-agnostic stuff used by DelayedArray objects in general, so it is not geared specifically towards hdf5 datasets.
The HDF5Array() constructor in the HDF5Array package should be able to read most hdf5 datasets. There are no particular expectations to be met. However, the performance of the HDF5Array object will depend a lot on some important parameters (chunk geometry, compression level, storage type, etc.) that control how the dataset is physically stored on disk. All these parameters need to be decided ahead of time, when the dataset is written to disk. The chunk geometry is probably the most important one, and the best geometry will ultimately depend on the typical access pattern of your downstream analysis.
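As a rough illustration of the "plain rhdf5, then HDF5Array()" route, the sketch below writes a small matrix with rhdf5 and reads it back as an HDF5Array. The dataset name "counts" and the temporary file are placeholders chosen for the example, not anything prescribed by either package.

```r
library(rhdf5)
library(HDF5Array)

path <- tempfile(fileext=".h5")   # placeholder file location
m <- matrix(rnorm(20 * 30), nrow=20)

# Write with plain rhdf5, no DelayedArray machinery involved.
h5createFile(path)
h5write(m, path, "counts")        # "counts" is an arbitrary dataset name

# Point an HDF5Array at the dataset; the data stay on disk until needed.
A <- HDF5Array(path, name="counts")
dim(A)
```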
These parameters are documented in ?writeHDF5Array in the HDF5Array package. The HDF5RealizationSink() constructor has the same arguments as the writeHDF5Array() function.
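To make the chunk geometry point concrete, here is a small sketch that writes the same matrix twice with different chunkdim values. The chunk shapes, the compression level, and the dataset names "by_col" and "by_row" are illustrative assumptions only; the right choice depends on how the data will be accessed.

```r
library(HDF5Array)

x <- matrix(rnorm(200 * 1000), nrow=200)
path <- tempfile(fileext=".h5")   # placeholder file location

# Tall, narrow chunks: each chunk holds 10 full columns,
# which tends to suit column-wise access.
by_col <- writeHDF5Array(x, filepath=path, name="by_col",
                         chunkdim=c(200L, 10L), level=6L)

# Short, wide chunks: each chunk holds 10 full rows,
# which tends to suit row-wise access.
by_row <- writeHDF5Array(x, filepath=path, name="by_row",
                         chunkdim=c(10L, 1000L), level=6L)

# Inspect the chunk geometry actually used on disk.
chunkdim(by_col)
chunkdim(by_row)
```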
Thanks! After a little trouble with locking, I got that to work too now. I will close the issue.
FWIW, the original question can also be answered with:
library(DelayedRandomArray) # see https://github.com/LTLA/DelayedRandomArray
randnorm <- RandomNormArray(c(1000, 100000))
library(HDF5Array)
writeHDF5Array(randnorm, filepath="foo.h5", name="bar") # generates a pretty large file; not very compressible.
Takes about 20 seconds for me.
Hi DelayedArray devs, how would you fill a 1000 by 100,000 DelayedArray or HDF5Array with iid standard Normal draws? Here's what I have tried.
simplify. Seems to make no difference to the too-big tree.
realize with HDF5Array backend. This works but seems slow.
Thanks in advance for considering this request. The package is awesome -- easy to use and valuable.