rcastelo closed this issue 4 years ago.
Hi Robert,
This is done in HDF5Array 1.15.4 (commit 18e667bd), with the following gotcha (the write_block() method for HDF5RealizationSink objects is what writeHDF5Array() uses internally to write blocks of data to disk): https://github.com/Bioconductor/HDF5Array/blob/cc38df276943ecc6a0801febd032175dbb2b3be8/R/writeHDF5Array.R#L121-L140
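In rough outline, the gotcha looks like this (a simplified sketch, not the actual HDF5Array code; write_raw_sketch is a made-up name, but the H5type= argument of h5createDataset() is real):

```r
library(rhdf5)

## Sketch of the workaround: the dataset is created with an 8-bit
## unsigned H5 type, but the block data must be coerced to integer in
## memory because rhdf5 cannot yet write type "raw" directly.
write_raw_sketch <- function(filepath, name, x) {
  stopifnot(is.raw(x))
  h5createFile(filepath)
  h5createDataset(filepath, name, dims = length(x),
                  H5type = "H5T_STD_U8LE")
  ## 4x memory blow-up in R, but still 1 byte per value on disk
  h5write(as.integer(x), filepath, name)
}
```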
At some point someone would need to submit a patch to rhdf5 to enable full "raw"
support there but I don't have time to work on this at the moment.
Let me know how that works for you.
H.
Thanks Hervé for your quick fix, it works like a charm. I looked up the code of rhdf5 and filed a pull request with a fix, which I checked was working by commenting out the coercion to integer that you introduced. However, the resulting HDF5 file still has the same size as the one produced by coercing to integer, so I'm unsure my pull request entirely resolves support for the 'raw' datatype. Let's see whether the maintainers of rhdf5 have a further update.
The coercion to integer is not expected to blow up the size of the file, only the memory footprint of the block being written. When the block data hit the disk, they are still written as H5T_STD_U8LE, that is, 1 byte per array value. Or less if you use chunking + compression.
I think your PR is missing a few things. For example, rhdf5::h5createDataset() also needs to be fixed to use the right H5 type when storage.mode is set to "raw". See the workaround I use in HDF5Array. Since h5write() calls h5createDataset() internally, bad things will probably happen if h5createDataset() does not support storage.mode="raw".

There might be other places that need to be looked at. Ideally the PR should include unit tests that provide good coverage of the "raw" situation.
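Such a unit test might look roughly like this (a hypothetical sketch using testthat; it assumes the PR has added "raw" support to h5write(), so it would only pass once that support is in place):

```r
library(testthat)
library(rhdf5)

## Hypothetical round-trip test for "raw" support in rhdf5.
test_that("raw vectors round-trip through h5write/h5read", {
  fname <- tempfile(fileext = ".h5")
  x <- as.raw(0:255)
  h5createFile(fname)
  h5write(x, fname, "rawdata")  # assumes the PR adds raw support
  expect_identical(as.raw(h5read(fname, "rawdata")), x)
})
```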
Thanks!
How can I use "chunking + compression" to reduce the size of the file? Could you point me to some documentation? I'm not familiar with internals such as the "right H5 type"; isn't it what you described as H5T_STD_U8LE?

I'm not surprised that my PR is incomplete, because I just fixed the error and this was the first time I looked at the source of rhdf5. @grimbough said he'll have a look at the PR in the next few days, so hopefully he'll find all the necessary updates.
See ?h5createDataset for how to control chunking and compression level. That's if you use rhdf5 directly. If you use HDF5Array, see ?writeHDF5Array.

By "right H5 type" I meant an H5 type that matches the size of "raw" elements in R, i.e. an 8-bit type.
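For instance, a minimal sketch of enabling chunking and compression on such an 8-bit dataset (chunk= and level= are actual h5createDataset() arguments; the dataset name and sizes are arbitrary):

```r
library(rhdf5)

## Sketch: create an 8-bit dataset with 65536-element chunks and gzip
## compression level 9, then write integer-coerced raw data into it.
fname <- tempfile(fileext = ".h5")
h5createFile(fname)
h5createDataset(fname, "x", dims = 1e6,
                H5type = "H5T_STD_U8LE",
                chunk = 65536, level = 9)
h5write(as.integer(as.raw(rep(0:255, length.out = 1e6))), fname, "x")
```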
Thanks @hpages for your help with this!!!
No problem. It's great to have good support for "raw"
data in rhdf5/HDF5Array.
Hi, I'm interested in storing raw-Rle vectors as RleArray objects using an HDF5Array backend. If I just try that out of the box, I get an error that the datatype raw is not yet implemented. An obvious alternative is to store those values using integers, and to convert those integers to raw "on the fly" when reading from the HDF5 file. In fact, a 'raw' type in R is nothing else than a char in the underlying C language, so it should be possible to store the 'raw' data type using chars, again converting those chars to raw "on the fly" when reading from the HDF5 file.
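The integer-based route mentioned above can be sketched as follows (assuming rhdf5; note that the coercion back to raw is a single vectorized as.raw() call, with no per-element sapply() loop):

```r
library(rhdf5)

## Sketch of the integer-based workaround: store the raw values as
## integers, then restore type "raw" with the vectorized as.raw().
fname <- tempfile(fileext = ".h5")
x <- as.raw(sample(0:255, 1000, replace = TRUE))

h5createFile(fname)
h5write(as.integer(x), fname, "x")   # integers on disk

y <- as.raw(h5read(fname, "x"))      # raw again, "on the fly"
stopifnot(identical(y, x))
```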
However, on the one hand, the char-based HDF5 file, where in principle each value occupies only one byte, is twice as large as the integer-based one, just the opposite of what I was expecting. On the other hand, the on-the-fly conversion for the char-based HDF5 file is much more costly since it involves a sapply() operation. Would it be possible to extend HDF5Array to efficiently store and access the 'raw' datatype from R?

I've also read with interest the closed issue on having an actual Rle representation in the HDF5 file, but I understand this is a complicated issue. In any case, if you ever consider adding it, you have my support :)