Bioconductor / HDF5Array

HDF5 backend for DelayedArray objects
https://bioconductor.org/packages/HDF5Array

Storing raw-Rle with RleArray on a HDF5Array backend #23

Closed rcastelo closed 4 years ago

rcastelo commented 4 years ago

hi, i'm interested in storing raw-Rle vectors as RleArray objects using an HDF5Array backend. if i just try that out of the box, i get an error saying that the datatype raw is not yet implemented:

library(HDF5Array) ## also attaches DelayedArray and S4Vectors

## create a long raw-Rle vector of random non-negative integer values 0:255
set.seed(123)
len <- sample(1:10000, size=10000, replace=TRUE)
val <- sample(0:255, size=10000, replace=TRUE)
rawrle <- Rle(as.raw(rep(val, len)))
saveRDS(rawrle, file="rawrle.rds")
file.size("rawrle.rds") ## the object takes about 35Kb as RDS file
[1] 35516

## create 'RleArray' object from raw
rawrlearr <- RleArray(rawrle, dim=length(rawrle))

## create 'HDF5Array' file from 'RleArray' storing raw values
rawrlearrh5 <- writeHDF5Array(rawrlearr, "rawrlearr.h5", "rlearr")
Error in .setDataType(H5type, storage.mode, size) : 
  datatype raw not yet implemented.
Try 'logical', 'double', 'integer', 'integer64' or 'character'.

an obvious alternative is to store those values using integers:

## coerce the raw-Rle vector to integer-Rle
intrle <- rawrle
runValue(intrle) <- as.integer(runValue(intrle))

## create 'RleArray' object from integer
intrlearr <- RleArray(intrle, dim=length(intrle))

## create 'HDF5' file from integer
intrlearrh5 <- writeHDF5Array(intrlearr, "intrlearr.h5", "rlearr")
file.size("intrlearr.h5") ## the HDF5 file takes about 225Kb
[1] 230327

and converting those integers to raw "on the fly" when reading from the HDF5 file:

int2rawrlearr <- function(obj, pos) Rle(as.raw(obj[pos]))
stopifnot(identical(int2rawrlearr(intrlearrh5, 1:1000000), rawrle[1:1000000]))

however, a 'raw' value in R is in fact nothing more than a char in the underlying C code, so it should be possible to store the 'raw' data type using chars:

## coerce the raw-Rle vector to character-Rle
charrle <- rawrle
runValue(charrle) <- rawToChar(runValue(charrle), multiple=TRUE)

## create 'RleArray' object from character
charrlearr <- RleArray(charrle, dim=length(charrle))

## create 'HDF5' file from character
charrlearrh5 <- writeHDF5Array(charrlearr, "charrlearr.h5", "rlearr")
file.size("charrlearr.h5") ## the HDF5 file takes about 465Kb
[1] 475970

and converting those chars to raw "on the fly" when reading from the HDF5 file:

asc <- function(char, simplify=TRUE)
         sapply(char, function(x) strtoi(charToRaw(x), 16L), simplify = simplify)
char2rawrlearr <- function(obj, pos) Rle(as.raw(asc(obj[pos])))
stopifnot(identical(char2rawrlearr(charrlearrh5, 1:1000000), rawrle[1:1000000]))

however, on the one hand, the char-based HDF5 file, where in principle each value occupies only one byte, is twice as large as the integer-based one, just the opposite of what i was expecting.

on the other hand, the conversion on the fly for this char-based HDF5 file is much more costly since it involves a sapply() operation.
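One way to avoid the per-element sapply() call can be sketched in base R alone: since every element is a single-byte string, the whole vector can be pasted into one string and handed to charToRaw() once. This is only an illustration, not something HDF5Array provides. `char2raw_vec` is a hypothetical helper name, and the sketch assumes there are no 0x00 bytes (those would become empty strings and vanish in the paste). The demo sticks to ASCII bytes; handling of bytes above 127 may depend on the locale and should be checked separately.

```r
## hypothetical helper: convert a character vector of single-byte strings
## to raw in one vectorized call (assumes no empty strings, i.e. no 0x00 bytes)
char2raw_vec <- function(chars) charToRaw(paste(chars, collapse=""))

## per-element version from the post above, kept for comparison
asc <- function(char, simplify=TRUE)
    sapply(char, function(x) strtoi(charToRaw(x), 16L), simplify=simplify)

## small demo restricted to ASCII bytes 1..127
set.seed(123)
bytes <- as.raw(sample(1:127, size=1000, replace=TRUE))
chars <- rawToChar(bytes, multiple=TRUE)

## both conversions recover the original raw vector
stopifnot(identical(char2raw_vec(chars), bytes))
stopifnot(identical(as.raw(unname(asc(chars))), bytes))
```

The vectorized version makes a single call into C regardless of the vector length, whereas the sapply() version calls an R closure once per element.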

would it be possible to extend HDF5Array to efficiently store and access the 'raw' datatype from R?

i also read with interest the closed issue on having an actual Rle representation in the HDF5 file, but i understand this is a complicated issue. in any case, if you ever consider adding it, you have my support :)

hpages commented 4 years ago

Hi Robert,

This is done in HDF5Array 1.15.4 (commit 18e667bd). With the following gotcha (the write_block() method for HDF5RealizationSink objects is what writeHDF5Array() uses internally to write blocks of data to the disk): https://github.com/Bioconductor/HDF5Array/blob/cc38df276943ecc6a0801febd032175dbb2b3be8/R/writeHDF5Array.R#L121-L140

At some point someone would need to submit a patch to rhdf5 to enable full "raw" support there but I don't have time to work on this at the moment.

Let me know how that works for you.

H.

rcastelo commented 4 years ago

Thanks Hervé for your quick fix, it works like a charm. I looked at the code of rhdf5 and filed a pull request with a fix, which i checked was working by commenting out the coercion to integer that you introduced. however, the resulting HDF5 file still has the same size as the one produced by coercing to integer, so i'm unsure whether my pull request entirely resolves the support for the 'raw' datatype. let's see whether the maintainers of rhdf5 have a further update.

hpages commented 4 years ago

The coercion to integer is not expected to blow up the size of the file, only the memory footprint of the block being written. When the block data hits the disk, it's still written as H5T_STD_U8LE, that is, 1 byte is used for each array value. Or less if you use chunking + compression.
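The memory-footprint side of this can be illustrated with a minimal base R sketch. It does not touch HDF5 at all: it only compares the in-memory size of a block held as raw with the same block coerced to integer, and checks that the coerced values still fit in an unsigned 8-bit type. The block length `n` and the variable names are arbitrary choices for the illustration.

```r
set.seed(123)
n <- 1e6
raw_block <- as.raw(sample(0:255, size=n, replace=TRUE))
int_block <- as.integer(raw_block)

## in-memory size in bytes: roughly n for raw, roughly 4 * n for integer
raw_bytes <- as.numeric(object.size(raw_block))
int_bytes <- as.numeric(object.size(int_block))
print(c(raw=raw_bytes, integer=int_bytes))

## the values are unchanged by the coercion and all fit in 8 unsigned bits
stopifnot(all(int_block >= 0L & int_block <= 255L))
stopifnot(identical(as.raw(int_block), raw_block))
```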

I think your PR is missing a few things. For example rhdf5::h5createDataset() also needs to be fixed to use the right H5 type when storage.mode is set to "raw". See the workaround I use in HDF5Array. Since h5write() calls h5createDataset() internally, bad things will probably happen if h5createDataset() does not support storage.mode="raw".

There might be other places that need to be looked at. Ideally the PR should include unit tests that provide good coverage of the "raw" situation.

Thanks!

rcastelo commented 4 years ago

How can i use "chunking + compression" to reduce the size of the file? could you point me to some documentation?

I'm not familiar with internals such as the "right H5 type". isn't that what you described as `H5T_STD_U8LE`?

i'm not surprised that my PR is incomplete, because i just fixed the error and this was the first time i looked at the source of rhdf5. @grimbough said he'll have a look at the PR in the next few days, so hopefully he will find all the necessary updates.

hpages commented 4 years ago

See ?h5createDataset for how to control chunking and compression level. That's if you use rhdf5 directly. If you use HDF5Array, see ?writeHDF5Array.
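As a rough sketch of what this could look like with HDF5Array, the snippet below writes the same matrix twice, once without compression and once with the highest gzip level, and compares the file sizes. It assumes the `chunkdim` and `level` arguments described in ?writeHDF5Array; the matrix, the chunk dimensions and the dataset name "x" are arbitrary choices for the illustration, and the code only runs if HDF5Array is installed.

```r
## sketch only: requires the Bioconductor package HDF5Array
if (requireNamespace("HDF5Array", quietly=TRUE)) {
    library(HDF5Array)

    ## matrix with long runs of repeated values, similar in spirit to Rle data
    set.seed(123)
    x <- matrix(rep(sample(0:255, size=100, replace=TRUE), each=10000),
                nrow=1000, ncol=1000)

    f0 <- tempfile(fileext=".h5")
    f9 <- tempfile(fileext=".h5")

    ## same data and chunk geometry, no compression vs highest compression
    writeHDF5Array(x, f0, "x", chunkdim=c(100L, 100L), level=0L)
    writeHDF5Array(x, f9, "x", chunkdim=c(100L, 100L), level=9L)

    ## file sizes in bytes, uncompressed first
    print(file.size(c(f0, f9)))
}
```

How much the compressed file shrinks depends on the data and on the chunk geometry, so it is worth trying a few chunk dimensions on real data.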

By "right H5 type" I meant an H5 type that matches the size of "raw" elements in R i.e. an 8-bit type.

rcastelo commented 4 years ago

Thanks @hpages for your help with this!!!

hpages commented 4 years ago

No problem. It's great to have good support for "raw" data in rhdf5/HDF5Array.