Bioconductor / HDF5Array

HDF5 backend for DelayedArray objects
https://bioconductor.org/packages/HDF5Array
9 stars 13 forks source link

Interpreting `NATIVE_UINT8` as integer in sparse matrices #56

Open LTLA opened 1 year ago

LTLA commented 1 year ago

Sometimes I store sparse counts in the HDF5 file as unsigned 8-bit integers to save some space. This is fine but the subsequent H5SparseMatrix instance is not able to participate in arithmetic operations:

library(rhdf5)
y <- abs(round(Matrix::rsparsematrix(100, 100, 0.01) * 10))

tmp <- tempfile(fileext=".h5")
h5createFile(tmp)
h5createGroup(tmp, "matrix")
h5createDataset(tmp, "matrix/data", length(y@x), H5type="H5T_NATIVE_UINT8")
h5write(y@i, tmp, "matrix/indices", length(y@i))
h5write(y@p, tmp, "matrix/indptr", length(y@p))

fhandle <- H5Fopen(tmp)
ghandle <- H5Gopen(fhandle, "matrix")
h5writeDataset(y@x, ghandle, "data")
H5Gclose(ghandle)
H5Fclose(fhandle)

library(HDF5Array)
seed <- H5SparseMatrixSeed(tmp, "matrix", dim=c(100, 100), sparse.layout="csc")
mat <- DelayedArray(seed)
type(mat)
## [1] "raw"

mat + 1
## Error in h(simpleError(msg, call)) :
##   error in evaluating the argument 'x' in selecting a method for function 'type': non-numeric argument to binary operator

This might be easily solved with a type= option, just like in the HDF5ArraySeed constructor for the dense case.

Or even better, a dedicated as.integer= option that treats all HDF5 integer types as R integers. This would allow me to just set as.integer=TRUE and everything should work; otherwise, even with a type= option, I need to first create the HDF5ArraySeed with default arguments, check if it's type(mat) == "raw", and then create it again with type="integer". (Presumably, I can't just set type="integer" all the time, otherwise bad things will happen for floating-point matrices.)

Session information ``` R version 4.3.0 Patched (2023-05-04 r84398) Platform: aarch64-apple-darwin22.3.0 (64-bit) Running under: macOS Ventura 13.2.1 Matrix products: default BLAS: /Users/luna/Software/R/R-4-3-branch/lib/libRblas.dylib LAPACK: /Users/luna/Software/R/R-4-3-branch/lib/libRlapack.dylib; LAPACK version 3.11.0 locale: [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8 time zone: America/Los_Angeles tzcode source: internal attached base packages: [1] stats4 stats graphics grDevices utils datasets methods [8] base other attached packages: [1] HDF5Array_1.29.3 DelayedArray_0.27.9 SparseArray_1.1.10 [4] S4Arrays_1.1.4 IRanges_2.35.2 S4Vectors_0.39.1 [7] MatrixGenerics_1.13.0 matrixStats_1.0.0 BiocGenerics_0.47.0 [10] Matrix_1.5-4.1 rhdf5_2.45.0 loaded via a namespace (and not attached): [1] zlibbioc_1.47.0 lattice_0.21-8 rhdf5filters_1.13.3 [4] XVector_0.41.1 Rhdf5lib_1.23.0 grid_4.3.0 [7] compiler_4.3.0 tools_4.3.0 crayon_1.5.2 ```
hpages commented 9 months ago

Hi @LTLA,

Sorry for the slow response on this. Note that you can change the type of a DelayedArray object anytime with:

if (type(mat) == "raw")
    type(mat) <- "integer"

In particular, you don't need to re-create the HDF5ArraySeed or H5SparseMatrixSeed object if its type turns out to be raw.

type<- is a delayed op so a big advantage of this approach is that it's completely generic i.e. it works on any DelayedArray object independently of the kind of seed (e.g. HDF5ArraySeed, H5SparseMatrixSeed, TileDBArraySeed, SparseArray, etc...), and regardless of whether it carries delayed operations or not, or whether it's sparse or not. It always works!

Would that be good enough or do you feel strongly about adding the as.integer argument to the H5SparseMatrixSeed() constructor? There might be a very small performance benefit in doing this, at least in theory, because the as.integer argument would then be passed down to h5mread(), which should be sligthly more efficient at turning those H5T_NATIVE_UINT8 values into integers. However I doubt it would be noticeable in practice. Also this change would require to add the as_integer slot to the H5SparseMatrixSeed objects so that they can "remember" the value of the as.integer argument.

Finally note that there would be no reason to not make similar changes to the HDF5ArraySeed() and HDF5Array() constructors, and to the HDF5ArraySeed class, for consistency.

H.

LTLA commented 9 months ago

Would that be good enough or do you feel strongly about adding the as.integer argument to the H5SparseMatrixSeed() constructor?

Yes, that would solve my immediate problem (of getting an integer sparse matrix).

However, this might cause some performance issues down the line... but not for the reasons you've stated. In my various frameworks (e.g., alabaster.xxx, chihaya), I can perform some optimizations if I detect that the DelayedArray is pristine. For example, I could consider creating a symlink to the underlying HDF5 file when saving a HDF5Array via alabaster.base::stageObject(), rather than performing a round of block-processing to load and save data in a new file.

If I use the type<- method, the DelayedArray is no longer pristine as it gets a storage.mode<- delayed op in a DelayedUnaryIsoOpStack layer. This precludes the optimization mentioned above. Of course, I can get around this by more deeply inspecting the DelayedArray's operations to ignore certain operations. This is a little tedious... which is acceptable, and normally I'd just go ahead and handle it on my end. But then I noticed that HDF5ArraySeed already had a type= option, so it seems like it wouldn't hurt to do the same for H5SparseMatrixSeed for consistency.

(Also having thought about it some more, type= would be fine. I don't really need an as_integer option as I already know the R type I want to use to interpret the HDF5 dataset.)

Finally note that there would be no reason to not make similar changes to the HDF5ArraySeed() and HDF5Array() constructors, and to the HDF5ArraySeed class, for consistency.

Yes, that's what I was thinking, given that both of these already have a type= option in their constructors, whereas the sparse matrices do not.

LTLA commented 7 months ago

FWIW I recently encountered a related problem, where my data was stored as a H5T_NATIVE_UINT32 with values that exceeded .Machine$integer.max. I would have liked to instruct H5SparseMatrix to load them as doubles, but currently the code just silently caps the values at .Machine$integer.max.

library(HDF5Array)

m0 <- matrix(0, nrow=25, ncol=12,
    dimnames=list(letters[1:25], LETTERS[1:12]))
m0[cbind(2:24, c(12:1, 2:12))] <- 100 + sample(55, 23, replace=TRUE)
out_file <- tempfile()
M0 <- writeTENxMatrix(m0, out_file, group="m0")

# Let's make the data more exciting.
fhandle <- H5Fopen(out_file, "H5F_ACC_RDWR")
ghandle <- H5Gopen(fhandle, "m0")
H5Ldelete(ghandle, "data")

shandle <- H5Screate_simple(23)
dhandle <- H5Dcreate(ghandle, "data", dtype_id="H5T_NATIVE_UINT32", h5space=shandle)
H5Dwrite(dhandle, .Machine$integer.max + as.double(1:23))

H5Dclose(dhandle)
H5Sclose(shandle)
H5Gclose(ghandle)
H5Fclose(fhandle)

H5SparseMatrix(out_file, "m0")
## <25 x 12> sparse H5SparseMatrix object of type "integer":
##        [,1]  [,2]  [,3]  [,4] ...       [,9]      [,10]      [,11]      [,12]
##  [1,]     0     0     0     0   .          0          0          0          0
##  [2,]     0     0     0     0   .          0          0          0 2147483647
##  [3,]     0     0     0     0   .          0          0 2147483647          0
##  [4,]     0     0     0     0   .          0 2147483647          0          0
##  [5,]     0     0     0     0   . 2147483647          0          0          0
##   ...     .     .     .     .   .          .          .          .          .
## [21,]     0     0     0     0   . 2147483647          0          0          0
## [22,]     0     0     0     0   .          0 2147483647          0          0
## [23,]     0     0     0     0   .          0          0 2147483647          0
## [24,]     0     0     0     0   .          0          0          0 2147483647
## [25,]     0     0     0     0   .          0          0          0          0