Open LTLA opened 1 year ago
Hi @LTLA,
Sorry for the slow response on this. Note that you can change the type of a DelayedArray object anytime with:
if (type(mat) == "raw")
type(mat) <- "integer"
In particular, you don't need to re-create the HDF5ArraySeed or H5SparseMatrixSeed object if its type turns out to be raw
.
type<-
is a delayed op so a big advantage of this approach is that it's completely generic i.e. it works on any DelayedArray object independently of the kind of seed (e.g. HDF5ArraySeed, H5SparseMatrixSeed, TileDBArraySeed, SparseArray, etc...), and regardless of whether it carries delayed operations or not, or whether it's sparse or not. It always works!
Would that be good enough or do you feel strongly about adding the as.integer
argument to the H5SparseMatrixSeed()
constructor? There might be a very small performance benefit in doing this, at least in theory, because the as.integer
argument would then be passed down to h5mread()
, which should be sligthly more efficient at turning those H5T_NATIVE_UINT8 values into integers. However I doubt it would be noticeable in practice. Also this change would require to add the as_integer
slot to the H5SparseMatrixSeed objects so that they can "remember" the value of the as.integer
argument.
Finally note that there would be no reason to not make similar changes to the HDF5ArraySeed()
and HDF5Array()
constructors, and to the HDF5ArraySeed class, for consistency.
H.
Would that be good enough or do you feel strongly about adding the
as.integer
argument to theH5SparseMatrixSeed()
constructor?
Yes, that would solve my immediate problem (of getting an integer sparse matrix).
However, this might cause some performance issues down the line... but not for the reasons you've stated. In my various frameworks (e.g., alabaster.xxx, chihaya), I can perform some optimizations if I detect that the DelayedArray
is pristine. For example, I could consider creating a symlink to the underlying HDF5 file when saving a HDF5Array
via alabaster.base::stageObject()
, rather than performing a round of block-processing to load and save data in a new file.
If I use the type<-
method, the DelayedArray
is no longer pristine as it gets a storage.mode<-
delayed op in a DelayedUnaryIsoOpStack
layer. This precludes the optimization mentioned above. Of course, I can get around this by more deeply inspecting the DelayedArray
's operations to ignore certain operations. This is a little tedious... which is acceptable, and normally I'd just go ahead and handle it on my end. But then I noticed that HDF5ArraySeed
already had a type=
option, so it seems like it wouldn't hurt to do the same for H5SparseMatrixSeed
for consistency.
(Also having thought about it some more, type=
would be fine. I don't really need an as_integer
option as I already know the R type I want to use to interpret the HDF5 dataset.)
Finally note that there would be no reason to not make similar changes to the
HDF5ArraySeed()
andHDF5Array()
constructors, and to the HDF5ArraySeed class, for consistency.
Yes, that's what I was thinking, given that both of these already have a type=
option in their constructors, whereas the sparse matrices do not.
FWIW I recently encountered a related problem, where my data
was stored as a H5T_NATIVE_UINT32
with values that exceeded .Machine$integer.max
. I would have liked to instruct H5SparseMatrix
to load them as doubles, but currently the code just silently caps the values at .Machine$integer.max
.
library(HDF5Array)
m0 <- matrix(0, nrow=25, ncol=12,
dimnames=list(letters[1:25], LETTERS[1:12]))
m0[cbind(2:24, c(12:1, 2:12))] <- 100 + sample(55, 23, replace=TRUE)
out_file <- tempfile()
M0 <- writeTENxMatrix(m0, out_file, group="m0")
# Let's make the data more exciting.
fhandle <- H5Fopen(out_file, "H5F_ACC_RDWR")
ghandle <- H5Gopen(fhandle, "m0")
H5Ldelete(ghandle, "data")
shandle <- H5Screate_simple(23)
dhandle <- H5Dcreate(ghandle, "data", dtype_id="H5T_NATIVE_UINT32", h5space=shandle)
H5Dwrite(dhandle, .Machine$integer.max + as.double(1:23))
H5Dclose(dhandle)
H5Sclose(shandle)
H5Gclose(ghandle)
H5Fclose(fhandle)
H5SparseMatrix(out_file, "m0")
## <25 x 12> sparse H5SparseMatrix object of type "integer":
## [,1] [,2] [,3] [,4] ... [,9] [,10] [,11] [,12]
## [1,] 0 0 0 0 . 0 0 0 0
## [2,] 0 0 0 0 . 0 0 0 2147483647
## [3,] 0 0 0 0 . 0 0 2147483647 0
## [4,] 0 0 0 0 . 0 2147483647 0 0
## [5,] 0 0 0 0 . 2147483647 0 0 0
## ... . . . . . . . . .
## [21,] 0 0 0 0 . 2147483647 0 0 0
## [22,] 0 0 0 0 . 0 2147483647 0 0
## [23,] 0 0 0 0 . 0 0 2147483647 0
## [24,] 0 0 0 0 . 0 0 0 2147483647
## [25,] 0 0 0 0 . 0 0 0 0
Sometimes I store sparse counts in the HDF5 file as unsigned 8-bit integers to save some space. This is fine but the subsequent
H5SparseMatrix
instance is not able to participate in arithmetic operations:This might be easily solved with a
type=
option, just like in theHDF5ArraySeed
constructor for the dense case.Or even better, a dedicated
as.integer=
option that treats all HDF5 integer types as R integers. This would allow me to just setas.integer=TRUE
and everything should work; otherwise, even with atype=
option, I need to first create theHDF5ArraySeed
with default arguments, check if it'stype(mat) == "raw"
, and then create it again withtype="integer"
. (Presumably, I can't just settype="integer"
all the time, otherwise bad things will happen for floating-point matrices.)Session information
``` R version 4.3.0 Patched (2023-05-04 r84398) Platform: aarch64-apple-darwin22.3.0 (64-bit) Running under: macOS Ventura 13.2.1 Matrix products: default BLAS: /Users/luna/Software/R/R-4-3-branch/lib/libRblas.dylib LAPACK: /Users/luna/Software/R/R-4-3-branch/lib/libRlapack.dylib; LAPACK version 3.11.0 locale: [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8 time zone: America/Los_Angeles tzcode source: internal attached base packages: [1] stats4 stats graphics grDevices utils datasets methods [8] base other attached packages: [1] HDF5Array_1.29.3 DelayedArray_0.27.9 SparseArray_1.1.10 [4] S4Arrays_1.1.4 IRanges_2.35.2 S4Vectors_0.39.1 [7] MatrixGenerics_1.13.0 matrixStats_1.0.0 BiocGenerics_0.47.0 [10] Matrix_1.5-4.1 rhdf5_2.45.0 loaded via a namespace (and not attached): [1] zlibbioc_1.47.0 lattice_0.21-8 rhdf5filters_1.13.3 [4] XVector_0.41.1 Rhdf5lib_1.23.0 grid_4.3.0 [7] compiler_4.3.0 tools_4.3.0 crayon_1.5.2 ```