Bioconductor / HDF5Array

HDF5 backend for DelayedArray objects
https://bioconductor.org/packages/HDF5Array

`H5SparseMatrix` coercion to `dgCMatrix` discards NAs #61

Open LTLA opened 7 months ago

LTLA commented 7 months ago

Possibly the real issue lies in SparseArray, but this is where I first encountered it, so I'll just post it here.

library(HDF5Array)

# Slight modification of ?writeTENxMatrix with an extra NaN in the corner.
m0 <- matrix(0, nrow=25, ncol=12,
    dimnames=list(letters[1:25], LETTERS[1:12]))
m0[cbind(2:24, c(12:1, 2:12))] <- 100 + sample(55, 23, replace=TRUE)
m0[1] <- NaN
out_file <- tempfile()
M0 <- writeTENxMatrix(m0, out_file, group="m0")

# Adding some trivial operation, presumably to force it to use the general
# DelayedArray->dgCMatrix coercion method, instead of TENxMatrix's specialization.
M1 <- M0 * 10
M1[1] # still NaN, good
## [1] NaN

# Now coercing and our NaN disappears.
M2 <- as(M1, "dgCMatrix")
M2[1] # ???
## [1] 0

As you can see, our NaN is lost: structurally, there isn't even a triplet at that location in the dgCMatrix. I'd speculate that the coercion filters zeros out of the nzdata and the NA/NaNs get screened out along with them, e.g., by which(), which drops any position where the comparison evaluates to NA.
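To illustrate the kind of filtering I have in mind, here is a minimal base-R sketch. It is not taken from the SparseArray or HDF5Array sources; the vector `x` and the names `idx_lossy` and `idx_safe` are made up for illustration, and whether the real coercion code does anything like this is only a guess.

```r
# Hypothetical nonzero data for a single column (illustration only,
# not the actual internals of the coercion).
x <- c(NaN, 0, 105, NA, 0, 132)

# `x != 0` evaluates to NA for NaN/NA entries, and which() only reports
# positions that are TRUE, so the missing values are dropped with the zeros.
idx_lossy <- which(x != 0)

# Explicitly treating missing values as "nonzero" keeps them in the result.
idx_safe <- which(is.na(x) | x != 0)

print(idx_lossy)
print(idx_safe)
```

If something along the lines of `idx_lossy` is used to decide which entries survive, the NaN at position 1 would never make it into the triplets, which would match what I'm seeing above.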

Session information

```
R Under development (unstable) (2023-11-10 r85507)
Platform: x86_64-pc-linux-gnu
Running under: Ubuntu 20.04.6 LTS

Matrix products: default
BLAS:   /home/luna/Software/R/trunk/lib/libRblas.so
LAPACK: /home/luna/Software/R/trunk/lib/libRlapack.so;  LAPACK version 3.11.0

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C
 [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8
 [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8
 [7] LC_PAPER=en_US.UTF-8       LC_NAME=C
 [9] LC_ADDRESS=C               LC_TELEPHONE=C
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C

time zone: America/Los_Angeles
tzcode source: system (glibc)

attached base packages:
[1] stats4    stats     graphics  grDevices utils     datasets  methods
[8] base

other attached packages:
 [1] HDF5Array_1.31.0      rhdf5_2.47.0          DelayedArray_0.29.0
 [4] SparseArray_1.3.1     S4Arrays_1.3.0        abind_1.4-5
 [7] IRanges_2.37.0        S4Vectors_0.41.1      MatrixGenerics_1.15.0
[10] matrixStats_1.1.0     BiocGenerics_0.49.1   Matrix_1.6-2

loaded via a namespace (and not attached):
[1] zlibbioc_1.49.0     lattice_0.22-5      rhdf5filters_1.15.1
[4] XVector_0.43.0      Rhdf5lib_1.25.0     grid_4.4.0
[7] compiler_4.4.0      tools_4.4.0         crayon_1.5.2
```