Bioconductor / HDF5Array

HDF5 backend for DelayedArray objects
https://bioconductor.org/packages/HDF5Array
9 stars 13 forks source link

Direct converters from H5SparseMatrixSeed-containing DelayedMatrix to dgCMatrix #45

Open LTLA opened 2 years ago

LTLA commented 2 years ago

I'm trying to convert a H5SparseMatrixSeed-containing DelayedMatrix to a dgCMatrix. This calls:

selectMethod(coerce, c("DelayedMatrix", "dgCMatrix") )
## Method Definition:
## 
## function (from, to = "dgCMatrix", strict = TRUE)
## as(as(from, "SparseArraySeed"), "dgCMatrix")
## <bytecode: 0x55555e41dde0>
## <environment: namespace:DelayedArray>
## 
## Signatures:
##         from            to
## target  "DelayedMatrix" "dgCMatrix"
## defined "Array"         "dgCMatrix"

Well, okay. But then this calls:

selectMethod(coerce, c("DelayedMatrix", "SparseArraySeed") )
## Method Definition:
## 
## function (from, to = "SparseArraySeed", strict = TRUE)
## .BLOCK_dense2sparse(from)
## <bytecode: 0x55555e4180f0>
## <environment: namespace:DelayedArray>
## 
## Signatures:
##         from            to
## target  "DelayedMatrix" "SparseArraySeed"
## defined "DelayedArray"  "SparseArraySeed"

Directly coercing H5SparseMatrixSeeds to SparseArraySeeds isn't much better:

selectMethod(coerce, c("H5SparseMatrixSeed", "SparseArraySeed") )
## Method Definition:
## 
## function (from, to = "SparseArraySeed", strict = TRUE)
## dense2sparse(from)
## <bytecode: 0x55555e420978>
## <environment: namespace:DelayedArray>
## 
## Signatures:
##         from                 to
## target  "H5SparseMatrixSeed" "SparseArraySeed"
## defined "ANY"                "SparseArraySeed"```

All in all, this makes it painfully slow to load a H5SparseMatrix into memory after applying some operations on it (e.g., like slapping on dimnames, which is commonly what we get out of SummarizedExperiment::assay()).

The solution is to simply define the missing sparse-to-sparse methods - something like the below might work:

setMethod("coerce", c("H5SparseMatrixSeed", "SparseArraySeed"), function(from) {
    extract_sparse_array(from, list(NULL, NULL)) 
}) 

setMethod("coerce", c("DelayedMatrix", "SparseArraySeed"), function(from) {
    if (is_sparse(from)) {
        extract_sparse_array(from, list(NULL, NULL))
    } else {
        .BLOCK_dense2sparse(from)
    }
})

# Ideally we should be able to do something like this, to bypass the intermediate
# SparseArraySeed object and directly return a dgCMatrix. This would enable
# very efficient conversions from CSC_H5SparseMatrixSeed representations, 
# while also adjusting for the common occurrence of assay() adding dimnames.
setMethod("coerce", c("DelayedMatrix", "dgCMatrix"), function(from) {
    if (isPristine(from, ignore.dimnames=TRUE)) {
        output <- as(from@seed, "dgCMatrix")
        dimnames(output) <- dimnames(from)
        output
    } else {
        as(as(from, "SparseArraySeed"), "dgCMatrix")
    }
})
Session info ``` R version 4.1.0 (2021-05-18) Platform: x86_64-pc-linux-gnu (64-bit) Running under: Ubuntu 18.04.5 LTS Matrix products: default BLAS: /usr/local/lib/R/lib/libRblas.so LAPACK: /usr/local/lib/R/lib/libRlapack.so locale: [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C [3] LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8 [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8 [7] LC_PAPER=en_US.UTF-8 LC_NAME=C [9] LC_ADDRESS=C LC_TELEPHONE=C [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C attached base packages: [1] stats4 stats graphics grDevices utils datasets methods [8] base other attached packages: [1] HDF5Array_1.21.0 rhdf5_2.37.0 DelayedArray_0.19.1 [4] IRanges_2.27.0 S4Vectors_0.31.0 MatrixGenerics_1.5.1 [7] matrixStats_0.59.0 BiocGenerics_0.39.1 Matrix_1.3-3 loaded via a namespace (and not attached): [1] compiler_4.1.0 tools_4.1.0 rhdf5filters_1.5.0 grid_4.1.0 [5] lattice_0.20-44 Rhdf5lib_1.15.2 ```