Bioconductor / SparseArray

High-performance sparse data representation and manipulation in R

Feature request: Method for coercing delayed array to SVT_SparseArray #13

Open ekageyama opened 2 weeks ago

ekageyama commented 2 weeks ago

Currently it is tricky to convert a DelayedArray to the SVT_SparseArray format, since there is no coercion method (default or specialized) for this.

hpages commented 2 weeks ago

One way to do this at the moment is to go through the COO_SparseArray representation, i.e. to do as(as(<DelayedArray>, "COO_SparseArray"), "SVT_SparseArray"). The first coercion uses block processing, so it won't necessarily be very efficient. The second coercion (from COO_SparseArray to SVT_SparseArray) should be quite efficient though.
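For illustration, here is a minimal sketch of that two-step coercion on a small toy object. The random sparse matrix, the seed, and the variable names (m, da, coo, svt) are assumptions made up for this example and are not part of the original report. It also assumes installed versions of DelayedArray and SparseArray that provide the coercions described above.

```r
library(Matrix)
library(DelayedArray)
library(SparseArray)

# Toy input (hypothetical): a 1000 x 500 sparse matrix with ~1% nonzero values,
# wrapped in a DelayedArray object.
set.seed(42)
m <- rsparsematrix(1000, 500, density = 0.01)
da <- DelayedArray(m)

# Step 1: DelayedArray -> COO_SparseArray (uses block processing).
coo <- as(da, "COO_SparseArray")

# Step 2: COO_SparseArray -> SVT_SparseArray (expected to be the cheap step).
svt <- as(coo, "SVT_SparseArray")

class(svt)
dim(svt)
```

On a real on-disk dataset, step 1 is likely to dominate the running time, since it has to read the data block by block.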

But yeah, we should be able to just do as(<DelayedArray>, "SVT_SparseArray") or realize(<DelayedArray>) (the latter will soon be modified to return an SVT_SparseArray when the DelayedArray object is sparse). This is on my TODO list.

FYI I recently added specialized coercion methods to go from TENxMatrix, H5ADMatrix, H5SparseMatrix, TENxMatrixSeed, CSC_H5ADMatrixSeed, and CSC_H5SparseMatrixSeed, to SVT_SparseMatrix. These are quite efficient. Also they can handle big sparse datasets (i.e. datasets with more than 2^31-1 nonzero values) like the "1.3 Million Brain Cell Dataset" from 10x Genomics, as long as your machine has enough RAM:

library(HDF5Array)
library(ExperimentHub)
hub <- ExperimentHub()
fname <- hub[["EH1039"]]
oneM <- TENxMatrix(fname, group="mm10")
svt <- as(oneM, "SVT_SparseMatrix")  # takes about 1.5 min and uses about 22 GB of RAM

This is with HDF5Array 1.33.3 (latest devel version).