Closed LTLA closed 3 years ago
Yes I'd like to add something like this. With the as.sparse
argument added to the HDF5ArraySeed()
and HDF5Array()
constructor functions.
FWIW I spent the last 24h adding the as.sparse
arg to HDFArray::h5mread()
(the workhorse behind extract_array()
). A SparseArraySeed object is returned when as.sparse=TRUE
. The idea is to create the sparse representation as early as possible (at the chunk level). This saves a significant amount of memory compared to going sparse late:
library(HDF5Array)
library(ExperimentHub)
hub <- ExperimentHub()
fname1 <- hub[["EH1040"]] # "dense" TENxBrainData dataset
index <- list(1:10000, 1:20000) # a big block!
## Going sparse late:
sas1 <- dense2sparse(h5mread(fname1, "counts", index)) # uses 2.463g (as reported by 'top')
## Going sparse early:
sas2 <- h5mread(fname1, "counts", index, as.sparse=TRUE) # uses 1.259g (as reported by 'top')
Unfortunately as.sparse=TRUE
is also slower at the moment: times are 4.332s and 9.205s for the above calls on my laptop. The data is sparse enough and the chunks are pretty small for this dataset so it's an ideal situation. Not sure what's going on, need to investigate.
Because I'd really really like to have h5mread(..., as.sparse=TRUE)
use less memory and be faster. Should be possible. Then making this available to the user via HDF5Array(..., as.sparse=TRUE)
would make plenty of sense.
With HDF5Array 1.17.7:
library(HDF5Array)
library(ExperimentHub)
hub <- ExperimentHub()
fname1 <- hub[["EH1040"]] # "dense" TENxBrainData dataset
index <- list(1:10000, 1:20000) # a big block!
## Going sparse late:
sas1 <- dense2sparse(h5mread(fname1, "counts", index)) # 4.461 s / uses 2.430g (so says 'top')
## Going sparse early:
sas2 <- h5mread(fname1, "counts", index, as.sparse=TRUE) # 4.781 s / uses 1.227g (so says 'top')
Going sparse early is no longer significantly slower than going sparse late but it uses half the memory. So we're good to go for an as.sparse
argument to HDF5ArraySeed()
and HDF5Array()
. I'll add it tomorrow.
Cool. Hopefully the colSums()
is also faster now that there's no longer a need for the explicit conversion.
Maybe it will be faster on HDF5Array objects created with as.sparse=TRUE
. But even if it's not, the deal here is that since it uses less memory you can make the blocks bigger or use more workers. This in turn should make things faster but only if the client code knows how to take advantage of the new as.sparse
feature.
Got an additional speed improvement from 4.781 s to 3.964 s since yesterday. Sweet! Now going sparse early clearly beats going sparse late on both aspects, memory usage and speed.
This is done. Not much of an improvement for colSums()
and family but a nice one for sum()
, max()
, etc... See https://github.com/Bioconductor/DelayedArray/commit/ce0f30801ddbd1328a209ca382241da544b952b0 for why.
Check it out:
library(TENxPBMCData)
sce <- TENxPBMCData("pbmc4k")
mat <- counts(sce)
mult <- matrix(runif(ncol(mat)*50), ncol=50)
system.time(val <- mat %*% mult)
## user system elapsed
## 10.011 0.508 10.552
alt <- seed(mat)
alt@as_sparse <- TRUE
mat2 <- DelayedArray(alt)
system.time(val2 <- mat2 %*% mult)
## user system elapsed
## 1.783 0.210 1.996
So for what I actually care about:
library(BiocSingular)
set.seed(100)
system.time(thing1 <- runPCA(mat, rank=50, BSPARAM=RandomParam(deferred=TRUE)))
## user system elapsed
## 92.121 4.699 97.292
set.seed(100)
system.time(thing2 <- runPCA(mat2, rank=50, BSPARAM=RandomParam(deferred=TRUE)))
## user system elapsed
## 15.979 1.980 18.025
Very nice.
Yes! I like that. I'm adding an is_sparse()
setter for HDF5Array/HDF5ArraySeed objects so the user can temporarily switch from one mode to the other without doing direct slot access.
This is motivated by the fact that, aside from the data extraction from file, the other major time sink is the matrix multiplication process in memory. As multiplication passes over each submatrix multiple times, we can save time by doing a one-off conversion to a sparse format if we know beforehand that the input data is truly sparse.
To demonstrate, we modify the Seed class slightly:
And then add:
Also install Bioconductor/DelayedArray#67. Then, we could run:
Which is pretty nice. The only downside is that the conversion will slow down single-pass procedures:
If this is unacceptable, we could specialize
%*%
,crossprod
andtcrossprod
to set some global flag so thatis_sparse()
only returnsTRUE
if@sparse
is true and the global flag is also set.