Bioconductor / HDF5Array

HDF5 backend for DelayedArray objects
https://bioconductor.org/packages/HDF5Array
9 stars 13 forks source link

Treating dense HDF5Matrices as sparse for greater efficiency #33

Closed LTLA closed 3 years ago

LTLA commented 3 years ago

This is motivated by the fact that, aside from the data extraction from file, the other major time sink is the matrix multiplication process in memory. As multiplication passes over each submatrix multiple times, we can save time by doing a one-off conversion to a sparse format if we know beforehand that the input data is truly sparse.

To demonstrate, we modify the Seed class slightly:

setClass("HDF5ArraySeed",
    contains="Array",
    representation(
        filepath="character",       # Absolute path to the HDF5 file so the
                                    # object doesn't break when the user
                                    # changes the working directory (e.g.
                                    # with setwd()).
                                    # The path must also be in its canonical
                                    # form so paths from different objects
                                    # can be compared (required by
                                    # quickResaveHDF5SummarizedExperiment()).
        name="character",           # Name of the dataset in the HDF5 file.
        type="character",           # NA or the wanted type.
        dim="integer",
        chunkdim="integer_OR_NULL",
        first_val="ANY",            # First value in the dataset.
        sparse="logical"
    ),
    prototype(
        type=NA_character_,
        sparse=FALSE
    )
)

And then add:

setMethod("is_sparse", "HDF5ArraySeed", function(x) x@sparse)

setMethod("extract_sparse_array", "HDF5ArraySeed", function(x, index) {
    dense2sparse(extract_array(x, index))
})

Also install Bioconductor/DelayedArray#67. Then, we could run:

library(TENxPBMCData)
sce <- TENxPBMCData("pbmc4k")

mat <- counts(sce)
mult <- matrix(runif(ncol(mat)*50), ncol=50)
system.time(val <- mat %*% mult)
##    user  system elapsed 
##   6.858   0.264   7.123 

alt <- seed(mat)
alt@sparse <- TRUE
mat2 <- DelayedArray(alt)
system.time(val2 <- mat2 %*% mult)
##    user  system elapsed 
##  3.022   0.192   3.214 

Which is pretty nice. The only downside is that the conversion will slow down single-pass procedures:

system.time(colSums(mat))
##    user  system elapsed 
##   0.979   0.188   1.167 
system.time(colSums(mat2))
##    user  system elapsed 
##   2.302   0.136   2.437 

If this is unacceptable, we could specialize %*%, crossprod and tcrossprod to set some global flag so that is_sparse() only returns TRUE if @sparse is true and the global flag is also set.

hpages commented 3 years ago

Yes I'd like to add something like this. With the as.sparse argument added to the HDF5ArraySeed() and HDF5Array() constructor functions.

FWIW I spent the last 24h adding the as.sparse arg to HDFArray::h5mread() (the workhorse behind extract_array()). A SparseArraySeed object is returned when as.sparse=TRUE. The idea is to create the sparse representation as early as possible (at the chunk level). This saves a significant amount of memory compared to going sparse late:

library(HDF5Array)
library(ExperimentHub)
hub <- ExperimentHub()
fname1 <- hub[["EH1040"]]  # "dense" TENxBrainData dataset

index <- list(1:10000, 1:20000)  # a big block!
## Going sparse late:
sas1 <- dense2sparse(h5mread(fname1, "counts", index))  # uses 2.463g (as reported by 'top')
## Going sparse early:
sas2 <- h5mread(fname1, "counts", index, as.sparse=TRUE)  # uses 1.259g (as reported by 'top')

Unfortunately as.sparse=TRUE is also slower at the moment: times are 4.332s and 9.205s for the above calls on my laptop. The data is sparse enough and the chunks are pretty small for this dataset so it's an ideal situation. Not sure what's going on, need to investigate.

Because I'd really really like to have h5mread(..., as.sparse=TRUE) use less memory and be faster. Should be possible. Then making this available to the user via HDF5Array(..., as.sparse=TRUE) would make plenty of sense.

hpages commented 3 years ago

With HDF5Array 1.17.7:

library(HDF5Array)
library(ExperimentHub)
hub <- ExperimentHub()
fname1 <- hub[["EH1040"]]  # "dense" TENxBrainData dataset

index <- list(1:10000, 1:20000)  # a big block!
## Going sparse late:
sas1 <- dense2sparse(h5mread(fname1, "counts", index))  # 4.461 s / uses 2.430g (so says 'top')
## Going sparse early:
sas2 <- h5mread(fname1, "counts", index, as.sparse=TRUE)  # 4.781 s / uses 1.227g (so says 'top')

Going sparse early is no longer significantly slower than going sparse late but it uses half the memory. So we're good to go for an as.sparse argument to HDF5ArraySeed() and HDF5Array(). I'll add it tomorrow.

LTLA commented 3 years ago

Cool. Hopefully the colSums() is also faster now that there's no longer a need for the explicit conversion.

hpages commented 3 years ago

Maybe it will be faster on HDF5Array objects created with as.sparse=TRUE. But even if it's not, the deal here is that since it uses less memory you can make the blocks bigger or use more workers. This in turn should make things faster but only if the client code knows how to take advantage of the new as.sparse feature.

Got an additional speed improvement from 4.781 s to 3.964 s since yesterday. Sweet! Now going sparse early clearly beats going sparse late on both aspects, memory usage and speed.

hpages commented 3 years ago

This is done. Not much of an improvement for colSums() and family but a nice one for sum(), max(), etc... See https://github.com/Bioconductor/DelayedArray/commit/ce0f30801ddbd1328a209ca382241da544b952b0 for why.

LTLA commented 3 years ago

Check it out:

library(TENxPBMCData)
sce <- TENxPBMCData("pbmc4k")

mat <- counts(sce)
mult <- matrix(runif(ncol(mat)*50), ncol=50)
system.time(val <- mat %*% mult)
##    user  system elapsed
##  10.011   0.508  10.552

alt <- seed(mat)
alt@as_sparse <- TRUE
mat2 <- DelayedArray(alt)
system.time(val2 <- mat2 %*% mult)
##    user  system elapsed
##   1.783   0.210   1.996

So for what I actually care about:

library(BiocSingular)
set.seed(100)
system.time(thing1 <- runPCA(mat, rank=50, BSPARAM=RandomParam(deferred=TRUE)))
##    user  system elapsed
##  92.121   4.699  97.292

set.seed(100)
system.time(thing2 <- runPCA(mat2, rank=50, BSPARAM=RandomParam(deferred=TRUE)))
##    user  system elapsed
##  15.979   1.980  18.025

Very nice.

hpages commented 3 years ago

Yes! I like that. I'm adding an is_sparse() setter for HDF5Array/HDF5ArraySeed objects so the user can temporarily switch from one mode to the other without doing direct slot access.

hpages commented 3 years ago

Done: https://github.com/Bioconductor/HDF5Array/commit/46ae107bb82f05df9278e2aafdab48487cb21be2