Bioconductor / DelayedArray

A unified framework for working transparently with on-disk and in-memory array-like datasets
https://bioconductor.org/packages/DelayedArray
24 stars 9 forks source link

Add row/colsum fallbacks for non-DA/base matrix classes. #57

Closed LTLA closed 4 years ago

LTLA commented 4 years ago

Pretty much as it says. The aim is to provide some slow inefficient fallbacks for these generics so that they can operate on sparse matrices (or anything with a colSums and rowSums method).

Perhaps these won't live in DelayedArray in the long term, but this is a good enough location for the time being; it allows me to actually use these generics in my various packages. Right now I'm wrapping colsum in another scater generic (.colsum) in order to catch the non-DA/non-base case.

hpages commented 4 years ago

This should not be necessary. DelayedArray:::.BLOCK_rowsum() and DelayedArray:::.BLOCK_colsum() are fast and work on anything that satisfies the "seed contract" (i.e. supports dim(), dimnames(), and extract_array()). That includes sparse matrices:

library(Matrix)
library(DelayedArray)

rowsum_ANY <- function(x, group, reorder=TRUE, ...)
{
    by.group <- split(seq_len(nrow(x)), group)
    out <- lapply(by.group, function(i) colSums(x[i,,drop=FALSE]))
    out <- do.call(rbind, out)
    if (!reorder) {
        out <- out[unique(group),,drop=FALSE]
    }
    out
}

m <- matrix(runif(1.2e8), nrow=12000)
m0 <- as(ifelse(m <= 0.05, m, 0), "dgCMatrix")  # 5% non-zero values
group <- sample(777, nrow(m0), replace=TRUE)

system.time(rs1 <- DelayedArray:::.BLOCK_rowsum(m0, group))
#    user  system elapsed 
#   0.884   0.927   0.941 

system.time(rs2 <- rowsum_ANY(m0, group))
#    user  system elapsed 
#  12.968   0.040  13.008 

identical(rs1, rs2)
# [1] TRUE

So either you wrap your sparse matrix in a DelayedArray object and call rowsum() on it (which is equivalent to calling DelayedArray:::.BLOCK_rowsum() directly on the sparse matrix):

system.time(rs3 <- rowsum(DelayedArray(m0), group))
#    user  system elapsed 
#   0.979   0.822   0.950 

identical(rs1, rs3)
# [1] TRUE

or, if that's not convenient enough, we could just temporarily add the 2 following lines to DelayedArray (until the sparseMatrixStats package provides a native implementation for this):

setMethod("rowsum", "sparseMatrix", .BLOCK_rowsum)
setMethod("colsum", "sparseMatrix", .BLOCK_colsum)
LTLA commented 4 years ago

Oh, good.

or, if that's not convenient enough, we could just temporarily add the 2 following lines to DelayedArray (until the sparseMatrixStats package provides a native implementation for this):

Well, if it's going to be a stopgap, I would like a fallback ANY method that works for any matrix input. Seems like the current ANY method should really be the matrix method, and the true ANY method would actually call .BLOCK_*sum.

hpages commented 4 years ago

Seems like the current ANY method should really be the matrix method, and the true ANY method would actually call .BLOCK_*sum.

Tempting but I don't like the smell of this. I believe in the strong principle of keeping the base function the default method. I don't think the R/CRAN/Bioconductor democracy can work if package developers start feeling that they know what THE default method should do and thus take possession of it. What would prevent other packages from doing the same thing? This would quickly become messy. In other words, we don't own base::rowsum() so we need to be careful about these things.

That being said, I'm not saying that you won't find places in the S4Vectors/IRanges/GenomicRanges stack or in one of my other packages where I'm doing exactly what I'm preaching against (either because I didn't know what I was doing at the time or because it was introduced by someone else ;-) )

LTLA commented 4 years ago

Hm. Sounds like I should just keep on using my .colsum until a general solution becomes available.

hpages commented 4 years ago

Or you can wrap whatever matrix-like object you have inside a DelayedArray shell and pass it to colsum().