Closed LTLA closed 4 years ago
This should not be necessary. DelayedArray:::.BLOCK_rowsum()
and DelayedArray:::.BLOCK_colsum()
are fast and work on anything that satisfies the "seed contract" (i.e. supports dim()
, dimnames()
, and extract_array()
). That includes sparse matrices:
library(Matrix)
library(DelayedArray)
rowsum_ANY <- function(x, group, reorder=TRUE, ...)
{
by.group <- split(seq_len(nrow(x)), group)
out <- lapply(by.group, function(i) colSums(x[i,,drop=FALSE]))
out <- do.call(rbind, out)
if (!reorder) {
out <- out[unique(group),,drop=FALSE]
}
out
}
m <- matrix(runif(1.2e8), nrow=12000)
m0 <- as(ifelse(m <= 0.05, m, 0), "dgCMatrix") # 5% non-zero values
group <- sample(777, nrow(m0), replace=TRUE)
system.time(rs1 <- DelayedArray:::.BLOCK_rowsum(m0, group))
# user system elapsed
# 0.884 0.927 0.941
system.time(rs2 <- rowsum_ANY(m0, group))
# user system elapsed
# 12.968 0.040 13.008
identical(rs1, rs2)
# [1] TRUE
So either you wrap your sparse matrix in a DelayedArray object and call rowsum()
on it (which is equivalent to calling DelayedArray:::.BLOCK_rowsum()
directly on the sparse matrix):
system.time(rs3 <- rowsum(DelayedArray(m0), group))
# user system elapsed
# 0.979 0.822 0.950
identical(rs1, rs3)
# [1] TRUE
or, if that's not convenient enough, we could just temporarily add the 2 following lines to DelayedArray (until the sparseMatrixStats package provides a native implementation for this):
setMethod("rowsum", "sparseMatrix", .BLOCK_rowsum)
setMethod("colsum", "sparseMatrix", .BLOCK_colsum)
Oh, good.
or, if that's not convenient enough, we could just temporarily add the 2 following lines to DelayedArray (until the sparseMatrixStats package provides a native implementation for this):
Well, if it's going to be a stopgap, I would like a fallback ANY
method that works for any matrix input. Seems like the current ANY
method should really be the matrix
method, and the true ANY
method would actually call .BLOCK_*sum
.
Seems like the current ANY method should really be the matrix method, and the true ANY method would actually call .BLOCK_*sum.
Tempting but I don't like the smell of this. I believe in the strong principle of keeping the base function the default method. I don't think the R/CRAN/Bioconductor democracy can work if package developers start feeling that they know what THE default method should do and thus take possession of it. What would prevent other packages from doing the same thing? This would quickly become messy. In other words, we don't own base::rowsum()
so we need to be careful about these things.
That being said, I'm not saying that you won't find places in the S4Vectors/IRanges/GenomicRanges stack or in one of my other packages where I'm doing exactly what I'm preaching against (either because I didn't know what I was doing at the time or because it was introduced by someone else ;-) )
Hm. Sounds like I should just keep on using my .colsum
until a general solution becomes available.
Or you can wrap whatever matrix-like object you have inside a DelayedArray shell and pass it to colsum()
.
Pretty much as it says. The aim is to provide some slow inefficient fallbacks for these generics so that they can operate on sparse matrices (or anything with a
colSums
androwSums
method).Perhaps these won't live in DelayedArray in the long term, but this is a good enough location for the time being; it allows me to actually use these generics in my various packages. Right now I'm wrapping
colsum
in another scater generic (.colsum
) in order to catch the non-DA/non-base case.