Open Artur-man opened 1 month ago
For now I convert them to in memory sparse matrices quick, where I only need a single column of a large matrix for my project, and it is not that slow:
getMin <- function(data, ...){
if(inherits(data, "IterableMatrix")){
data <- as(data, "dgCMatrix")
}
return(min(data, ...))
}
getMax <- function(data, ...){
if(inherits(data, "IterableMatrix")){
data <- as(data, "dgCMatrix")
}
return(max(data, ...))
}
Hi @Artur-man, we have some operations along these lines which implement a subset of the widely-used matrixStats
interfaces. We don't currently have a global min
or max
function, but we do have colMaxs()
and rowMaxs()
which were added by @immanuelazn in PR 103. A few others are included in this documentation list. We will also have rowQuantiles()
and colQuantiles()
soon once PR 128 is merged.
A slightly hacky but more efficient way to define your operations would be max <- function(mat) max(colMaxs(mat))
and min <- function(mat) -1*max(colMaxs(-1*mat))
. A proper implementation to go in BPCells core would probably define short standalone functions in C++ that are then exposed to R similar to #103.
Conversion to dgCMatrix is not ideal, because converting to it causes crashes for matrices with more than 2^31 non-zero entries, which starts to become an issue around 1M cell matrices for RNA-seq datasets with 2k genes detected per cell. The suggestions I give above won't have those issue as the colMaxs()
core implementation uses the efficient internal C++ APIs.
We're happy to take PRs for more of these utility operations, including min()
or max()
if you're interested. It's a little finicky to maintain compatibility with the matrixStats
interface, but PR #103 is a good model of what's required. Happy to discuss more about the details if there's an operation you'd like to contribute. It can be a good area for new contributors to BPCells.
-Ben
Just to add on to what Ben had mentioned, BPCells also exposes methods to apply R functions to arrays, to each col/row of an IterableMatrix
(see apply_by_row/col
). While there is a small performance hit of the matrix iteration not all being done at the C++ level, it should still be much faster than switching over to a dgCMatrix
.
In the case of running max, you could do the following:
# initializing a dummy IterableMatrix
mat <- matrix(1:20, nrow = 2, ncol = 10) %>% as("dgCMatrix") %>% as("IterableMatrix")
# results are now as a list, one item for every column
res_list <- apply_by_col(mat, min)
# unlist and aggregate
res <- min(unlist(res_list))
ah thanks so much guys trying it now!!!
Hi,
Would it makes sense to define some easy operations on
IterableMatrix
,MatrixSubset
etc. ? and do you guys believe that there is an easy alternative I can use until these are (if planned) implemented ?