bnprks / BPCells

Scaling Single Cell Analysis to Millions of Cells
https://bnprks.github.io/BPCells
Other
166 stars 17 forks source link

Simple Operations on matrices: max, min, range etc. #142

Open Artur-man opened 1 month ago

Artur-man commented 1 month ago

Hi,

Would it makes sense to define some easy operations on IterableMatrix, MatrixSubset etc. ? and do you guys believe that there is an easy alternative I can use until these are (if planned) implemented ?

> mat <- matrix(1:20, nrow = 2, ncol = 10)
> max(mat)
[1] 20
> min(mat)
[1] 1
> range(mat)
[1]  1 20
> mat <- as(mat, "CsparseMatrix")
> max(mat)
[1] 20
> min(mat)
[1] 1
> range(mat)
[1]  1 20
Artur-man commented 1 month ago

For now I convert them to in memory sparse matrices quick, where I only need a single column of a large matrix for my project, and it is not that slow:

getMin <- function(data, ...){
  if(inherits(data, "IterableMatrix")){
    data <- as(data, "dgCMatrix")
  } 
  return(min(data, ...))
}

getMax <- function(data, ...){
  if(inherits(data, "IterableMatrix")){
    data <- as(data, "dgCMatrix")
  } 
  return(max(data, ...))
}
bnprks commented 1 month ago

Hi @Artur-man, we have some operations along these lines which implement a subset of the widely-used matrixStats interfaces. We don't currently have a global min or max function, but we do have colMaxs() and rowMaxs() which were added by @immanuelazn in PR 103. A few others are included in this documentation list. We will also have rowQuantiles() and colQuantiles() soon once PR 128 is merged.

A slightly hacky but more efficient way to define your operations would be max <- function(mat) max(colMaxs(mat)) and min <- function(mat) -1*max(colMaxs(-1*mat)). A proper implementation to go in BPCells core would probably define short standalone functions in C++ that are then exposed to R similar to #103.

Conversion to dgCMatrix is not ideal, because converting to it causes crashes for matrices with more than 2^31 non-zero entries, which starts to become an issue around 1M cell matrices for RNA-seq datasets with 2k genes detected per cell. The suggestions I give above won't have those issue as the colMaxs() core implementation uses the efficient internal C++ APIs.

We're happy to take PRs for more of these utility operations, including min() or max() if you're interested. It's a little finicky to maintain compatibility with the matrixStats interface, but PR #103 is a good model of what's required. Happy to discuss more about the details if there's an operation you'd like to contribute. It can be a good area for new contributors to BPCells.

-Ben

immanuelazn commented 1 month ago

Just to add on to what Ben had mentioned, BPCells also exposes methods to apply R functions to arrays, to each col/row of an IterableMatrix (see apply_by_row/col). While there is a small performance hit of the matrix iteration not all being done at the C++ level, it should still be much faster than switching over to a dgCMatrix.

In the case of running max, you could do the following:

# initializing a dummy IterableMatrix
mat <- matrix(1:20, nrow = 2, ncol = 10) %>% as("dgCMatrix") %>% as("IterableMatrix")
# results are now as a list, one item for every column
res_list <- apply_by_col(mat, min)
# unlist and aggregate
res <- min(unlist(res_list))
Artur-man commented 1 month ago

ah thanks so much guys trying it now!!!