Bioconductor / MatrixGenerics

S4 Generic Summary Statistic Functions that Operate on Matrix-Like Objects
https://bioconductor.org/packages/MatrixGenerics
12 stars 1 forks source link

Generics for a sparseMatrixStats package? #4

Closed const-ae closed 3 years ago

const-ae commented 4 years ago

Hi,

I have recently started to implement the matrixStats API for column sparse matrices (dgCMatrix) from the Matrix package. I saw the discussion on the BioC mailing list where to put the S4 generics for those functions and wondered what the current status of the discussion was?

(I am not quite sure if this is the right venue for this question, or if should rather post to the Bioc Mailing list, but I would appreciate any input :) )

Best Regards, Constantin

LTLA commented 4 years ago

I would say that the best short-term solution is to PR into DelayedMatrixStats. DMS is already used by a range of packages, so any improvements to performance will have an immediate impact without requiring downstream changes. By comparison, if you made your own package, we would have to figure out where the generics are defined, and then modify downstream packages to use them, etc.

The best long-term solution is to shift all generic definitions into Matrix, and each package that defines its own matrix class also defines methods for those generics. If that comes to pass, DMS as a package would dissolve completely with methods for DAs going back into DelayedArray and methods for sparse matrices falling into Matrix itself. However, this involves a lot more buy-in from a lot more people so the wheels will turn slowly.

I think all the affected parties are already watching this so I won't bother tagging specific individuals.

hpages commented 4 years ago

Hi Constantin,

A sparseMatrixStats package will be extremely useful! I would say that the easiest way for you to go is to define the matrixStats S4 generics in your package, like you've already started to do. Once your package is on CRAN or Bioconductor, it will be easy to modify DelayedMatrixStats to import the generics from sparseMatrixStats. Then we'll be able to take advantage of the methods you've implemented for dgCMatrix objects to boost block-processing of sparse DelayedMatrix objects.

H.

const-ae commented 4 years ago

Hi, thank you for the feedback. I get the impression that the question of the generics for matrixStats is still openly debated, so I will for now follow Hervé's advise and keep the generics defined in my package. @LTLA, I am not convinced that it is a good idea to merge the sparseMatrixStats functions into the DMS package, because I think that it would be confusing to have this functionality under a completely different name. But, I am open to discuss what package the best location for the generics would be (Matrix, MatrixGenerics, DelayedMatrixStats, sparseMatrixStats, ...).

Best Regards, Constantin

PeteHaitch commented 4 years ago

I think a dedicated sparseMatrixStats is simpler than making it a part of DelayedMatrixStats.

I'd be happy to DelayedMatrixStats to disappear if its functionality can be absorbed by more 'key' packages.

LTLA commented 4 years ago

Well, someone has to take responsibility for defining ANY methods for the generics. Without those, I can't use unwrapped versions of the functions in my packages. (Currently they are wrapped as, e.g., colVars(DelayedArray(X)).) And if they're not unwrapped, I lose any benefit of a specialized method, unless DMS is smart enough to dispatch on the seed's class if the object is pristine.

And the ANY method better not call as.matrix(), this would not go down well with other non-DA file backed matrices.

hpages commented 4 years ago

@LTLA Agreed.

@const-ae I think that if sparseMatrixStats imports matrixStats, then a minimalist setGeneric statement of the form setGeneric("rowMedians", signature="x") automatically sets matrixStats::rowMedians() as the ANY method. So that's really all what the setGeneric statements in sparseMatrixStats should be. Thanks!

const-ae commented 4 years ago

I am not sure yet that it is a good idea to just take turn the matrixStats functions into generics for two reasons:

First, the method signatures contain arguments that are purely for speed considerations and not every extension would want to implement (see for example the .dim argument).

Second, matrixStats can only handle matrix or vector. In that sense the ANY signature would be misleading. I think there is some value in having a wrapper in the ANY method, but @LTLA is of course right that just calling as.matrix() would not be a good idea for file backed matrices. But I think we could get around that problem by having a custom overloaded wrapper that can turn for example a non-DA file backed matrix into a DelayedMatrix or a dgTMatrix into a dgCMatrix.


# Code for the MatrixGenerics package

library(Matrix)

wrap_for_matrixstats <- function(x) UseMethod("wrap_for_matrixstats", x)

wrap_for_matrixstats.default <- function(x){
  print("Turn argument to matrix")
  tmp <- as.matrix(x)
  class(tmp) <- "matrix"
  tmp
}

setGeneric("rowMedians", signature = "x", function(x, na.rm=FALSE, ...){
  standardGeneric("rowMedians")
})
#> [1] "rowMedians"

setMethod("rowMedians", signature = c(x = "ANY"), function(x, na.rm = FALSE, ...){
  print("In the any")
  rowMedians(wrap_for_matrixstats(x), na.rm = na.rm, ...)
})

setMethod("rowMedians", signature = c(x = "matrix"),
          function(x, na.rm = FALSE, rows = NULL, cols = NULL, dim. = dim(x), ...){
            matrixStats::rowMedians(x, rows = rows, cols = cols, na.rm = na.rm, dim. = dim., ...)
          })

# Code for user
mat <- matrix(1:30, nrow=6, ncol=5)
rowMedians(mat)
#> [1] 13 14 15 16 17 18

new_mat <- mat
class(new_mat) <- "MyFancyNewClass"
rowMedians(new_mat)
#> [1] "In the any"
#> [1] "Turn argument to matrix"
#> [1] 13 14 15 16 17 18

# Code for sparseMatrixStats

setMethod("rowMedians", signature = c(x = "dgCMatrix"),
          function(x, na.rm = FALSE, ...){
            print("Do calculation optimized for sparse matrices")
          })
sp_mat <- as(mat, "dgCMatrix")
rowMedians(sp_mat)
#> [1] "Do calculation optimized for sparse matrices"

# Easy to extend to other sparse classes
different_sp_mat <- as(sp_mat, "dgTMatrix")
rowMedians(different_sp_mat)
#> [1] "In the any"
#> [1] "Turn argument to matrix"
#> [1] 13 14 15 16 17 18
wrap_for_matrixstats.dgTMatrix <- function(x){
  as(x, "dgCMatrix")
}
rowMedians(different_sp_mat)
#> [1] "In the any"
#> [1] "Do calculation optimized for sparse matrices"

Created on 2019-10-03 by the reprex package (v0.3.0)

This would mean that any new kind of matrix class would only have to implement the wrap_for_matrixstats() function to convert to a type for which an optimized implementation exists.

hpages commented 4 years ago

I agree that sometimes not all the arguments in the original function make sense for the generic. This can be handled with something like:

.default_rowProds <- function (x, na.rm=FALSE, method=c("direct",  "expSumLog"), ...)
{
    method <- match.arg(method)
    dots <- list(...)
    matrixStats::rowProds(x, rows=dots$rows, cols=dots$cols, na.rm=na.rm, method=method, ...)
}

setGeneric("rowProds", signature="x",
    function(x, na.rm=FALSE, method=c("direct",  "expSumLog"), ...) standardGeneric("rowProds"),
    useAsDefault=.default_rowProds
)

Note that this is equivalent to:

setGeneric("rowProds", signature="x",
    function(x, na.rm=FALSE, method=c("direct",  "expSumLog"), ...) standardGeneric("rowProds"))
setMethod("rowProds", "ANY", .default_rowProds)

About the default methods not supporting any object

Short answer: It's important to realize that, whatever you do, the default methods won't be able to support any object (many objects in R simply can't be turned into matrix-like objects).

Long answer: We are dealing with a situation where a bunch of functions (the functions defined in matrixStats) have been around and used for a while. All we want to do is promote these functions to S4 generics in a way that is as transparent and as backward compatible as possible. The standard/safe way to do this is to have the ANY method for these generics be the original function. Sometimes, like in the case where not all the arguments in the original function make it to the generic, it's not possible. So we have to come up with an ANY method that is a thin wrapper around the original function. Note that this thin wrapper doesn't alter the semantic of the original function in any way. In particular I think it's important to not mask the original function with an ANY method that tries to alter object x before calling the original function. Note that this situation is no different from what we do in BiocGenerics where we promote base R functions like unlist(), append(), table(), etc... to S4 generics. The setGeneric() statements we use for this are minimalist, and, if we need to drop some arguments in the generic function (e.g. for table()), we make the default method a wrapper that is as thin as possible and preserves the original semantic.

Hope this makes sense.

LTLA commented 4 years ago

And there's the rub.

  1. An ANY method must be available, otherwise downstream packages cannot safely use the generic with any object that claims to be a matrix-like.
  2. We can't coerce to a sparse matrix, because there is no guarantee that the input is sparse.
  3. Coercion to a DelayedArray requires a dependency on DMS. Difficult to do from CRAN.

If SMS is its own package, the least-worst solution is - unfortunately - to call as.matrix. Then at least downstream software would work with other matrices, albeit inefficiently.

More generally, I'm concerned about the fragmentation of "matrixStats" methods. If someone comes up with another matrix class in package A, and puts their *MatrixStats methods in package B, how do I make sure DMS-downstream packages use those methods? It seems that, if DMS isn't aware of B (and why should it be?), the user would be obliged to load in B themselves - not good.

Edit: We commented at the same time, so in response to Herve: the minimalistic nature of the BiocGenerics generics is fine because they only expect to work with Bioconductor objects. But if you have a colVars generic somewhere, I would expect that it would work with any matrix-like object.

hpages commented 4 years ago

Also this works for BiocGenerics generics because a specific method for object A is generally defined in the same package where class A itself is defined. So when you call foo() on an object of class A, it's very likely that the package where the foo() method for A objects is defined is already loaded. Thus you're right that in that aspect the situation with Matrix/matrixStats/sparseMatrixStats is different.

However, I do think that it's DelayedArray's and DelayedMatrixStats' job to keep track of relevant matrix representations that are likely to be used as DelayedArray object seeds. So at least DMS-downstream packages won't need to do anything and will benefit out-of-the-box from any optimized method.

But if you have a colVars generic somewhere, I would expect that it would work with any matrix-like object.

Even though I can see arguments in favor of this approach, as you mentioned it will have the downside that colVars() will be very very inefficient in some cases (at best, and in the worst case it will hang your system). I think I'd rather have it fail with an informative error message. Anyway, that should only be a concern for packages that don't import DMS.

const-ae commented 4 years ago

First of all, thank you both for taking the time to discuss and explain the issues to me.

In the following, I will try to summarize the different options that I understood so far, using an example to make the options clearer:

Let's imagine there is a Bioinformatician who wants to analyze some dataset with rows and columns using the SingleCellAnalysisPackage (SCAP). We develop SCAP and want to make sure it uses the most appropriate rowMeans()/rowVars() functions for the dataset. The dataset could be a standard matrix, a dgCMatrix, a DelayedMatrix, or even the yet unknown type NewMatrixType. The analysis consists of two steps: first the user loads the dataset representation into memory and then calls our SCAP functions.

To make sure that the best function for the dataset is called:

(1) the bioinformatician can explicitly load the correct MatrixGenerics implementation for her dataset (DelayedMatrixStats, sparseMatrixStats etc.). SCAP relies that the right implementation is available. (2) The function that loads the dataset makes sure that the corresponding MatrixGenerics implementation for the dataset is loaded. SCAP relies that the right implementation is available. (3) SCAP checks the input and explicitly calls the correct implementation. (4) SCAP converts all input to a common type, e.g. DelayedMatrix. (5) SCAP loads some MatrixGenerics implementations (e.g. DelayedMatrixStats, sparseMatrixStats) and calls the generic rowVars()which dispatches to the loaded MatrixGenerics implementations or fails for everything that is not a matrix. (6) Like (5), but instead of failing converts unknown input with as.matrix() which dispatches to matrixStats. (7) Like (5), but instead of failing converts unknown input with wrap_for_matrixstats() which dispatches to matrixStats, DelayedMatrixStats, or sparseMatrixStats depending on the return type of wrap_for_matrixstats().

If I understood correctly, @LTLA is currently using the pattern (4) in his packages, but sees pattern (6) as the best option. @hpages is in favor of pattern (5) because the user is explictly told how to change her input. I prefer pattern (7) because it makes it very easy to convert for example a dgTMatrix to a dgCMatrix without having to re-implement all 60 methods. (If I have misrepresented your positions, I am very sorry and please just post the correction.)

LTLA commented 4 years ago

Not sure what 3 is meant to be; if we could do that, we wouldn't have any problems...

Anyway, 7 just passes the buck to another level of S4 dispatch. It's good to use that within sMS to handle all Matrix classes without rewriting a lot of code, but it doesn't solve our NewMatrixType problem, which fundamentally involves resolving inter-package dependencies.

My personal S4 policy is to only define new methods for classes in packages where I either define the class or the generic. This ensures that we always have the method defined if a package is using a particular class with a particular generic. Thus, to me, the best solution would be:

  1. Move sMS method definitions to Matrix, along with the generics.
  2. Move DMS method definitions to DelayedArray.
  3. Downstream packages just load generics from Matrix.

@mmaechler would probably have some opinions on the viability of 1. But in the meantime, the next-best solution would be:

  1. Define a MatrixGenerics package that guarantees support for all Matrix classes, sparse or otherwise. Doesn't matter if you have to as.matrix them for now.
  2. Move DMS method definitions to DelayedArray.
  3. Downstream packages just load generics from MatrixGenerics.

And then mandate that, if you define your own matrix class, you had better define any methods in the same place, otherwise they won't get used by downstream packages. The key point here is to avoid people from defining a plethora of *MatrixStats packages that are distinct from the packages containing the class definitions - this is where we seem to be headed right now.

const-ae commented 4 years ago

Anyway, 7 just passes the buck to another level of S4 dispatch. It's good to use that within sMS to handle all Matrix classes without rewriting a lot of code, but it doesn't solve our NewMatrixType problem, which fundamentally involves resolving inter-package dependencies.

You are right, it wouldn't solve the inter-package dependencies problem, but it might make it a little bit more likely that the developer of a NewMatrixType package would just implement the one wrap_for_matrixstats() generic instead of having to implement all 69 functions of the matrixStats interface.

My personal S4 policy is to only define new methods for classes in packages where I either define the class or the generic.

Oh, okay. Yes that seems to be a good heuristic, which I will adopt for myself in the future. But I am quite doubtful that @mmaechler will include sMS in a core package :D

So, I agree with your "next best" solution. @PeteHaitch what do you think about reviving this repo and starting to implement the generic function definitions that can be used by both DMS / DelayedArray and sparseMatrixStats? If you want I could also go ahead and draft a PR.

PeteHaitch commented 4 years ago

Please do go ahead, @const-ae!

hpages commented 4 years ago

I'm still not sure I like that wrap_for_matrixstats(x) returns as.matrix(x) by default. First because if a user calls rowVars() on a DelayedMatrix object or derivative (e.g. TENxMatrix) but doesn't have the DelayedMatrixStats package on the search path, I'd rather have this fail rather than hang her system. Also because, even though as.matrix() works on things like an ordinary array, it does not work in a meaningful way in the context of matrix/array row/col summarization. For example base::rowMeans() works on arrays of at least two dimensions so some users might expect rowVars() to also work on that. However, instead of failing, it will work and return something wrong.

I understand that this could be mitigated by defining a wrap_for_matrixstats.array() method in the MatrixStats package but in the case of a DelayedMatrix object defining a wrap_for_matrixstats.DelayedMatrix() method wouldn't really help (it wouldn't make sense to define it in MatrixStats so we're back to the problem that the user doesn't necessarily have the package where this method is defined in her search path).

I think these problems are better handled by not providing wrap_for_matrixstats.default() or by defining it as a no-op. The only difference between the 2 solutions is where the error will occur: in the wrap_for_matrixstats() generic if no wrap_for_matrixstats.default() is provided:

a <- array(1:24, 4:2)
rowVars(a)
Error in UseMethod("wrap_for_matrixstats") : 
  no applicable method for 'wrap_for_matrixstats' applied to an object of class "c('array', 'integer', 'numeric')"

> rowVars(tenx)  # on a TENxMatrix object
Error in UseMethod("wrap_for_matrixstats") : 
  no applicable method for 'wrap_for_matrixstats' applied to an object of class "c('TENxMatrix', 'DelayedMatrix', 'DelayedArray', 'DelayedUnaryIsoOp', 'DelayedUnaryOp', 'DelayedOp', 'Array', 'DataTable', 'DataTable_OR_NULL')"

or in matrixStats::rowVars() if wrap_for_matrixstats.default() is provided as a no-op:

> rowVars(a)
Error in rowVars(a) : 
  Argument 'dim' must be an integer vector of length two.

> rowVars(tenx)
Error in rowVars(tenx) : Argument 'x' must be a matrix or a vector.

I don't have a strong preference but I do prefer that to coercion to matrix by default.

mmaechler commented 4 years ago

hmmm, as said earlier, I think it's a good idea in principle and I am "conceptually" willing to add things to Matrix for that purpose. Can you propose patches (as small as possible !!) ? (including NAMESPACE and man/*.Rd additions) ? Matrix is developed on R-forge (with svn / subversion), but you can take a svn2git version of its source if you want:

     https://r-forge.r-project.org/scm/?group_id=61

To get the svn'ed current version, from the webpage above, use

svn checkout svn://svn.r-forge.r-project.org/svnroot/matrix/ Matrix

Best, Martin

On Fri, Oct 4, 2019 at 6:08 PM Aaron Lun notifications@github.com wrote:

Not sure what 3 is meant to be; if we could do that, we wouldn't have any problems...

Anyway, 7 just passes the buck to another level of S4 dispatch. It's good to use that within sMS to handle all Matrix classes without rewriting a lot of code, but it doesn't solve our NewMatrixType problem, which fundamentally involves resolving inter-package dependencies.

My personal S4 policy is to only define new methods for classes in packages where I either define the class or the generic. This ensures that we always have the method defined if a package is using a particular class with a particular generic. Thus, to me, the best solution would be:

  1. Move sMS method definitions to Matrix, along with the generics.
  2. Move DMS method definitions to DelayedArray.
  3. Downstream packages just load generics from Matrix.

@mmaechler would probably have some opinions on the viability of 1. But in the meantime, the next-best solution would be:

  1. Define a MatrixGenerics package that guarantees support for all Matrix classes, sparse or otherwise. Doesn't matter if you have to as.matrix them for now.
  2. Move DMS method definitions to DelayedArray.
  3. Downstream packages just load generics from MatrixGenerics.

And then mandate that, if you define your own matrix class, you had better define any methods in the same place, otherwise they won't get used by downstream packages. The key point here is to avoid people from defining a plethora of *MatrixStats packages that are distinct from the packages containing the class definitions - this is where we seem to be headed right now.

-- You are receiving this because you were mentioned. Reply to this email directly or view it on GitHub: https://github.com/Bioconductor/MatrixGenerics/issues/4#issuecomment-538461322 [[alternative HTML version deleted]]

-- Martin Maechler@stat.math.ethz.ch https://stat.ethz.ch/~maechler Seminar für Statistik, ETH Zürich HG G 16 Rämistrasse 101 CH-8092 Zurich, SWITZERLAND phone: +41-44-632-3408 fax: ...-1228 <><

LTLA commented 4 years ago

I got the party started here - mostly to tackle the "double p" transition that I was discussing with @mmaechler offline - but others may benefit from forking this rather than repeating the git-svn transition themselves (it took about half an hour on my machine).

const-ae commented 4 years ago

I don't have a strong preference but I do prefer that to coercion to matrix by default.

@hpages, you make some good points and I think I now agree that an informative error message is more helpful for the user than some code that runs painfully slow or just randomly crashes because R runs out of memory.

const-ae commented 4 years ago

hmmm, as said earlier, I think it's a good idea in principle and I am "conceptually" willing to add things to Matrix for that purpose.

@mmaechler, that is great to hear, thank you for the input. Just to clarify what extend of changes you would be "conceptually" willing to accept?

mmaechler commented 4 years ago

hmmm, as said earlier, I think it's a good idea in principle and I am "conceptually" willing to add things to Matrix for that purpose.

@mmaechler, that is great to hear, thank you for the input. Just to clarify what extend of changes you would be "conceptually" willing to accept?

* The definition of the generics corresponding to the `matrixStats` API?

yes, if they don't give contradictions with current Matrix and base code.

* An implementation for those generics for the sparse matrices?

Here you mean methods , right? Yes, it should make sense to have methods for some of the new generics and the Matrix classes.

* An implementation for those generics along the lines I have already [drafted](https://github.com/const-ae/sparseMatrixStats) using C++ and Rcpp?

No, sorry. Matrix is so close to base R, I cannot deal with the consequences of making it depend on Rcpp. We could consider Matrix (pkg) methods which use a package in 'Suggests' and try to use that if it's available and bail out if not. But I'm also quite uneasy about such a setup. As long as I'm (now) practically the only author and maintainer of Matrix, I don't want to add stuff that make maintenance considerably more cumbersome. Be aware that even now some 'stats' pacakge function optionally load Matrix in order to use sparse matrices... so Matrix needs to stay very stable.

const-ae commented 4 years ago

As long as I'm (now) practically the only author and maintainer of Matrix, I don't want to add stuff that make maintenance considerably more cumbersome.

I very much understand and agree that with such a fundamental packages it is reasonable to be conservative.

I did have a look at the colSums() implementation in Matrix and unfortunately I am lacking the depth of knowledge in C to implement something similar for the other functions. Thus I am afraid, that I won't be able to contribute the method implementations to the Matrix package. If anyone else would be willing to do this, that would of course be great. Otherwise, I would just continue the implementation in my sparseMatrixStats package and would be in favor of putting the S4 generics for the matrixStats API in the MatrixGenerics repo. Or do you (@LTLA, @hpages, @PeteHaitch, @mmaechler) think that it would still make sense to have the generics in Matrix?

mmaechler commented 4 years ago

I'm not entirely happy with your conclusion @const-ae. Yes, it would've been great if you'd be willing to learn the relatively litte more C needed to these in C rather than via the Rcpp detour (which for the programmer is a shortcut, of course!). But I did mention the other alternative of providing those methods differently, namely via simple R code wrappers that may call into your package code if that's present. Or even simple pure R methods which work, are easy to maintain but just slower than compiled ones. Correct functionality is much more important than speed, and so first having correct methods (plus tests for them), and then add "faster method implementations" on a TODO list, e.g. for people who like to contribute is a very good strategy.

IIRC one goal has been to make life much easier for useRs and developers of other (Matrix / matrixStats / sparseMatrixStats using) packages all of who would know where to import the generics from (namely 'Matrix').

LTLA commented 4 years ago

@const-ae I think you should just start by picking one generic - say, colVars or rowVars, which is immediately useful to a lot of people - and try to implement it for dgCMatrixs. I have worked directly with the C API before (PROTECT and UNPROTECT, registering routines, etc.) so I can give some help with writing it up in C; and so can most other participants in this discussion. This should give us a good idea of the feasibility/strategy of proceeding to the other generics.

I mean, like, C is just C++, but -1, so how hard can it be? :) R_alloc even handles freeing for you.

hpages commented 4 years ago

Here is my take on this: https://github.com/hpages/sparseMatrixColVars/blob/master/src/colVars.c

No need to alloc anything besides the vector to return.

const-ae commented 4 years ago

Thank you all again for your input.

I did give it a try and wrote for the colVars() function the generic and some pure R methods that call matrixStats if it is available: https://github.com/const-ae/Matrix/blob/master/pkg/Matrix/R/colVars.R

However, I am struggling right now a bit to actually rebuild the package. Is there an equivalent for devtools::load_all() available? Or what kind of workflow do you follow when you develop the package?

@hpages, thanks for the code sample. That does indeed look much more readable and something I could do myself. I think I just got intimidated by all the macros in the existing t_gCMatrix_colSums.c file.

PeteHaitch commented 3 years ago

Closing now that we have sparseMatrixStats.