Bioconductor / MatrixGenerics

S4 Generic Summary Statistic Functions that Operate on Matrix-Like Objects
https://bioconductor.org/packages/MatrixGenerics
12 stars 1 forks source link

Implement alternative generics for row/colMedians() #5

Closed const-ae closed 4 years ago

const-ae commented 4 years ago

Hi Peter,

I finally got around to write down my alternative proposal for the matrix stats generics. After the idea to add the generics directly to the Matrix package has stalled, I went ahead and implemented the generics here based on the discussion in https://github.com/Bioconductor/MatrixGenerics/issues/4.

To summarize, I made the following design decisions:

If you like the proposal, I would go ahead and implement the other functions in the same style.

hpages commented 4 years ago

Hi Constantin,

Note that we need to have MatrixGenerics depend on matrixStats otherwise the generics will get masked if someone loads matrixStats after loading MatrixGenerics:

library(MatrixGenerics)

rowMedians
# standardGeneric for "rowMedians" defined from package "MatrixGenerics"
#
# function (x, na.rm = FALSE, ...) 
# standardGeneric("rowMedians")
# <bytecode: 0x2bacbd0>
# <environment: 0x2b9e9b8>
# Methods may be defined for arguments: x
# Use  showMethods("rowMedians")  for currently available ones.

library(matrixStats)
#
# Attaching package: ‘matrixStats’
#
# The following objects are masked from ‘package:MatrixGenerics’:
#
#   colMedians, colWeightedMedians, rowMedians, rowWeightedMedians

rowMedians
# function (x, rows = NULL, cols = NULL, na.rm = FALSE, dim. = dim(x), ...) 
# {
#     dim. <- as.integer(dim.)
#     na.rm <- as.logical(na.rm)
#     has_nas <- TRUE
#     .Call(C_rowMedians, x, dim., rows, cols, na.rm, has_nas, 
#         TRUE)
# }
# <bytecode: 0x3169180>
# <environment: namespace:matrixStats>

Then the rowMeans() method for matrix objects simply becomes:

setMethod("rowMedians", "matrix",
    function(x, na.rm=FALSE, ...) matrixStats::rowMedians(x, na.rm=na.rm, ...)
)

Also there is no need for a default method that issues an error saying basically the same thing than the built-in error one gets when dispatch cannot find a method for the supplied object:

Error in (function (classes, fdef, mtable)  : 
  unable to find an inherited method for function ‘rowMeans’ for signature ‘"dgCMatrix"’

Thanks for your work on this.

H.

PeteHaitch commented 4 years ago

Thanks for your work on this, Constantin!

I removed the parameters rows, cols, and .dim from the generic. To me, those seem to be specific performance optimizations provided by matrixStats, rather than essential parameters that every possible downstream implementation should provide.

I agree that aren't essential but I guess I'd like to encourage that methods implement these.

const-ae commented 4 years ago

Thank you both for your feedback.

@hpages, I added matrixStats to Import instead of Suggests in the DESCRIPTION file.

@PeteHaitch, I have added the rows and cols argument to the generic signature. I have also added an additional implementation if x is a vector (which I previously forgot) that can handle the dim. argument.

PeteHaitch commented 4 years ago

@const-ae Sorry I didn't explain myself well. I think not having them in the generic is okay (for the reason you mention) but would like to aim to implement methods (e.g., those for DelayedArray and HDF5Array) to support these

hpages commented 4 years ago

Hi Pete,

Thanks for clarifying.

Note that since [ is delayed for DelayedArray and HDF5Array objects, something like rowMedians(x[rows, cols]) would probably be good enough from a performance point of view. Actually if the rowMedians() methods for DelayedArray or HDF5Array objects were to support the rows and cols args, they would probably do just that. It's not immediately obvious to me how they could do something much smarter than that, at least in the general case.

H.

const-ae commented 4 years ago

Thanks for the clarification Pete. However, now I am hessitant to change back and remove rows and cols from the generic signature. In particular because I am not sure what is the best way to handle the order of the method arguments without them. In matrixStats the order usually is x, rows, cols, na.rm (if applicable), dims., and the ellipsis (...).

My initial suggestion basically was:

setGeneric("rowMedians", function(x, na.rm=FALSE, ...) standardGeneric("rowMedians"),signature = "x")

This means that the method for x of type matrix would have to be

setMethod("rowMedians", signature = "matrix", function(x, na.rm=FALSE, rows = NULL, cols = NULL, ...){
  matrixStats::rowMedians(x, rows = rows, cols = cols, na.rm=na.rm, ...)
})

if I want to document the existince of the rows and cols arguments.

Note the changed order of arguments. If the arguments had the order of the original matrixStats function, rows would get the argument for the na.rm.

However, this has the problem that this approach breaks existing method calls for matrixStats that don't use named arguments.

In conclusion, I think I cannot avoid having rows and cols as second and third argument for the generic, if I want to include na.rm in the signature (Sorry for the confusion).

hpages commented 4 years ago

Hi Constantin,

I'm not a big fan of having the rows and cols arguments in the generics but I don't have a strong opinion on this. I think the risk of breaking existing code if we don't include them in the generic is probably very low. Would only happen if (1) the code is in a package that doesn't make proper use of NAMESPACE and (2) the code doesn't name the arguments when calling rowMedians(). So would only break code that deserves it ;-)

Anyway, up to you and Pete. H.

const-ae commented 4 years ago

Thanks for the input. I still feel slightly uneasy about not including the rows and cols argument. However, if Hervé says that incompatibility probably won't be a big issue, I am inclined to trust him. I would leave it up to you Pete to decide :)

hpages commented 4 years ago

And if you guys decide to drop rows and cols, it's fine to blame me later if it causes problems so you're covered ;-)

PeteHaitch commented 4 years ago

And if you guys decide to drop rows and cols, it's fine to blame me later if it causes problems so you're covered ;-)

That's an offer too good to pass up ;) Dropping them from the generic but aiming to implement them when possible/sensible sounds good to me.

const-ae commented 4 years ago

I changed the signature again and removed rows and cols from the generic. If we all agree now with the general structure of the generic, I will go ahead now and define the other generics.

hpages commented 4 years ago

A few more things:

What arguments from the generic should go in the methods

There are 2 approaches for the arguments of the methods. Either you stick to the same argument list as in the generic:

setMethod("rowMedians", "matrix",
    function(x, na.rm=FALSE, ...)
        matrixStats::rowMedians(x, na.rm=na.rm, ...)
)

or you explicitly list all the method-specific arguments:

setMethod("rowMedians", "matrix",
    function(x, na.rm=FALSE, rows=NULL, cols=NULL, dim.=dim(x))
        matrixStats::rowMedians(x, na.rm=na.rm, rows=rows, cols=cols, dim.=dim.)
)

Two advantages of the 2nd approach:

Two inconveniences of the 2nd approach:

Anyway, to me, the benefit of catching misspelled arguments outweighs the inconvenience of having a little extra work to do to define the method and of making debugging a little harder.

Default for dim. in method for numeric vectors

With no default value, one gets:

> rowMedians(1:12)
Error in matrixStats::rowMedians(x, na.rm = na.rm, dim. = dim., ...) : 
  argument "dim." is missing, with no default

With a default value of dim(x) (like in matrixStats::rowMedians), one gets:

> rowMedians(1:12)
Error in matrixStats::rowMedians(x, na.rm = na.rm, dim. = dim., ...) : 
  Argument 'dim' must be an integer vector of length two.

This is consistent with matrixStats::rowMedians:

> matrixStats::rowMedians(1:12)
Error in matrixStats::rowMedians(1:12) : 
  Argument 'dim' must be an integer vector of length two.

Alternatively rowMedians() could treat the vector as a 1-column matrix by default (that's what the matrix() constructor does). This is achieved by having dim. set to c(length(x), 1L) by default:

> rowMedians(1:12)
 [1]  1  2  3  4  5  6  7  8  9 10 11 12

The last comma (,) in your setGeneric statement is not needed:

setGeneric("rowMedians", function(x, na.rm=FALSE, ...) standardGeneric("rowMedians"),
            signature="x",
)

Thanks!

const-ae commented 4 years ago

Thank you again @hpages for your input.

Maybe, @HenrikBengtsson can weigh in why he chose the to add ... to every function and if it is safe to leave it out of the method signature for x of type matrix.

For the default of dim. with x of type numeric, I like the error message that complaints about missing dim. better, because it feels misleading to have a default argument for the function that I know will fail at runtime. However, I also don't want to just decide that a vector is equivalent to a matrix with one column, because it differs from the behavior of matrixStats.

const-ae commented 4 years ago

@PeteHaitch, can you confirm that the implementation is okay now?

If so, I would go ahead with the other functions :)

PeteHaitch commented 4 years ago

I'm out office until Monday but will take a look then

hpages commented 4 years ago

Methods that provide an implemention for the S4 for generic are encouraged to handle the rows/cols subsetting.

Can we rephrase with something like:

When implementing a method for their own matrix-like objects, developers are encouraged to support the rows and cols arguments.

Also in:

The S4 methods for x of type matrix or numeric are dispatched to matrixStats::rowMedians

I would just use "are calling". Also for completeness, refer to matrixStats::colMedians:

The S4 methods for x of type matrix or numeric are calling matrixStats::rowMedians (matrixStats::colMedians).

And in "see also" section:

matrixStats::rowMedians() and matrixStats::colMedians() which are used for ‘matrix’ input.

something like:

matrixStats::rowMedians() and matrixStats::colMedians() which are used when the input is a matrix or vector.

Thanks again.

PeteHaitch commented 4 years ago

Now that I've had time to look more closely, I'm again slightly worried that not having rows and cols in the signature changes the argument order compared to that in matrixStats. I feel like I'm going around in circles ...

In DelayedMatrixStats I took the approach of treating the matrixStats function's arguments as the de facto arguments in the generic, with anything else via ..., on the basis of historical precedence. A method needn't make use of all arguments in the signature (it may even report a message that it's ignoring certain arguments) but I think having 'one signature to rule them all' makes life easier for the user.

PeteHaitch commented 4 years ago

A side note, @const-ae you may want to look into roxygen2's templated and inherited params. I make use of these in DelayedMatrixStats to reduce the documentation I have to write

E.g., https://github.com/PeteHaitch/DelayedMatrixStats/blob/master/R/AllGenerics.R#L219 (inherited params) and https://github.com/PeteHaitch/DelayedMatrixStats/blob/master/R/colMedians.R#L41 and https://github.com/PeteHaitch/DelayedMatrixStats/blob/master/man-roxygen/lowercase_x.R (templated params)

const-ae commented 4 years ago

I feel like I'm going around in circles ...

Haha, I feel you :D

A compromise could be to only make arguments after na.rm optional. Ie.

setGeneric("rowMedians", function(x, rows = NULL, cols = NULL, na.rm=FALSE, ...) standardGeneric("rowMedians"),
           signature = "x"
)

setMethod("rowMedians", signature = "matrix", function(x, rows = NULL, cols = NULL, na.rm=FALSE, dim. = dim(x)){
  matrixStats::rowMedians(x, rows = rows, cols = cols, na.rm=na.rm, dim. = dim.)
})

This would mean that new implementations would need to handle rows and cols arguments. However, it can, as you say, always just throw an error. I think, this would be compatible with DelayedMatrixStats because it doesn't change the order of the arguments and doesn't put an undue burden on new packages.

const-ae commented 4 years ago

A side note, @const-ae you may want to look into roxygen2's templated and inherited params.

Thanks, I was aware of the interited params. But I didn't realize how easy it is to create your own templates. This is definitely something we should use here. (However, maybe we should first settle on the signature :))

hpages commented 4 years ago

To me defining generics is also a one-time opportunity to start fresh with a clean signature. Anyway, time to move forward.

Can I suggest that we decide of a small subset of generics (e.g. the dozen or so most frequently used row/col summarization functions in Bioconductor) and try to get a fully working matrixStats/MatrixGenerics/sparseMatrixStats/DelayedMatrixStats stack with this subset, rather than aiming at getting the full set in MatrixGenerics/sparseMatrixStats? This will give us something concrete to play with and it'll be easier to make adjustments if the set is small. Once we are happy with what we have, we'll extend to the full set.

Based on some quick grep-based inspection, the matrixStats operations used by Bioconductor software packages are (looking at the row* functions only):

A good subset for prototyping could be the 7 heavily used ones + rowRanges + the corresponding col* functions. So 16 functions in total (instead of 74).

const-ae commented 4 years ago

I have added the 16 functions that Hervè suggested to MatrixGenerics.

I have also created a new branch for the sparseMatrixStats package that implements those generics: https://github.com/const-ae/sparseMatrixStats/compare/alternative_proposal

const-ae commented 4 years ago

I created a PR for the DelayedMatrixStats package that shows how it could use MatrixGenerics functions: https://github.com/PeteHaitch/DelayedMatrixStats/pull/59

const-ae commented 4 years ago

My impression is that it is pretty straightforward to make sparseMatrixStats and DelayedMatrixStats depend on MatrixGenerics. If you agree, I would go ahead and implement the others functions and make MatrixGenerics ready for submission on Bioconductor/CRAN.

hpages commented 4 years ago

Thanks @const-ae .

A few more things (again, sorry):

const-ae commented 4 years ago

Hi @hpages,

I've added support for array input and also refactored the code to use the .default_rowMedians function.

However, I have kept the explicit method for x of type matrix although I know it is redundant. Nonetheless, I believe that it makes the documentation easier to understand for a newcomer who may not know that matrix is an array.

Could you provide some information why it is necessary to explicitly import matrixStats in the NAMESPACE if I always refer to the functions with the double colon notation?

const-ae commented 4 years ago

Hi,

earlier today, @PeteHaitch gave me write access to the repository. If it is okay for you, I will merge this PR in the next days. In the next weeks, I will try to add the remaining functions following the pattern that is implemented for the existing 16 functions. My goal is to submit the package to Bioconductor in the next month so that DelayedMatrixStats and SparseMatrixStats have enough time to implement the generics in the next Bioconductor release.

Best, Constantin

hpages commented 4 years ago

Yep, sounds good @const-ae

Could you provide some information why it is necessary to explicitly import matrixStats in the NAMESPACE if I always refer to the functions with the double colon notation?

Maybe it's not (I thought R CMD check would issue a warning or a note but apparently not). Oh and I see now that this is what DelayedMatrixStats is doing too. Note that this means that matrixStats doesn't get loaded via a namespace after loading MatrixGenerics:

> library(MatrixGenerics)
> sessionInfo()
R Under development (unstable) (2019-10-30 r77336)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 16.04.6 LTS

Matrix products: default
BLAS:   /home/hpages/R/R-4.0.r77336/lib/libRblas.so
LAPACK: /home/hpages/R/R-4.0.r77336/lib/libRlapack.so

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
 [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8    
 [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
 [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
 [9] LC_ADDRESS=C               LC_TELEPHONE=C            
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] MatrixGenerics_0.0.99

loaded via a namespace (and not attached):
[1] compiler_4.0.0

The loading of matrixStats is delayed until it's actually needed:

colMaxs(matrix(11:20, ncol=2))
# [1] 15 20

Now sessionInfo() shows it:

> sessionInfo()
R Under development (unstable) (2019-10-30 r77336)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 16.04.6 LTS

Matrix products: default
BLAS:   /home/hpages/R/R-4.0.r77336/lib/libRblas.so
LAPACK: /home/hpages/R/R-4.0.r77336/lib/libRlapack.so

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
 [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8    
 [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
 [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
 [9] LC_ADDRESS=C               LC_TELEPHONE=C            
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] MatrixGenerics_0.0.99

loaded via a namespace (and not attached):
[1] compiler_4.0.0     matrixStats_0.55.0

Another notable difference is that this makes installation of MatrixGenerics a lot more quiet i.e. we no longer see the usual bunch of messages Creating a new generic function for ‘rowMaxs’ in package ‘MatrixGenerics’ for each generic defined in the package that masks a function defined in matrixStats. Maybe that's actually a good thing. Nobody pays attention to these messages anyway (they don't really say anything important or useful).

I believe that it makes the documentation easier to understand for a newcomer who may not know that matrix is an array.

A matrix has always been an array, except in R ;-) So I don't think newcomers would ever imagine that a matrix is not an array. The only people to think so are people who have been using R for too long ;-)

Thanks!