Closed const-ae closed 4 years ago
Hi Constantin,
Note that we need to have MatrixGenerics depend on matrixStats otherwise the generics will get masked if someone loads matrixStats after loading MatrixGenerics:
library(MatrixGenerics)
rowMedians
# standardGeneric for "rowMedians" defined from package "MatrixGenerics"
#
# function (x, na.rm = FALSE, ...)
# standardGeneric("rowMedians")
# <bytecode: 0x2bacbd0>
# <environment: 0x2b9e9b8>
# Methods may be defined for arguments: x
# Use showMethods("rowMedians") for currently available ones.
library(matrixStats)
#
# Attaching package: ‘matrixStats’
#
# The following objects are masked from ‘package:MatrixGenerics’:
#
# colMedians, colWeightedMedians, rowMedians, rowWeightedMedians
rowMedians
# function (x, rows = NULL, cols = NULL, na.rm = FALSE, dim. = dim(x), ...)
# {
# dim. <- as.integer(dim.)
# na.rm <- as.logical(na.rm)
# has_nas <- TRUE
# .Call(C_rowMedians, x, dim., rows, cols, na.rm, has_nas,
# TRUE)
# }
# <bytecode: 0x3169180>
# <environment: namespace:matrixStats>
Then the rowMeans()
method for matrix objects simply becomes:
setMethod("rowMedians", "matrix",
function(x, na.rm=FALSE, ...) matrixStats::rowMedians(x, na.rm=na.rm, ...)
)
Also there is no need for a default method that issues an error saying basically the same thing than the built-in error one gets when dispatch cannot find a method for the supplied object:
Error in (function (classes, fdef, mtable) :
unable to find an inherited method for function ‘rowMeans’ for signature ‘"dgCMatrix"’
Thanks for your work on this.
H.
Thanks for your work on this, Constantin!
I removed the parameters
rows
,cols
, and.dim
from the generic. To me, those seem to be specific performance optimizations provided by matrixStats, rather than essential parameters that every possible downstream implementation should provide.
I agree that aren't essential but I guess I'd like to encourage that methods implement these.
Thank you both for your feedback.
@hpages, I added matrixStats to Import instead of Suggests in the DESCRIPTION file.
@PeteHaitch, I have added the rows
and cols
argument to the generic signature. I have also added an additional implementation if x is a vector (which I previously forgot) that can handle the dim.
argument.
@const-ae Sorry I didn't explain myself well. I think not having them in the generic is okay (for the reason you mention) but would like to aim to implement methods (e.g., those for DelayedArray and HDF5Array) to support these
Hi Pete,
Thanks for clarifying.
Note that since [
is delayed for DelayedArray and HDF5Array objects, something like rowMedians(x[rows, cols])
would probably be good enough from a performance point of view. Actually if the rowMedians()
methods for DelayedArray or HDF5Array objects were to support the rows
and cols
args, they would probably do just that. It's not immediately obvious to me how they could do something much smarter than that, at least in the general case.
H.
Thanks for the clarification Pete.
However, now I am hessitant to change back and remove rows
and cols
from the generic signature. In particular because I am not sure what is the best way to handle the order of the method arguments without them. In matrixStats
the order usually is x
, rows
, cols
, na.rm
(if applicable), dims.
, and the ellipsis (...
).
My initial suggestion basically was:
setGeneric("rowMedians", function(x, na.rm=FALSE, ...) standardGeneric("rowMedians"),signature = "x")
This means that the method for x of type matrix
would have to be
setMethod("rowMedians", signature = "matrix", function(x, na.rm=FALSE, rows = NULL, cols = NULL, ...){
matrixStats::rowMedians(x, rows = rows, cols = cols, na.rm=na.rm, ...)
})
if I want to document the existince of the rows
and cols
arguments.
Note the changed order of arguments. If the arguments had the order of the original matrixStats
function, rows
would get the argument for the na.rm
.
However, this has the problem that this approach breaks existing method calls for matrixStats
that don't use named arguments.
In conclusion, I think I cannot avoid having rows
and cols
as second and third argument for the generic, if I want to include na.rm
in the signature (Sorry for the confusion).
Hi Constantin,
I'm not a big fan of having the rows
and cols
arguments in the generics but I don't have a strong opinion on this. I think the risk of breaking existing code if we don't include them in the generic is probably very low. Would only happen if (1) the code is in a package that doesn't make proper use of NAMESPACE and (2) the code doesn't name the arguments when calling rowMedians()
. So would only break code that deserves it ;-)
Anyway, up to you and Pete. H.
Thanks for the input. I still feel slightly uneasy about not including the rows
and cols
argument. However, if Hervé says that incompatibility probably won't be a big issue, I am inclined to trust him. I would leave it up to you Pete to decide :)
And if you guys decide to drop rows
and cols
, it's fine to blame me later if it causes problems so you're covered ;-)
And if you guys decide to drop rows and cols, it's fine to blame me later if it causes problems so you're covered ;-)
That's an offer too good to pass up ;) Dropping them from the generic but aiming to implement them when possible/sensible sounds good to me.
I changed the signature again and removed rows
and cols
from the generic.
If we all agree now with the general structure of the generic, I will go ahead now and define the other generics.
A few more things:
There are 2 approaches for the arguments of the methods. Either you stick to the same argument list as in the generic:
setMethod("rowMedians", "matrix",
function(x, na.rm=FALSE, ...)
matrixStats::rowMedians(x, na.rm=na.rm, ...)
)
or you explicitly list all the method-specific arguments:
setMethod("rowMedians", "matrix",
function(x, na.rm=FALSE, rows=NULL, cols=NULL, dim.=dim(x))
matrixStats::rowMedians(x, na.rm=na.rm, rows=rows, cols=cols, dim.=dim.)
)
Two advantages of the 2nd approach:
...
) at the end of their argument list even though it's not used. I'm not sure what drove this design but it has the very unfortunate consequence that misspelled arguments are silently ignored:
> matrixStats::rowMedians(matrix(1:12, ncol=3), dims=c(2, 6))
[1] 5 6 7 8
The 2nd approach addresses that:
> rowMedians(matrix(1:12, ncol=3), dims=c(2, 6))
Error in .local(x, na.rm, ...) : unused argument (dims = c(2, 6))
Two inconveniences of the 2nd approach:
The body of the method gets actually wrapped inside the "infamous .local()
hack":
> selectMethod("rowMedians", "matrix")
Method Definition:
function (x, na.rm = FALSE, ...)
{
.local <- function (x, na.rm = FALSE, rows = NULL, cols = NULL,
dim. = dim(x))
matrixStats::rowMedians(x, na.rm = na.rm, rows = rows, cols = cols,
dim. = dim.)
.local(x, na.rm, ...)
}
Signatures:
x
target "matrix"
defined "matrix"
This makes debugging a little harder.
Anyway, to me, the benefit of catching misspelled arguments outweighs the inconvenience of having a little extra work to do to define the method and of making debugging a little harder.
dim.
in method for numeric
vectorsWith no default value, one gets:
> rowMedians(1:12)
Error in matrixStats::rowMedians(x, na.rm = na.rm, dim. = dim., ...) :
argument "dim." is missing, with no default
With a default value of dim(x)
(like in matrixStats::rowMedians
), one gets:
> rowMedians(1:12)
Error in matrixStats::rowMedians(x, na.rm = na.rm, dim. = dim., ...) :
Argument 'dim' must be an integer vector of length two.
This is consistent with matrixStats::rowMedians
:
> matrixStats::rowMedians(1:12)
Error in matrixStats::rowMedians(1:12) :
Argument 'dim' must be an integer vector of length two.
Alternatively rowMedians()
could treat the vector as a 1-column matrix by default (that's what the matrix()
constructor does). This is achieved by having dim.
set to c(length(x), 1L)
by default:
> rowMedians(1:12)
[1] 1 2 3 4 5 6 7 8 9 10 11 12
,
) in your setGeneric
statement is not needed:setGeneric("rowMedians", function(x, na.rm=FALSE, ...) standardGeneric("rowMedians"),
signature="x",
)
Thanks!
Thank you again @hpages for your input.
Maybe, @HenrikBengtsson can weigh in why he chose the to add ...
to every function and if it is safe to leave it out of the method signature for x
of type matrix
.
For the default of dim.
with x
of type numeric
, I like the error message that complaints about missing dim.
better, because it feels misleading to have a default argument for the function that I know will fail at runtime. However, I also don't want to just decide that a vector is equivalent to a matrix with one column, because it differs from the behavior of matrixStats
.
@PeteHaitch, can you confirm that the implementation is okay now?
If so, I would go ahead with the other functions :)
I'm out office until Monday but will take a look then
Methods that provide an implemention for the S4 for generic are encouraged to handle the
rows
/cols
subsetting.
Can we rephrase with something like:
When implementing a method for their own matrix-like objects, developers are encouraged to support the
rows
andcols
arguments.
Also in:
The S4 methods for
x
of typematrix
ornumeric
are dispatched tomatrixStats::rowMedians
I would just use "are calling". Also for completeness, refer to matrixStats::colMedians
:
The S4 methods for
x
of typematrix
ornumeric
are callingmatrixStats::rowMedians
(matrixStats::colMedians
).
And in "see also" section:
matrixStats::rowMedians()
andmatrixStats::colMedians()
which are used for ‘matrix’ input.
something like:
matrixStats::rowMedians()
andmatrixStats::colMedians()
which are used when the input is amatrix
orvector
.
Thanks again.
Now that I've had time to look more closely, I'm again slightly worried that not having rows
and cols
in the signature changes the argument order compared to that in matrixStats.
I feel like I'm going around in circles ...
In DelayedMatrixStats I took the approach of treating the matrixStats function's arguments as the de facto arguments in the generic, with anything else via ...
, on the basis of historical precedence.
A method needn't make use of all arguments in the signature (it may even report a message that it's ignoring certain arguments) but I think having 'one signature to rule them all' makes life easier for the user.
A side note, @const-ae you may want to look into roxygen2's templated and inherited params. I make use of these in DelayedMatrixStats to reduce the documentation I have to write
E.g., https://github.com/PeteHaitch/DelayedMatrixStats/blob/master/R/AllGenerics.R#L219 (inherited params) and https://github.com/PeteHaitch/DelayedMatrixStats/blob/master/R/colMedians.R#L41 and https://github.com/PeteHaitch/DelayedMatrixStats/blob/master/man-roxygen/lowercase_x.R (templated params)
I feel like I'm going around in circles ...
Haha, I feel you :D
A compromise could be to only make arguments after na.rm
optional. Ie.
setGeneric("rowMedians", function(x, rows = NULL, cols = NULL, na.rm=FALSE, ...) standardGeneric("rowMedians"),
signature = "x"
)
setMethod("rowMedians", signature = "matrix", function(x, rows = NULL, cols = NULL, na.rm=FALSE, dim. = dim(x)){
matrixStats::rowMedians(x, rows = rows, cols = cols, na.rm=na.rm, dim. = dim.)
})
This would mean that new implementations would need to handle rows
and cols
arguments. However, it can, as you say, always just throw an error. I think, this would be compatible with DelayedMatrixStats
because it doesn't change the order of the arguments and doesn't put an undue burden on new packages.
A side note, @const-ae you may want to look into roxygen2's templated and inherited params.
Thanks, I was aware of the interited params. But I didn't realize how easy it is to create your own templates. This is definitely something we should use here. (However, maybe we should first settle on the signature :))
To me defining generics is also a one-time opportunity to start fresh with a clean signature. Anyway, time to move forward.
Can I suggest that we decide of a small subset of generics (e.g. the dozen or so most frequently used row/col summarization functions in Bioconductor) and try to get a fully working matrixStats/MatrixGenerics/sparseMatrixStats/DelayedMatrixStats stack with this subset, rather than aiming at getting the full set in MatrixGenerics/sparseMatrixStats? This will give us something concrete to play with and it'll be easier to make adjustments if the set is small. Once we are happy with what we have, we'll extend to the full set.
Based on some quick grep
-based inspection, the matrixStats operations used by Bioconductor software packages are (looking at the row*
functions only):
rowMedians
, rowVars
, rowSds
, rowMaxs
, rowMins
, rowMeans2
, rowSums2
rowRanges
(expect some conflicts with rowRanges
accessor for RangedSummarizedExperiment objects), rowRanks
, rowQuantiles
, rowMads
, rowIQRs
rowAlls
, rowCumsums
, rowWeightedMeans
, rowAnyNAs
A good subset for prototyping could be the 7 heavily used ones + rowRanges
+ the corresponding col*
functions. So 16 functions in total (instead of 74).
I have added the 16 functions that Hervè suggested to MatrixGenerics
.
I have also created a new branch for the sparseMatrixStats
package that implements those generics: https://github.com/const-ae/sparseMatrixStats/compare/alternative_proposal
I created a PR for the DelayedMatrixStats
package that shows how it could use MatrixGenerics
functions: https://github.com/PeteHaitch/DelayedMatrixStats/pull/59
My impression is that it is pretty straightforward to make sparseMatrixStats
and DelayedMatrixStats
depend on MatrixGenerics
. If you agree, I would go ahead and implement the others functions and make MatrixGenerics
ready for submission on Bioconductor/CRAN.
Thanks @const-ae .
A few more things (again, sorry):
Calling rowMedians()
on an ordinary array should also call matrixStats::rowMedians()
. Easiest/cleanest way to do this is to simply replace "matrix"
with "array"
in the method for matrix
.
In order to avoid the code repetition between the method for matrix
and method for numeric
, I would strongly suggest to factorize the definition of the method e.g. by doing something like:
.default_rowMedians <-
function(x, rows=NULL, cols=NULL, na.rm=FALSE, dim.=dim(x))
matrixStats::rowMedians(x, rows=rows, cols=cols, na.rm=na.rm, dim.=dim.)
setMethod("rowMedians", signature="array", .default_rowMedians)
setMethod("rowMedians", signature="numeric", .default_rowMedians)
You need import(matrixStats)
in NAMESPACE
.
Hi @hpages,
I've added support for array
input and also refactored the code to use the .default_rowMedians
function.
However, I have kept the explicit method for x of type matrix
although I know it is redundant. Nonetheless, I believe that it makes the documentation easier to understand for a newcomer who may not know that matrix
is an array
.
Could you provide some information why it is necessary to explicitly import matrixStats
in the NAMESPACE
if I always refer to the functions with the double colon notation?
Hi,
earlier today, @PeteHaitch gave me write access to the repository. If it is okay for you, I will merge this PR in the next days. In the next weeks, I will try to add the remaining functions following the pattern that is implemented for the existing 16 functions.
My goal is to submit the package to Bioconductor in the next month so that DelayedMatrixStats
and SparseMatrixStats
have enough time to implement the generics in the next Bioconductor release.
Best, Constantin
Yep, sounds good @const-ae
Could you provide some information why it is necessary to explicitly import matrixStats in the NAMESPACE if I always refer to the functions with the double colon notation?
Maybe it's not (I thought R CMD check
would issue a warning or a note but apparently not). Oh and I see now that this is what DelayedMatrixStats is doing too. Note that this means that matrixStats doesn't get loaded via a namespace after loading MatrixGenerics:
> library(MatrixGenerics)
> sessionInfo()
R Under development (unstable) (2019-10-30 r77336)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 16.04.6 LTS
Matrix products: default
BLAS: /home/hpages/R/R-4.0.r77336/lib/libRblas.so
LAPACK: /home/hpages/R/R-4.0.r77336/lib/libRlapack.so
locale:
[1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
[3] LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8
[5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
[7] LC_PAPER=en_US.UTF-8 LC_NAME=C
[9] LC_ADDRESS=C LC_TELEPHONE=C
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
attached base packages:
[1] stats graphics grDevices utils datasets methods base
other attached packages:
[1] MatrixGenerics_0.0.99
loaded via a namespace (and not attached):
[1] compiler_4.0.0
The loading of matrixStats is delayed until it's actually needed:
colMaxs(matrix(11:20, ncol=2))
# [1] 15 20
Now sessionInfo()
shows it:
> sessionInfo()
R Under development (unstable) (2019-10-30 r77336)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 16.04.6 LTS
Matrix products: default
BLAS: /home/hpages/R/R-4.0.r77336/lib/libRblas.so
LAPACK: /home/hpages/R/R-4.0.r77336/lib/libRlapack.so
locale:
[1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
[3] LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8
[5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
[7] LC_PAPER=en_US.UTF-8 LC_NAME=C
[9] LC_ADDRESS=C LC_TELEPHONE=C
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
attached base packages:
[1] stats graphics grDevices utils datasets methods base
other attached packages:
[1] MatrixGenerics_0.0.99
loaded via a namespace (and not attached):
[1] compiler_4.0.0 matrixStats_0.55.0
Another notable difference is that this makes installation of MatrixGenerics a lot more quiet i.e. we no longer see the usual bunch of messages Creating a new generic function for ‘rowMaxs’ in package ‘MatrixGenerics’ for each generic defined in the package that masks a function defined in matrixStats. Maybe that's actually a good thing. Nobody pays attention to these messages anyway (they don't really say anything important or useful).
I believe that it makes the documentation easier to understand for a newcomer who may not know that matrix is an array.
A matrix has always been an array, except in R ;-) So I don't think newcomers would ever imagine that a matrix is not an array. The only people to think so are people who have been using R for too long ;-)
Thanks!
Hi Peter,
I finally got around to write down my alternative proposal for the matrix stats generics. After the idea to add the generics directly to the
Matrix
package has stalled, I went ahead and implemented the generics here based on the discussion in https://github.com/Bioconductor/MatrixGenerics/issues/4.To summarize, I made the following design decisions:
ANY
fails with an error message instead of coercing the input tomatrix
. This was suggested by Hervé Pagès and I found it convincing because there is the danger to crash the users computer if a large matrix is provided, so the user should rather explicitly coerce the input themselves.matrix
thematrixStats
package is called if it is installed. Otherwise a error message is provided.rows
,cols
, and.dim
from the generic. To me, those seem to be specific performance optimizations provided bymatrixStats
, rather than essential parameters that every possible downstream implementation should provide.If you like the proposal, I would go ahead and implement the other functions in the same style.