Bioconductor / MatrixGenerics

S4 Generic Summary Statistic Functions that Operate on Matrix-Like Objects
https://bioconductor.org/packages/MatrixGenerics
12 stars 1 forks source link

Dispatch on methods for matrix objects when table objects are supplied #15

Closed hpages closed 4 years ago

hpages commented 4 years ago

With this change, MatrixGenerics::rowVars(table(a=letters, A=LETTERS)) works.

const-ae commented 4 years ago

Hey, great idea. If I understand correctly, the problem is that matrixStats uses isMatrix to decide if it can handle an argument. However, the isMatrix function only ever checks if the object has a 2d dim attribute (R-exts): ie. it is also TRUE for objects of class table even though they don't have class matrix.

I see that you chose to use a union class of table, array, numeric to broaden the applicability of the S4 generics. I wonder if there is an alternative that sticks closer to the way that matrixStats checks the argument, so that we don't have to play catch-up and update the union class whenever somebody wants to use MatrixGenerics with an object that is 2d, but not of class table or array?

I am not quite sure how this would work best with the S4 class and dispatch system, so I would trust your judgement Hervé if your say this is not easily possible. However, in that case I wonder if it made sense to rename the class to something more generic (e.g. matrixStats_acceptable_input_class instead of matrix_OR_array_OR_table_OR_numeric) so that we don't have to rename each instance across all 70 files when we add a new class to the union.

hpages commented 4 years ago

It's only 77 occurrences of matrix_OR_array_OR_table_OR_numeric in 37 files! ;-)

Good point about the name of the class. I thought of a name like matrixStats_acceptable_input_class that wouldn't need to be changed every time we find out that there is a new kind of object that can be handled by matrixStats. However I was worried that this would look awkward from an end-user point of view (the name of the class appears in the man pages in the Usage section) and possibly confusing. I find that an explicit name like matrix_OR_array_OR_table_OR_numeric is more user-friendly.

Also, even though I can't be 100% sure, I'm hoping that with the addition of table we are now covering all the base things that could land in matrixStats arms. In the unlikely event that we are not, I'll add the new member to the class union and rename the class. Replacing the occurrences of matrix_OR_array_OR_table_OR_numeric in all the files can be done programmatically (with sed) so is very easy (and much easier than adding 72 setMethod statements if we were not using a union class).

LTLA commented 4 years ago

Back in the old days for BiocNeighbors (I don't remember exactly which package), I did something like:

spawnMethods(generic, FUN) {
    for (cls in c("matrix", "numeric", "table")) {
        setMethod(generic, cls, FUN)
    }
}

# to be used like:
spawnMethods("colVars", .matrixStats_colVars)
spawnMethods("rowVars", .matrixStats_rowVars)

I think there was some more fiddling required for the where=, but this gave me methods for all classes without much typing. Biggest problem was that roxygen wasn't happy with it, but I wasn't using it at the time anyway.

hpages commented 4 years ago

Right, we have setMethods in S4Vectors that does what spawnMethods does but using something like this in this case would mean generating 72 x 4 methods instead of 72. That sounds like a lot of clutter in the method tables. It would also mean ending up with a lot of clutter in the Usage sections of the 36 man pages because roxygen2 would then report 8 methods per man page instead of 2 only. Note that all the methods in a given man page have exactly the same argument list so this level of repetition is not particularly useful.

LTLA commented 4 years ago

For roxygen, I would imagine that it's actually the opposite problem; it might struggle to link the methods to the documentation if everything is happening inside spawnMethods(). Which, in this case, might be a blessing in disguise as you don't get the clutter in the docs for all of the individual methods for a given generic; but you would still need to do some @eval shenanigans to automatically populate each Rd file with the \alias{} for all methods.

hpages commented 4 years ago

I just tried this and it seems that roxygen2 indeed wouldn't "see" the methods so they don't show up in the Usage section of the man page, only the generic. Also the aliases for the methods don't get added to the Rd file so we end up with a bunch of R CMD check warnings e.g.:

* checking for missing documentation entries ... WARNING
Undocumented S4 methods:
  generic 'rowVars' and siglist 'array'
  generic 'rowVars' and siglist 'matrix'
  generic 'rowVars' and siglist 'numeric'
  generic 'rowVars' and siglist 'table'
  ... etc ... (full list would be 72 x 4 long)

Maybe there is a way to "trick" roxygen2 into doing the right thing, I don't know.

In any case it doesn't seem that a setMethods/spawnMethods solution would make the content of the man pages particularly easy to control or the package particularly easy to maintain. And we would still bloat the method tables with a myriad of redundant methods.

@PeteHaitch are you ok with the current proposal? If so I'll merge this week.

PeteHaitch commented 4 years ago

I'm on leave until Oct 13, so I'm not trying things out for myself but just going on what's posted here in the PR. That said, it generally looks good to me so I'm happy for you to merge when you like.

hpages commented 4 years ago

ok, thanks all for the feedback

hpages commented 4 years ago

Wait! And what about factors? Seems like matrixStats accepts them (even though with some gotcha), Well, turns out that in that case we are lucky:

library(MatrixGenerics)
x <- factor(c("z", "b"), levels=letters)
m <- `dim<-`(x, 2:1)  # factor matrix
a <- `dim<-`(x, c(1:2, 1))  # factor array

is(x, "matrix_OR_array_OR_table_OR_numeric")
# [1] TRUE

is(m, "matrix_OR_array_OR_table_OR_numeric")
# [1] TRUE

is(a, "matrix_OR_array_OR_table_OR_numeric")
# [1] TRUE

And this despite the following:

is(x, "matrix") || is(x, "array") || is(x, "table") || is(x, "numeric")
# [1] FALSE

is(m, "matrix") || is(m, "array") || is(m, "table") || is(m, "numeric")
# [1] FALSE

is(a, "matrix") || is(a, "array") || is(a, "table") || is(a, "numeric")
# [1] FALSE

So it works only by chance and because we are in one of those S4 dark corners. No guarantees that this will still work in the future (e.g. when this mess gets fixed in the methods package).

Anyways, I don't expect people to call MatrixGenerics functions on their factors in normal situations. Also it's not clear that the matrixStats package intendedly accepts them.