Bioconductor / MatrixGenerics

S4 Generic Summary Statistic Functions that Operate on Matrix-Like Objects
https://bioconductor.org/packages/MatrixGenerics
12 stars 1 forks source link

Where should these generics live? #2

Open PeteHaitch opened 6 years ago

PeteHaitch commented 6 years ago

matrixStats 'owns' much of the API we (DelayedMatrixStats, matter, etc.) are trying to follow/mimic. The remainder is defined in base (e.g, colSums()); there may be other generics we wish to add (e.g., prcomp()). However, for performance reasons matrixStats uses ordinary functions rather than S4 (or S3). We want generic versions of these functions so that we can write methods that operate on matrix-like data structure; within Bioconductor, this generally means S4 generics but it may be worth also supporting S3 generics.

Based on the discussion in https://stat.ethz.ch/pipermail/bioc-devel/2017-November/012273.html, I think our options are:

  1. Putting the generics in a Bioconductor package: a. A stand-alone MatrixGenerics package b. Sticking them in BiocGenerics
  2. Putting the generics in a CRAN package: a. A stand-alone MatrixGenerics package c. Sticking them in the matrixStats package
  3. Putting the generics in a required/recommended package: a. Sticking them in the Matrix package b. Sticking them in the stats4 package c. Sticking them in the the methods package
LTLA commented 6 years ago

Something like 3a sounds like the way to go, if the Matrix maintainers are on board. All downstream packages would strictly provide method implementations and no further generics - this would include matrixStats. Performance-wise, I can't imagine the time spent in S4 dispatch is all that much, so I can't see that switching from raw functions to S4 is a big deal.

mtmorgan commented 6 years ago

Better to check with @HenrikBengtsson about performance issues; my guess is that he's unwilling (rationally or otherwise!) to accept the cost. Also @mmaechler will of course have a significant stake in this...

mmaechler commented 6 years ago

I'm not yet ready to delve into the discussion... but I tend to agree that there's a need for a "central" place for S4 generics. Are you aware of implicitGenerics (in package 'methods')? We had added quite a few there... but of course only for functions existing in base (or "base packages"). For other matrix-specific functions, I'd tend to agree it would make sense to have them in (and subsequently "from") Matrix.

PeteHaitch commented 6 years ago

Are you aware of implicitGenerics (in package 'methods')?

I'm aware, but unsure if I'm using them correctly (cf. https://github.com/Bioconductor/MatrixGenerics/issues/3)

HenrikBengtsson commented 6 years ago

My take on this: Overhead of method dispatching, and also ::, is significant at some point and to some users in some use cases, e.g. millions of calculations of non-large matrices. You don't want to force unnecessary overhead on developers when it comes to time-critical processing. This is one of the objectives/roles of the matrixStats package and there is a need for its highly performance-optimized implementation. If one ignores this, someone will be waiting around the corner to implement matrixStats2 without such overhead.

The solution? Two APIs - one with minimum overhead and one that supports method dispatching. The two APIs can be separated by two different naming conventions(*). The dispatching API can default to calling the minimum-overhead one where possible (here the matrixStats API).

(*) In base R there are already a few cases of this where one functions come with minimal overhead and another with bells and whistles because developers/users need access to both, e.g. .colSums() vs colSums().