HenrikBengtsson / matrixStats

R package: Methods that Apply to Rows and Columns of Matrices (and to Vectors)
https://cran.r-project.org/package=matrixStats
203 stars 33 forks source link

API cleanup: Drop argument `center`? #187

Closed HenrikBengtsson closed 3 years ago

HenrikBengtsson commented 3 years ago

Argument center for rowVars() exists for historical reasons, it causes confusion, it introduces a significant risk for mistakes/bugs, e.g. https://github.com/HenrikBengtsson/matrixStats/issues/183, https://github.com/PeteHaitch/DelayedMatrixStats/issues/68, and https://github.com/const-ae/sparseMatrixStats/issues/13.

The main reason for it existing is that in the beginning all these functions were implemented in plain R and then there was a significant performance gain to reuse an already estimated center point estimate, e.g.

mu <- rowMeans(X)
sd <- rowSds(X, center = mu)

The benefit if this is smaller today, or even non-existing, because when using center we no longer use the native code but the old R code branch.

Due to the problems that come from using an incorrect value of center, should we deprecate/drop center from the API?

Argument center is currently used in:

> grep("center", capture.output(ls.str("package:matrixStats")), value=TRUE)
 [1] "colMads : function (x, rows = NULL, cols = NULL, center = NULL, constant = 1.4826, na.rm = FALSE, dim. = dim(x), ...)  "   
 [2] "colSds : function (x, rows = NULL, cols = NULL, na.rm = FALSE, center = NULL, dim. = dim(x), ...)  "                       
 [3] "colVars : function (x, rows = NULL, cols = NULL, na.rm = FALSE, center = NULL, dim. = dim(x), ...)  "                      
 [4] "colWeightedMads : function (x, w = NULL, rows = NULL, cols = NULL, na.rm = FALSE, constant = 1.4826, center = NULL, ...)  "
 [5] "rowMads : function (x, rows = NULL, cols = NULL, center = NULL, constant = 1.4826, na.rm = FALSE, dim. = dim(x), ...)  "   
 [6] "rowSds : function (x, rows = NULL, cols = NULL, na.rm = FALSE, center = NULL, dim. = dim(x), ...)  "                       
 [7] "rowVars : function (x, rows = NULL, cols = NULL, na.rm = FALSE, center = NULL, dim. = dim(x), ...)  "                      
 [8] "rowWeightedMads : function (x, w = NULL, rows = NULL, cols = NULL, na.rm = FALSE, constant = 1.4826, center = NULL, ...)  "
 [9] "weightedMad : function (x, w = NULL, idxs = NULL, na.rm = FALSE, constant = 1.4826, center = NULL, ...)  "                 
[10] "weightedVar : function (x, w = NULL, idxs = NULL, na.rm = FALSE, center = NULL, ...)  "  

cc/ @const-ae, @hpages, @LTLA, @PeteHaitch

const-ae commented 3 years ago

I would be in favor of deprecating the center argument, because of the misleading result you get when you can pass something else then the mean to center (which people do, e.g. https://github.com/HenrikBengtsson/matrixStats/issues/160, https://github.com/HenrikBengtsson/matrixStats/issues/183)

PeteHaitch commented 3 years ago

Yes, I'd also be in favour of deprecation. If I understand correctly it's both:

  1. Easily misused/misunderstood
  2. No faster and perhaps slower (is this an implementation detail or something that will always be the case?)
HenrikBengtsson commented 3 years ago
  1. No faster and perhaps slower (is this an implementation detail or something that will always be the case?)

matrixStats::rowVars() and friends could certainly pass center down to the native (=C) code and use it there. Right now the native code does a two-pass scan over the data: (i) calculate the sample mean (="center") and then (ii) calculate the variance via (x[i] - center)^2. If center is known, we could skip the first step, so certainly faster and possibly also more likely to get cache hits.

A common use case is that we want to calculate (mean, variance), which means that we often end up calculating the 'mean' twice. One could get around this by providing rowMeansAndVars() but that quickly blows up the API. I think it's also been proposed to support something like rowVars(..., returnMean = TRUE).

const-ae commented 3 years ago

A common use case is that we want to calculate (mean, variance), which means that we often end up calculating the 'mean' twice. One could get around this by providing rowMeansAndVars() but that quickly blows up the API. I think it's also been proposed to support something like rowVars(..., returnMean = TRUE).

I know that speed is an absolute priority for matrixStats, however I personally prefer functions that do one thing and one thing only.

HenrikBengtsson commented 3 years ago

I know that speed is an absolute priority for matrixStats, however I personally prefer functions that do one thing and one thing only.

I guess you could say that several of them already do two things but just they don't share it.

There are other cases where this could be useful, e.g. returning the number of elements used when na.rm=TRUE.

FWIW, the birth of matrixStats came about when we needed to do lots of these (n, mu, sigma, ...) calculations across millions (sic!) of probesets in microarray data. Using the traditional R methods was just way too slow. So, matrixStats provides a way for developers to reach into these common low-level tasks with minimal overhead - things you'd normally turn to C/C++ to implement otherwise. I don't want to close that door because there's always someone out there who finds it useful as we did in the early days. The problem is to find a balance on API coverage and bloat.

On a related note, this low-level performance at the R level is also why I've always said no to wrap it up in S3 generics - it's just too expensive for many of the use cases. I'm sharing this because I think you guys are coming from other use cases where these types of overheads are relatively small compared to the other overheads.

karoliskoncevicius commented 3 years ago

I don't want to close that door because there's always someone out there who finds it useful as we did in the early days.

Sorry to hijack the conversation, but I am really curious - did you eventually move to a better way of doing things (outside of matrixStats), or rather stopped having data with millions of rows, so that the speed stopped being an issue?

HenrikBengtsson commented 3 years ago

Here's a possible deprecation plan:

  1. Add onUse <- getOption("matrixStats.center.onUse", "deprecated") with options to "ignore" and "defunct". Same here, this should also fall back to an environment variable MATRIXSTATS_CENTER_ONUSE. If onUse == "defunct" (will be the default one day), signal .Defunct() immediately. If onUse == "deprecate", then signal .Deprecated() immediately. But, for the latter, I think we should limit it to a few times per session because these functions might be called thousands of times or more in some pipelines and the performance penality would be huge for those pipelines.

  2. Run revdep checks with MATRIXSTATS_CENTER_ONUSE="defunct" will give us a sense of how many of the ~300 CRAN and Bioconductor packages are using the center arguments right now.

  3. Add an optional validation of the center argument, e.g. an internal onIncorrect <- getOption("matrixStats.center.onIncorrect", "ignore") and likely also via an environment variable R_MATRIXSTATS_CENTER_ONINCORRECT. If enabled, then center is compared to the internal estimate. If not the same, then the user has passed the wrong value all along. Produce a warning or an error if onIncorrect is "warning" or "error".

  4. Run revdep checks with R_MATRIXSTATS_CENTER_ONINCORRECT="error" to see if any of the packages in Step 2 have this mistake. This will let us know how bad this is; I'd be surprised if we don't pick something up but it could also be that few people actually use center.

  5. Release new version to CRAN with matrixStats.center.onUse="deprecate" as the default. The deprecation message should also inform that in a future version, matrixStats will also validate the current center (i.e. use matrixStats.center.onIncorrect="warning") by default.

  6. Release another version to CRAN with also matrixStats.center.onIncorrect="warning" being the default. This will catch the attention of additional users - beyond package developers.

  7. Release third version to CRAN with matrixStats.center.onIncorrect="error" being the default.

  8. Finally, release a fourth version to CRAN where argument center is fully defunct (=.Defunct()).

How fast this can be done will depend on how many package developers as well as end-users rely on center and how many of them are doing it correctly versus incorrectly. I guess, pulling the bandage on those who use it incorrectly shouldn't be too bad - it'll help them anyway.

It might also be that someone has a strong case to support center. If so, that should then probably be implemented in the native code. It should also be documented clearly what values that argument must and must not take. It would probably also be sound to rename and "hide-away" this argument to minimize any mistakes.

HenrikBengtsson commented 3 years ago

Sorry to hijack the conversation, but I am really curious - did you eventually move to a better way of doing things (outside of matrixStats), or rather stopped having data with millions of rows, so that the speed stopped being an issue?

It wasn't one table of millions of rows, the data came from outside in the form of millions of small matrix-like structures (e.g. Affymetrix data; see the affxparser on Bioconductor). We did not have control of the format of these input data. The total amount of data was/is also larger than RAM, so we couldn't/can't just load it all in and reshuffle. This type of data was/is used in many different pipelines and packages, so there was never an option to solve the data format once and for all (say using say a database) - it's a constant moving target, especially in the world of science where new models and methods are constantly developed and investigated.

hpages commented 3 years ago

when using center we no longer use the native code but the old R code branch.

I think this is actually at the root of all the confusion, misunderstandings, frustration, and sadness, around this issue ;-)

It might also be that someone has a strong case to support center. If so, that should then probably be implemented in the native code. It should also be documented clearly what values that argument must and must not take. It would probably also be sound to rename and "hide-away" this argument to minimize any mistakes.

I don't have a strong use case to support center but I wonder how hard it would be to implement it in the native code. Since the native implementation is already using the primary formula to calculate the variance, which is also its definition:

center <- mean(x)
sum((x - center)^2) / (length(x) - 1)

and not the alternate formula:

center <- mean(x)
sum(x^2 - center^2) / (length(x) - 1)

then it doesn't seem that it would be too hard to do.

This would:

  1. Provide a single path with a single formula for computing the "variance with arbitrary center" whatever the center is (i.e. supplied by the user or not).
  2. Provide the performance benefit that most of us were actually assuming when center is set to the pre-calculated rowMeans.
  3. Extend the scope of rowVars() to arbitrary centers. The most common use case is to use it with center=0 but it would work with any center and return something that's predictable/documented. There's not "wrong" center as long as the treatment of center is well documented and the user knows what they are doing. In particular, because rowVars() would now always use the primary formula, it would never return these shocking negative values that the current "dual path/dual formula" implementation sometimes returns.

Just my 2 cents.

HenrikBengtsson commented 3 years ago

Yes, an alternative path is to tighten up the definition of center. Using the "primary" formula is more in line with how the other methods, e.g. rowMads(). (FWIW, the "alternative" formula was probably chosen because it was more speed or memory efficient).

Note, changing from the alternative to primary formula requires making sure that no one is relying on the alternative formula being implemented.

As I mentioned in https://github.com/HenrikBengtsson/matrixStats/issues/183#issuecomment-705169785, we might see another group of people who wishes to use center for the true parameter value. If so, they need:

sum((x - center)^2) / n

and not

sum((x - center)^2) / (n - 1)

Note, if there are missing values dropped, one cannot just "adjust" afterward by (n - 1)/n because we don't know what n is without counting NAs. So, that's another rationale for being able to get ahold of n, which is something matrixStats does not support and has not been discussed much. Personally, there's been times where I wish I could access n.

I still have to think about the center=0 use case. It might be safer to expose sum-of-square functions, which internally call the native code with center = 0.

HenrikBengtsson commented 3 years ago

Ok, so I've made the decision to keep the center argument for at least the new few release cycles. However, I did go ahead and switched out the formula (#160), and I've tightened up the validation of length(center) and deprecated some use cases. This was done to alert developers and users on potentially incorrect usage of center. With this first milestone in place, I think it'll be a bit easier to discuss whether center should stay or not, or it should be renamed to minimize the risk for misuse. FWIW, there are only 3-4 packages out of 307 that make use of the center argument.

HenrikBengtsson commented 3 years ago

As said in my previous comment, we'll keep center for a bit longer (but let's not promote it) after tightening up what values and lengths it can take. The remaining part is whether or not the very special case of rowVars(..., center = 0) should be supported. The plan is to not support it in the next release. That is now tracked in https://github.com/HenrikBengtsson/matrixStats/issues/193. Please use that for discussions around that specific case.

Since center will be with us for several more release cycles, which means quite a long time, I'm closing this one for now. We can revisit the deprecation or renaming of center later/in a new issue.