HenrikBengtsson / matrixStats

R package: Methods that Apply to Rows and Columns of Matrices (and to Vectors)
https://cran.r-project.org/package=matrixStats
202 stars 33 forks source link

BUG: weightedVar() returning incorrect (including negative) values #105

Open PeteHaitch opened 7 years ago

PeteHaitch commented 7 years ago

Hi Henrik,

I encountered a situation where matrixStats::colWeightedSds() was giving me NaN variances. After doing some digging it appears this is due to matrixStats::weightedVar() returning negative variance estimates. Here's a minimal reproducible example.

I also compared matrixStats::weightedVar() to Hmisc::wtd.var() as my understanding is these should give the same result, right?

Thanks, Pete

n <- 3
x <- seq_len(n)
w <- c(0.1, 0.2, 0.6)
w2 <- w / min(w)
w3 <- w / sum(w)

matrixStats::weightedVar(x = x, w = w)
#> [1] -4.222222
matrixStats::weightedVar(x = x, w = w2)
#> [1] 0.5277778
matrixStats::weightedVar(x = x, w = w3)
#> [1] Inf

Hmisc::wtd.var(x = x, weights = w)
#> [1] 0.95
Hmisc::wtd.var(x = x, weights = w2)
#> [1] 0.95
Hmisc::wtd.var(x = x, weights = w3)
#> [1] 0.95
Session info ``` r devtools::session_info() #> Session info ------------------------------------------------------------- #> setting value #> version R version 3.4.1 (2017-06-30) #> system x86_64, darwin15.6.0 #> ui X11 #> language (EN) #> collate en_AU.UTF-8 #> tz America/New_York #> date 2017-08-27 #> Packages ----------------------------------------------------------------- #> package * version date source #> acepack 1.4.1 2016-10-29 CRAN (R 3.4.0) #> backports 1.1.0 2017-05-22 CRAN (R 3.4.0) #> base * 3.4.1 2017-07-07 local #> base64enc 0.1-3 2015-07-28 CRAN (R 3.4.0) #> checkmate 1.8.3 2017-07-03 CRAN (R 3.4.1) #> cluster 2.0.6 2017-03-10 CRAN (R 3.4.1) #> colorspace 1.3-2 2016-12-14 CRAN (R 3.4.0) #> compiler 3.4.1 2017-07-07 local #> data.table 1.10.4 2017-02-01 CRAN (R 3.4.0) #> datasets * 3.4.1 2017-07-07 local #> devtools 1.13.3 2017-08-02 CRAN (R 3.4.1) #> digest 0.6.12 2017-01-27 CRAN (R 3.4.0) #> evaluate 0.10.1 2017-06-24 CRAN (R 3.4.0) #> foreign 0.8-69 2017-06-22 CRAN (R 3.4.1) #> Formula 1.2-2 2017-07-10 CRAN (R 3.4.1) #> ggplot2 2.2.1 2016-12-30 CRAN (R 3.4.0) #> graphics * 3.4.1 2017-07-07 local #> grDevices * 3.4.1 2017-07-07 local #> grid 3.4.1 2017-07-07 local #> gridExtra 2.2.1 2016-02-29 CRAN (R 3.4.0) #> gtable 0.2.0 2016-02-26 CRAN (R 3.4.0) #> Hmisc 4.0-3 2017-05-02 CRAN (R 3.4.0) #> htmlTable 1.9 2017-01-26 CRAN (R 3.4.0) #> htmltools 0.3.6 2017-04-28 CRAN (R 3.4.0) #> htmlwidgets 0.9 2017-07-10 CRAN (R 3.4.0) #> knitr 1.17 2017-08-10 CRAN (R 3.4.1) #> lattice 0.20-35 2017-03-25 CRAN (R 3.4.1) #> latticeExtra 0.6-28 2016-02-09 CRAN (R 3.4.0) #> lazyeval 0.2.0 2016-06-12 CRAN (R 3.4.0) #> magrittr 1.5 2014-11-22 CRAN (R 3.4.0) #> Matrix 1.2-11 2017-08-16 CRAN (R 3.4.1) #> matrixStats 0.52.2 2017-04-14 CRAN (R 3.4.0) #> memoise 1.1.0 2017-05-26 Github (hadley/memoise@e372cde) #> methods * 3.4.1 2017-07-07 local #> munsell 0.4.3 2016-02-13 CRAN (R 3.4.0) #> nnet 7.3-12 2016-02-02 CRAN (R 3.4.1) #> plyr 1.8.4 2016-06-08 CRAN (R 3.4.0) #> RColorBrewer 1.1-2 2014-12-07 CRAN (R 3.4.0) #> Rcpp 0.12.12 2017-07-15 CRAN (R 3.4.0) #> rlang 0.1.2 2017-08-09 CRAN (R 3.4.1) #> rmarkdown 1.6 2017-06-15 CRAN (R 3.4.0) #> rpart 4.1-11 2017-03-13 CRAN (R 3.4.1) #> rprojroot 1.2 2017-01-16 CRAN (R 3.4.0) #> scales 0.4.1 2016-11-09 CRAN (R 3.4.0) #> splines 3.4.1 2017-07-07 local #> stats * 3.4.1 2017-07-07 local #> stringi 1.1.5 2017-04-07 CRAN (R 3.4.0) #> stringr 1.2.0 2017-02-18 CRAN (R 3.4.0) #> survival 2.41-3 2017-04-04 CRAN (R 3.4.1) #> tibble 1.3.4 2017-08-22 CRAN (R 3.4.1) #> tools 3.4.1 2017-07-07 local #> utils * 3.4.1 2017-07-07 local #> withr 2.0.0 2017-07-28 CRAN (R 3.4.1) #> yaml 2.1.14 2016-11-12 CRAN (R 3.4.0) ```
HenrikBengtsson commented 7 years ago

Thanks Pete, I can reproduce. It certainly looks like a bug to me.

HenrikBengtsson commented 6 years ago

Just a heads up; I don't think this is not really a bug in the code per se, but rather a oversight in this weighted estimator. After briefly thinking about it, it's likely one wants to us different types of estimators depending on what type of weights one uses and whether the estimator should be unbiased or not. Some serious thoughts needs to go into this and ideal we should find prior work on this, i.e. locate trustworthy sources that provide well studied estimators. Because of this, I would hold off using matrixStats::weightedVar() until resolved.

PeteHaitch commented 6 years ago

FWIW, Hmisc::wtd.var() ultimately calls stats::cov.wt() after some processing of the weights. Unfortunately, in a cursory look the GitHub mirror of the R source, I couldn't find much about who implemented stats::cov.wt() or a reference.

HenrikBengtsson commented 6 years ago

A reference is https://en.wikipedia.org/wiki/Weighted_arithmetic_mean#Weighted_sample_variance which covers estimators for when the weights are (i) "frequency" weights, and when they are (ii) "reliability" / "normalization" weights. I believe this is also what Frank Harrell talks about his ?Hmisc::wtd.var.

HenrikBengtsson commented 6 years ago

A few more notes on this: so, there is no one estimator for the standard error of the weighted mean. Several different has been proposed (cf. Gatz & Smith (1995) and they all have different properties. Because of this, weightedVar() needs to implement different estimators.

The question is then (a) which set of them, (b) which should be the default, and (c) how to specify which estimator? As a started I think (a) weightedVar() should support the estimators that Hmisc::wtd.var() supports, and (b) use the same default. The rationale for that is based on trust in Harrell's work. For (c), contrary to Hmisc::wtd.var(), we might use a single argument called method (character string) to specify what type of estimator. That will allow us to add more exotic estimators in the future, or emulate other software (e.g. WinCross, Quantum, SPSS) if not already supported by existing methods.

PeteHaitch commented 6 years ago

Agreed on (a) and (b). Not sure I understand (c); Hmis::wtd.var() has a method argument

> args(Hmisc::wtd.var)
function (x, weights = NULL, normwt = FALSE, na.rm = TRUE, method = c("unbiased", 
    "ML")) 
NULL

In any case, right now I have no strong opinion on the argument names

HenrikBengtsson commented 6 years ago

Yeah, but normwt also comes in play. So, that should (probably) be incorporated in method.

HenrikBengtsson commented 6 years ago

Now... in the Hmisc update 4.0-3 => 4.1-0, they actually made the default of wtd.var() work identically with weightedVar(). Example:

n <- 3
x <- seq_len(n)
w <- c(0.1, 0.2, 0.6)
w2 <- w / min(w)
w3 <- w / sum(w)

s2m1 <- matrixStats::weightedVar(x = x, w = w)
s2m2 <- matrixStats::weightedVar(x = x, w = w2)
s2m3 <- matrixStats::weightedVar(x = x, w = w3)

s2h1 <- Hmisc::wtd.var(x = x, weights = w)
s2h2 <- Hmisc::wtd.var(x = x, weights = w2)
s2h3 <- Hmisc::wtd.var(x = x, weights = w3)

stopifnot(s2m1 == s2h1, s2m2 == s2h2, s2m3 == s2h3)

Now, what's more interesting is that this was also the behavior in Hmisc (<= 4.0.2). So, the original issue you reported did only apply to Hmisc 4.0.3, which was available 2017-04-30 -- 2017-12-19, and the version you happened use.

UPDATE 2018-01-21:

HenrikBengtsson commented 6 years ago

Just like in Hmisc (>= 4.1.0), weightedVar(x, w) now produces a warning when the estimate is invalid:

> options(warn = 1)

> n <- 3
> x <- seq_len(n)
> w <- c(0.1, 0.2, 0.6)
> w2 <- w / min(w)
> w3 <- w / sum(w)

> matrixStats::weightedVar(x = x, w = w)
Warning in matrixStats::weightedVar(x = x, w = w) :
  Produced invalid variance estimate, because the weights suggest at most one effective observation (sum(w) <= 1): -4.22222 (wsum = 0.9)
[1] -4.222222

> matrixStats::weightedVar(x = x, w = w2)
[1] 0.5277778

> matrixStats::weightedVar(x = x, w = w3)
Warning in matrixStats::weightedVar(x = x, w = w3) :
  Produced invalid variance estimate, because the weights suggest at most one effective observation (sum(w) <= 1): Inf (wsum = 1)
[1] Inf

Maybe one could argue it should return NaN or similar.

PeteHaitch commented 6 years ago

Interesting that Hmisc went through the same issue! I think this is a good solution for matrixStats.

I find it a little surprising that the default option of Hmisc::wtd.var() is normwt = FALSE which consequently means that scaling the weights can lead to (vastly) different estimates and even 'invalid' estimates. But clearly people who think about weighting will know better than me. What do you think about adding a normwt-type argument to matrixStats::weightedVar()?

HenrikBengtsson commented 6 years ago

Since the current implementation is no longer a "bug" per se, I'll push forward to release the next version of the package without further modifications to this function.

However, I'm certainly open to add support for alternative estimators here. We need to strive carefully though and understanding what we're adding since adding features is easy but removing them is complicated.

So, I think identifying and understand estimators is key. Then we can start adding them one by one. Adding support for the ones in Hmisc makes sense.

PeteHaitch commented 6 years ago

Sounds good. Do you have an ETA on the next release?

HenrikBengtsson commented 6 years ago

I'm doing the final cleanup (#119) and then I'll be running the rev dep checks on ~160 packages. If all go well I'll submit to CRAN after that.

HenrikBengtsson commented 6 years ago

matrixStats 0.53.0 is now on CRAN