Open PeteHaitch opened 7 years ago
Thanks Pete, I can reproduce. It certainly looks like a bug to me.
Just a heads up; I don't think this is not really a bug in the code per se, but rather a oversight in this weighted estimator. After briefly thinking about it, it's likely one wants to us different types of estimators depending on what type of weights one uses and whether the estimator should be unbiased or not. Some serious thoughts needs to go into this and ideal we should find prior work on this, i.e. locate trustworthy sources that provide well studied estimators. Because of this, I would hold off using matrixStats::weightedVar()
until resolved.
FWIW, Hmisc::wtd.var()
ultimately calls stats::cov.wt()
after some processing of the weights. Unfortunately, in a cursory look the GitHub mirror of the R source, I couldn't find much about who implemented stats::cov.wt()
or a reference.
A reference is https://en.wikipedia.org/wiki/Weighted_arithmetic_mean#Weighted_sample_variance which covers estimators for when the weights are (i) "frequency" weights, and when they are (ii) "reliability" / "normalization" weights. I believe this is also what Frank Harrell talks about his ?Hmisc::wtd.var
.
A few more notes on this: so, there is no one estimator for the standard error of the weighted mean. Several different has been proposed (cf. Gatz & Smith (1995) and they all have different properties. Because of this, weightedVar()
needs to implement different estimators.
The question is then (a) which set of them, (b) which should be the default, and (c) how to specify which estimator? As a started I think (a) weightedVar()
should support the estimators that Hmisc::wtd.var()
supports, and (b) use the same default. The rationale for that is based on trust in Harrell's work. For (c), contrary to Hmisc::wtd.var()
, we might use a single argument called method
(character string) to specify what type of estimator. That will allow us to add more exotic estimators in the future, or emulate other software (e.g. WinCross, Quantum, SPSS) if not already supported by existing methods.
Agreed on (a) and (b). Not sure I understand (c); Hmis::wtd.var()
has a method
argument
> args(Hmisc::wtd.var)
function (x, weights = NULL, normwt = FALSE, na.rm = TRUE, method = c("unbiased",
"ML"))
NULL
In any case, right now I have no strong opinion on the argument names
Yeah, but normwt
also comes in play. So, that should (probably) be incorporated in method
.
Now... in the Hmisc update 4.0-3 => 4.1-0, they actually made the default of wtd.var()
work identically with weightedVar()
. Example:
n <- 3
x <- seq_len(n)
w <- c(0.1, 0.2, 0.6)
w2 <- w / min(w)
w3 <- w / sum(w)
s2m1 <- matrixStats::weightedVar(x = x, w = w)
s2m2 <- matrixStats::weightedVar(x = x, w = w2)
s2m3 <- matrixStats::weightedVar(x = x, w = w3)
s2h1 <- Hmisc::wtd.var(x = x, weights = w)
s2h2 <- Hmisc::wtd.var(x = x, weights = w2)
s2h3 <- Hmisc::wtd.var(x = x, weights = w3)
stopifnot(s2m1 == s2h1, s2m2 == s2h2, s2m3 == s2h3)
Now, what's more interesting is that this was also the behavior in Hmisc (<= 4.0.2). So, the original issue you reported did only apply to Hmisc 4.0.3, which was available 2017-04-30 -- 2017-12-19, and the version you happened use.
UPDATE 2018-01-21:
?weightedVar
clarifies that weightedVar(x = x, w = w) == Hmisc::wtd.var(x = x, weights = w)
, cf. commit 71032554.Just like in Hmisc (>= 4.1.0), weightedVar(x, w)
now produces a warning when the estimate is invalid:
> options(warn = 1)
> n <- 3
> x <- seq_len(n)
> w <- c(0.1, 0.2, 0.6)
> w2 <- w / min(w)
> w3 <- w / sum(w)
> matrixStats::weightedVar(x = x, w = w)
Warning in matrixStats::weightedVar(x = x, w = w) :
Produced invalid variance estimate, because the weights suggest at most one effective observation (sum(w) <= 1): -4.22222 (wsum = 0.9)
[1] -4.222222
> matrixStats::weightedVar(x = x, w = w2)
[1] 0.5277778
> matrixStats::weightedVar(x = x, w = w3)
Warning in matrixStats::weightedVar(x = x, w = w3) :
Produced invalid variance estimate, because the weights suggest at most one effective observation (sum(w) <= 1): Inf (wsum = 1)
[1] Inf
Maybe one could argue it should return NaN
or similar.
Interesting that Hmisc went through the same issue! I think this is a good solution for matrixStats.
I find it a little surprising that the default option of Hmisc::wtd.var()
is normwt = FALSE
which consequently means that scaling the weights can lead to (vastly) different estimates and even 'invalid' estimates. But clearly people who think about weighting will know better than me. What do you think about adding a normwt
-type argument to matrixStats::weightedVar()
?
Since the current implementation is no longer a "bug" per se, I'll push forward to release the next version of the package without further modifications to this function.
However, I'm certainly open to add support for alternative estimators here. We need to strive carefully though and understanding what we're adding since adding features is easy but removing them is complicated.
So, I think identifying and understand estimators is key. Then we can start adding them one by one. Adding support for the ones in Hmisc makes sense.
Sounds good. Do you have an ETA on the next release?
I'm doing the final cleanup (#119) and then I'll be running the rev dep checks on ~160 packages. If all go well I'll submit to CRAN after that.
matrixStats 0.53.0 is now on CRAN
Hi Henrik,
I encountered a situation where
matrixStats::colWeightedSds()
was giving meNaN
variances. After doing some digging it appears this is due tomatrixStats::weightedVar()
returning negative variance estimates. Here's a minimal reproducible example.I also compared
matrixStats::weightedVar()
toHmisc::wtd.var()
as my understanding is these should give the same result, right?Thanks, Pete
Session info
``` r devtools::session_info() #> Session info ------------------------------------------------------------- #> setting value #> version R version 3.4.1 (2017-06-30) #> system x86_64, darwin15.6.0 #> ui X11 #> language (EN) #> collate en_AU.UTF-8 #> tz America/New_York #> date 2017-08-27 #> Packages ----------------------------------------------------------------- #> package * version date source #> acepack 1.4.1 2016-10-29 CRAN (R 3.4.0) #> backports 1.1.0 2017-05-22 CRAN (R 3.4.0) #> base * 3.4.1 2017-07-07 local #> base64enc 0.1-3 2015-07-28 CRAN (R 3.4.0) #> checkmate 1.8.3 2017-07-03 CRAN (R 3.4.1) #> cluster 2.0.6 2017-03-10 CRAN (R 3.4.1) #> colorspace 1.3-2 2016-12-14 CRAN (R 3.4.0) #> compiler 3.4.1 2017-07-07 local #> data.table 1.10.4 2017-02-01 CRAN (R 3.4.0) #> datasets * 3.4.1 2017-07-07 local #> devtools 1.13.3 2017-08-02 CRAN (R 3.4.1) #> digest 0.6.12 2017-01-27 CRAN (R 3.4.0) #> evaluate 0.10.1 2017-06-24 CRAN (R 3.4.0) #> foreign 0.8-69 2017-06-22 CRAN (R 3.4.1) #> Formula 1.2-2 2017-07-10 CRAN (R 3.4.1) #> ggplot2 2.2.1 2016-12-30 CRAN (R 3.4.0) #> graphics * 3.4.1 2017-07-07 local #> grDevices * 3.4.1 2017-07-07 local #> grid 3.4.1 2017-07-07 local #> gridExtra 2.2.1 2016-02-29 CRAN (R 3.4.0) #> gtable 0.2.0 2016-02-26 CRAN (R 3.4.0) #> Hmisc 4.0-3 2017-05-02 CRAN (R 3.4.0) #> htmlTable 1.9 2017-01-26 CRAN (R 3.4.0) #> htmltools 0.3.6 2017-04-28 CRAN (R 3.4.0) #> htmlwidgets 0.9 2017-07-10 CRAN (R 3.4.0) #> knitr 1.17 2017-08-10 CRAN (R 3.4.1) #> lattice 0.20-35 2017-03-25 CRAN (R 3.4.1) #> latticeExtra 0.6-28 2016-02-09 CRAN (R 3.4.0) #> lazyeval 0.2.0 2016-06-12 CRAN (R 3.4.0) #> magrittr 1.5 2014-11-22 CRAN (R 3.4.0) #> Matrix 1.2-11 2017-08-16 CRAN (R 3.4.1) #> matrixStats 0.52.2 2017-04-14 CRAN (R 3.4.0) #> memoise 1.1.0 2017-05-26 Github (hadley/memoise@e372cde) #> methods * 3.4.1 2017-07-07 local #> munsell 0.4.3 2016-02-13 CRAN (R 3.4.0) #> nnet 7.3-12 2016-02-02 CRAN (R 3.4.1) #> plyr 1.8.4 2016-06-08 CRAN (R 3.4.0) #> RColorBrewer 1.1-2 2014-12-07 CRAN (R 3.4.0) #> Rcpp 0.12.12 2017-07-15 CRAN (R 3.4.0) #> rlang 0.1.2 2017-08-09 CRAN (R 3.4.1) #> rmarkdown 1.6 2017-06-15 CRAN (R 3.4.0) #> rpart 4.1-11 2017-03-13 CRAN (R 3.4.1) #> rprojroot 1.2 2017-01-16 CRAN (R 3.4.0) #> scales 0.4.1 2016-11-09 CRAN (R 3.4.0) #> splines 3.4.1 2017-07-07 local #> stats * 3.4.1 2017-07-07 local #> stringi 1.1.5 2017-04-07 CRAN (R 3.4.0) #> stringr 1.2.0 2017-02-18 CRAN (R 3.4.0) #> survival 2.41-3 2017-04-04 CRAN (R 3.4.1) #> tibble 1.3.4 2017-08-22 CRAN (R 3.4.1) #> tools 3.4.1 2017-07-07 local #> utils * 3.4.1 2017-07-07 local #> withr 2.0.0 2017-07-28 CRAN (R 3.4.1) #> yaml 2.1.14 2016-11-12 CRAN (R 3.4.0) ```