Open bbolker opened 6 years ago
*WARNING: brain dump.
I think this may be a much more general problem whenever there are complex/generated columns in the model (e.g. log(x)
, poly(x,2)
, splines::ns(x,5)
...). This should be documented and possibly throw a warning (if (sum(grepl("poly|ns",names(dataset))>0) warning("complex parameters are scaled individually; see 'Details' in ?by_2sd")
)
Suppose we have predictor variables (by which I mean variables that are present in the original data set) x
and y
and input variables (which are the variables that are constructed from the predictor variables, and present in the model matrix: these include dummy variables constructed from factors (irrelevant in this case because binary), spline and orthogonal polynomial basis vectors, transformed variables such as log(x)
, interaction variables ... terms like log()
, poly()
, and ns()
are in the model frame as well as the model matrix. In this case it's arguable that we ought to be scaling the predictor variables by their 2SD, rather than scaling the input variables one at a time.
To be concrete, if we have x
and y
(numeric) and x:y
, we ought to scale the x:y
parameter by (2*SD(x)*2*SD(y))
, not (2*SD(x*y))
(the latter is what the code currently does). A specific solution to this would be to form the product after scaling all the variables you could find in the data set.
arm::standardize
does scale predictor rather than input variables (it looks for variable names in the formula and standardizes them in the original data).poly(x,3)
(linear, quadratic, cubic terms) should be divided by (2*SD)^(1:3)
? I'm not sure about spline bases ... in contrast, log(x)
should be scaled by 2*SD(log(x))
; scaling by 2*SD(x)
would be wrong.Thanks for this, Ben. I've been vaguely thinking of this problem (without really devoting any time to it) since this tweet by @leeper last month: https://twitter.com/thosjleeper/status/997127610072100864
Thanks, too and especially, for actually doing something about it. I'll look over your PR tomorrow.
Thanks for merging the PR. Based on more conversation with my colleague JD, I think standardizing column-by-column may be fine (at least defensible and possibly even preferable to input-variable standardization). However, I think this warrants a bit more explication somewhere in the docs before closing this issue ...
Two misc thoughts:
recipes
should make it easy to specify whether you want to standardize before or after transforming predictors.ns
or poly
terms are present may be brittle. recipes
has some tools for this as well (although I don't recall what they are off the top of my head). Since spline expansions typically generate matrix columns, you may be able to do a more robust check with:
library(splines) library(purrr)
fit <- lm(mpg ~ ns(hp, 4), mtcars) mf <- model.frame(fit) pred_classes <- flatten_chr(map(mf, class)) pred_classes
"matrix" %in% pred_classes
Created on 2018-07-31 by the [reprex
package](http://reprex.tidyverse.org) (v0.2.0).
For what it's worth, Schielzeth 2010 ("Simple means to improve the interpretability ..." Methods in Ecology and Evolution says
Note that one should always transform the input variables and not the predictors (Gelman 2008).
I haven't followed this up by looking at Gelman's paper (Statistics in Medicine, "Scaling regression inputs") yet.
I am working with the development version of dotwhisker
with the https://github.com/fsolt/dotwhisker/pull/83 PR merged, but the by_2sd
function still doesn't seem to work for interactions.
set.seed(123)
# models
mod1 <- stats::lm(data = iris, formula = Sepal.Length ~ Sepal.Width)
mod2 <- stats::lm(data = iris, formula = Sepal.Length ~ Species)
mod3 <- stats::lm(data = iris, formula = Sepal.Length ~ Sepal.Width * Species)
# broom outputs
df1 <- broom::tidy(mod1)
df2 <- broom::tidy(mod2)
df3 <- broom::tidy(mod3)
# trying to get the standardized estimates
dotwhisker::by_2sd(df1, mod1$model)
#> # A tibble: 2 x 6
#> term estimate std.error statistic p.value by_2sd
#> <chr> <dbl> <dbl> <dbl> <dbl> <lgl>
#> 1 (Intercept) 6.53 0.479 13.6 6.47e-28 TRUE
#> 2 Sepal.Width -0.195 0.135 -1.44 1.52e- 1 TRUE
dotwhisker::by_2sd(df2, mod2$model)
#> # A tibble: 3 x 6
#> term estimate std.error statistic p.value by_2sd
#> <chr> <dbl> <dbl> <dbl> <dbl> <lgl>
#> 1 (Intercept) 5.01 0.0728 68.8 1.13e-113 TRUE
#> 2 Speciesversicolor 0.93 0.103 9.03 8.77e- 16 TRUE
#> 3 Speciesvirginica 1.58 0.103 15.4 2.21e- 32 TRUE
dotwhisker::by_2sd(df3, mod3$model)
#> Error in `[[<-.data.frame`(`*tmp*`, paste0(first, ":", second), value = numeric(0)): replacement has 0 rows, data has 150
Created on 2018-12-09 by the reprex package (v0.2.1)
tl;dr by_2sd
expects to receive a data frame where each of the terms in the tidy
object is represented by a column; as.data.frame(model.matrix(mod3))
satisfies this condition, but mod$model
(or better, model.frame(mod3)
) doesn't ...
I agree that this is unclear, but I think it's essentially a documentation issue (requiring a documentation patch). At present, by_2sd
standardizes predictor variables (i.e. columns of the model frame), rather than input variables (raw/observed values). There's lots of discussion in the thread above about whether this is the right thing to do, and what the technical challenges are in doing it either way. Note that dotwhisker::dwplot(mod3)
works fine, because dwplot
internally uses the following function to extract the data frame to use
get_dat <- function(x) {
tryCatch(as.data.frame(model.matrix(x)),
error=function(e) model.frame(x))
}
i.e., it tries to get the model matrix and falls back to the model frame if it's unavailable. (I don't know whether I argued it in the thread above, but it's probably a good idea to return an informative error message that detects when columns are missing from data
and gives the user some hints about what to do ...)
PS re @fsolt's 31 July comment; matrix-valued input "variables" are one issue, but this problem is probably more general ...
when there are interactions in the model,
by_2sd
tries to find the relevant columns of the model frame by parsing the names of the components in the interaction. This fails when contrasts have been set and the names are no longer as naively expected ...The example uses data from the
glmmTMB
package because it was what I was working with, but any model with an interaction (of two factors) should cause the same problem.I may work on the fix, but in the meantime I wanted to document this.