fsolt / dotwhisker

Dot-and-Whisker Plots of Regression Results
https://fsolt.org/dotwhisker/
Other
57 stars 10 forks source link

by_2sd fails with model including interactions and non-standard contrasts #82

Open bbolker opened 6 years ago

bbolker commented 6 years ago

when there are interactions in the model, by_2sd tries to find the relevant columns of the model frame by parsing the names of the components in the interaction. This fails when contrasts have been set and the names are no longer as naively expected ...

The example uses data from the glmmTMB package because it was what I was working with, but any model with an interaction (of two factors) should cause the same problem.

library(dotwhisker)
data(Owls,package="glmmTMB")
owls_lm1 <- lm(SiblingNegotiation~FoodTreatment*SexParent,
               contrasts=list(FoodTreatment="contr.sum",
                              SexParent="contr.sum"),
               Owls)
dwplot(list(owls_lm1))  ## failure
t1 <- broom::tidy(owls_lm1, conf.int = TRUE)
dotwhisker::by_2sd(t1,model.frame(owls_lm1))
##  Error in `[[<-.data.frame`(`*tmp*`, paste0(first, ":", second), value = integer(0)) : 
  replacement has 0 rows, data has 599

I may work on the fix, but in the meantime I wanted to document this.

bbolker commented 6 years ago

*WARNING: brain dump.

I think this may be a much more general problem whenever there are complex/generated columns in the model (e.g. log(x), poly(x,2), splines::ns(x,5) ...). This should be documented and possibly throw a warning (if (sum(grepl("poly|ns",names(dataset))>0) warning("complex parameters are scaled individually; see 'Details' in ?by_2sd"))

Suppose we have predictor variables (by which I mean variables that are present in the original data set) x and y and input variables (which are the variables that are constructed from the predictor variables, and present in the model matrix: these include dummy variables constructed from factors (irrelevant in this case because binary), spline and orthogonal polynomial basis vectors, transformed variables such as log(x), interaction variables ... terms like log(), poly(), and ns() are in the model frame as well as the model matrix. In this case it's arguable that we ought to be scaling the predictor variables by their 2SD, rather than scaling the input variables one at a time.

To be concrete, if we have x and y (numeric) and x:y, we ought to scale the x:y parameter by (2*SD(x)*2*SD(y)), not (2*SD(x*y)) (the latter is what the code currently does). A specific solution to this would be to form the product after scaling all the variables you could find in the data set.

fsolt commented 6 years ago

Thanks for this, Ben. I've been vaguely thinking of this problem (without really devoting any time to it) since this tweet by @leeper last month: https://twitter.com/thosjleeper/status/997127610072100864

Thanks, too and especially, for actually doing something about it. I'll look over your PR tomorrow.

bbolker commented 5 years ago

Thanks for merging the PR. Based on more conversation with my colleague JD, I think standardizing column-by-column may be fine (at least defensible and possibly even preferable to input-variable standardization). However, I think this warrants a bit more explication somewhere in the docs before closing this issue ...

alexpghayes commented 5 years ago

Two misc thoughts:

  1. recipes should make it easy to specify whether you want to standardize before or after transforming predictors.
  2. Parsing column names to determine if ns or poly terms are present may be brittle. recipes has some tools for this as well (although I don't recall what they are off the top of my head). Since spline expansions typically generate matrix columns, you may be able to do a more robust check with:

library(splines) library(purrr)

fit <- lm(mpg ~ ns(hp, 4), mtcars) mf <- model.frame(fit) pred_classes <- flatten_chr(map(mf, class)) pred_classes

> [1] "numeric" "ns" "basis" "matrix"

"matrix" %in% pred_classes

> [1] TRUE



Created on 2018-07-31 by the [reprex
package](http://reprex.tidyverse.org) (v0.2.0).
bbolker commented 5 years ago

For what it's worth, Schielzeth 2010 ("Simple means to improve the interpretability ..." Methods in Ecology and Evolution says

Note that one should always transform the input variables and not the predictors (Gelman 2008).

I haven't followed this up by looking at Gelman's paper (Statistics in Medicine, "Scaling regression inputs") yet.

IndrajeetPatil commented 5 years ago

I am working with the development version of dotwhisker with the https://github.com/fsolt/dotwhisker/pull/83 PR merged, but the by_2sd function still doesn't seem to work for interactions.

set.seed(123)

# models
mod1 <- stats::lm(data = iris, formula = Sepal.Length ~ Sepal.Width)
mod2 <- stats::lm(data = iris, formula = Sepal.Length ~ Species)
mod3 <- stats::lm(data = iris, formula = Sepal.Length ~ Sepal.Width * Species)

# broom outputs
df1 <- broom::tidy(mod1)
df2 <- broom::tidy(mod2)
df3 <- broom::tidy(mod3)

# trying to get the standardized estimates
dotwhisker::by_2sd(df1, mod1$model)
#> # A tibble: 2 x 6
#>   term        estimate std.error statistic  p.value by_2sd
#>   <chr>          <dbl>     <dbl>     <dbl>    <dbl> <lgl> 
#> 1 (Intercept)    6.53      0.479     13.6  6.47e-28 TRUE  
#> 2 Sepal.Width   -0.195     0.135     -1.44 1.52e- 1 TRUE

dotwhisker::by_2sd(df2, mod2$model)
#> # A tibble: 3 x 6
#>   term              estimate std.error statistic   p.value by_2sd
#>   <chr>                <dbl>     <dbl>     <dbl>     <dbl> <lgl> 
#> 1 (Intercept)           5.01    0.0728     68.8  1.13e-113 TRUE  
#> 2 Speciesversicolor     0.93    0.103       9.03 8.77e- 16 TRUE  
#> 3 Speciesvirginica      1.58    0.103      15.4  2.21e- 32 TRUE

dotwhisker::by_2sd(df3, mod3$model)
#> Error in `[[<-.data.frame`(`*tmp*`, paste0(first, ":", second), value = numeric(0)): replacement has 0 rows, data has 150

Created on 2018-12-09 by the reprex package (v0.2.1)

bbolker commented 5 years ago

tl;dr by_2sd expects to receive a data frame where each of the terms in the tidy object is represented by a column; as.data.frame(model.matrix(mod3)) satisfies this condition, but mod$model (or better, model.frame(mod3)) doesn't ...

I agree that this is unclear, but I think it's essentially a documentation issue (requiring a documentation patch). At present, by_2sd standardizes predictor variables (i.e. columns of the model frame), rather than input variables (raw/observed values). There's lots of discussion in the thread above about whether this is the right thing to do, and what the technical challenges are in doing it either way. Note that dotwhisker::dwplot(mod3) works fine, because dwplot internally uses the following function to extract the data frame to use

get_dat <- function(x) {
        tryCatch(as.data.frame(model.matrix(x)),
                 error=function(e) model.frame(x))
}

i.e., it tries to get the model matrix and falls back to the model frame if it's unavailable. (I don't know whether I argued it in the thread above, but it's probably a good idea to return an informative error message that detects when columns are missing from data and gives the user some hints about what to do ...)

bbolker commented 5 years ago

PS re @fsolt's 31 July comment; matrix-valued input "variables" are one issue, but this problem is probably more general ...