fsolt / dotwhisker

Dot-and-Whisker Plots of Regression Results
https://fsolt.org/dotwhisker/

by_2sd() equivalent to arm::standardize()? #68

Closed bgall closed 7 years ago

bgall commented 7 years ago

dotwhisker::by_2sd() multiplies the coefficients by two standard deviations of the raw data. Both the CRAN documentation and the documentation here on GitHub link to the Gelman (2008) article, but the Gelman article actually divides the inputs by two standard deviations (the title of the article is "Scaling regression inputs by dividing by two standard deviations"). It would be helpful to add a few words to the documentation noting this difference from the paper, so that others don't needlessly check the source code to confirm that the function multiplies even though the paper divides.
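
For concreteness, here is a minimal sketch of the two operations on a single variable (mtcars$disp is used purely as an illustration; the object names are my own):

x <- mtcars$disp

# Gelman (2008): divide the raw input by two standard deviations before fitting
x_scaled <- x / (2 * sd(x))

# by_2sd(): fit the model on the raw x and afterwards multiply the fitted
# coefficient (and its standard error) by 2 * sd(x), rather than rescaling x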

stefan-mueller commented 7 years ago

This is actually a very good point; I ran into the same issue yesterday. I don't think dotwhisker::by_2sd() does the same thing as the standardize() method recommended by Gelman (2008).

Let's run a simple example.

First, I use Gelman's method, which applies the following transformation: subtract the mean and divide by two standard deviations (for this exact formula, see also Gelman and Hill 2007: 56).

library(dotwhisker)
library(broom)
library(dplyr)

mtcars <- mtcars %>% 
  mutate(disp_by_2sd_divided = (disp - mean(disp)) / (2 * sd(disp)))

# Regression using the new variable 
reg1_by_2sd_manual <- lm(mpg ~ disp_by_2sd_divided + factor(cyl), data = mtcars) %>% 
  tidy()

Now do the same with dotwhisker::by_2sd():

# Regression with by_2sd() function
reg2 <- lm(mpg ~ disp + factor(cyl), data = mtcars)
reg2_by_2sd_dotwhisker <- tidy(reg2) %>% dotwhisker::by_2sd(mtcars)

Let's compare the output.

reg1_by_2sd_manual
#                term  estimate std.error statistic      p.value
# 1        (Intercept) 23.234066  1.601981 14.503335 1.510070e-14
# 2 disp_by_2sd_divided -6.769195  2.629915 -2.573921 1.563814e-02
# 3       factor(cyl)6 -4.785846  1.649823 -2.900825 7.168524e-03
# 4       factor(cyl)8 -4.792086  2.886818 -1.659989 1.080770e-01

reg2_by_2sd_dotwhisker
#          term  estimate std.error statistic      p.value
# 1  (Intercept) 29.534768  1.426621 20.702610 1.636895e-18
# 2         disp -6.769195  2.629915 -2.573921 1.563814e-02
# 3 factor(cyl)6 -4.785846  1.649823 -2.900825 7.168524e-03
# 4 factor(cyl)8 -4.792086  2.886818 -1.659989 1.080770e-01

While both methods give the same estimates for the independent variables, the intercepts differ. When I apply Gelman's standardize() function from the arm package, I get the result of the reg1_by_2sd_manual regression, not the reg2_by_2sd_dotwhisker result.

# > arm::display(standardize(reg2))
# lm(formula = mpg ~ z.disp + factor(cyl), data = mtcars)
#             coef.est coef.se
# (Intercept)  23.23     1.60  
# z.disp       -6.77     2.63  
# factor(cyl)6 -4.79     1.65  
# factor(cyl)8 -4.79     2.89  
# ---
# n = 32, k = 4
# residual sd = 2.95, R-Squared = 0.78

So dotwhisker::by_2sd() obviously does not replicate standardize(). I think this should be addressed. Does dotwhisker::by_2sd() have any advantage over Gelman's calculation? If not, we might consider changing the formula in the dotwhisker package to Gelman's proposed transformation to avoid confusion.

fsolt commented 7 years ago

Thanks, @stefan-mueller, for working up an example. It makes clear two points:

  1. Multiplying a variable's coefficient by 2*sd(x) after the regression is mathematically equivalent to dividing the variable by 2*sd(x) before entering it into the regression. Of course this is true: if x1 equals x/(2*sd(x)), then, for x and x1 to predict the same change in y (which they must, since x1 simply rescales x), the coefficient of x1 has to be 2*sd(x) times bigger than the coefficient of x. (Sorry, this would be even easier to make clear if I could insert some LaTeX math here, but apparently that's not straightforward.)
  2. But, because by_2sd() does not also subtract the mean as standardize() does, the intercept is shifted by the coefficient times the mean of the variable (for every rescaled variable in the model; a quick numerical check follows below). I tend to think the intercept isn't interesting and so I exclude it from my plots anyway, but I can imagine that there may be a context where it makes sense to care about the intercept, and it also makes sense for by_2sd() to mimic standardize() as closely as possible. I'll totally welcome a PR from anyone who wants to implement that. 😃
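
As a quick numerical check of both points (a sketch reusing reg2 and mtcars from the example above):

# Point 1: multiplying the raw disp coefficient by 2 * sd(disp) reproduces the
# estimate obtained by dividing disp by 2 * sd(disp) before fitting
coef(reg2)["disp"] * 2 * sd(mtcars$disp)
# ~ -6.769, the disp estimate in both reg1_by_2sd_manual and reg2_by_2sd_dotwhisker

# Point 2: because by_2sd() does not center disp, its intercept is the raw model's
# intercept; adding the raw coefficient times mean(disp) recovers the centered intercept
coef(reg2)["(Intercept)"] + coef(reg2)["disp"] * mean(mtcars$disp)
# ~ 23.234, the intercept reported by reg1_by_2sd_manual and arm::standardize()
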
stefan-mueller commented 7 years ago

Thanks for your reply, this makes sense. I slightly extended the details section in by_2sd.R to show the difference from arm::standardize (see PR). If I find time, I might change the implementation, but as you say, I don't know whether that is really necessary, since the intercept is not important in most cases.

At least we have now documented the difference from standardize() and highlighted that the estimates, standard errors, and p-values for all IVs are exactly the same as if we had used Gelman's (2008) formula.