fsolt / dotwhisker

Dot-and-Whisker Plots of Regression Results
https://fsolt.org/dotwhisker/

by_2sd fails when factor levels contain : #96

Closed LukasWallrich closed 2 years ago

LukasWallrich commented 3 years ago

Thanks for this great package.

I just spent a long time troubleshooting a silly issue: in my data, some factor levels included a colon (:), such as "Christian: Protestant", which led by_2sd to fail because it tried to decompose the term as an interaction.

In the end, I figured that out by looking at the code. There is probably no way to distinguish between such factor levels and actual interaction terms in tidy data frames, but it might be worth adding a note about this to the documentation? Or throwing an informative error that suggests it as a possible cause?
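To make the suggestion concrete, here is a minimal base-R sketch of the kind of check that could run before rescaling. The helper find_colon_levels() is hypothetical and is not part of dotwhisker; it only reports which categorical columns have levels containing a colon, so a caller could warn about them.

```r
# Hypothetical helper (NOT part of dotwhisker): return the factor or
# character columns of `data` whose values contain a colon.
find_colon_levels <- function(data) {
  is_cat <- vapply(data, function(x) is.factor(x) || is.character(x), logical(1))
  hits <- lapply(data[is_cat], function(x) {
    lv <- unique(as.character(x))
    lv[grepl(":", lv, fixed = TRUE)]
  })
  # Keep only the columns that actually have offending levels
  hits[lengths(hits) > 0]
}

df <- data.frame(
  faith = factor(c("Christian: catholic", "Christian: protestant", "Muslim")),
  religiosity = c(0.1, -0.4, 1.2)
)

colon_levels <- find_colon_levels(df)
if (length(colon_levels) > 0) {
  message(
    "Levels containing ':' (may be mistaken for interaction terms) in: ",
    paste(names(colon_levels), collapse = ", ")
  )
}
```

A check like this cannot tell a colon in a level from a genuine interaction in the tidy output, but it can flag the data as a likely cause.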

sammo3182 commented 2 years ago

@LukasWallrich, sorry for the very late reply. Dealing with interactions is a little tricky since they create new variable names. But at least in my simple test, by_2sd still works:

library(dotwhisker)

mod <- lm(mpg ~ wt + as.factor(cyl) + as.factor(gear), data = mtcars)

# draw a dot-and-whisker plot
dwplot(mod, by_2sd = TRUE)

[image: the resulting dot-and-whisker plot]

An alternative is to rescale your data with arm::rescale before running the regressions, although I doubt that is necessary for factors. Feel free to reopen the issue if the problem still troubles you. Also, regarding interactions, a better way to interpret them is with marginal effects. Another package of ours, interplot, can help you do that easily.
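For a single numeric predictor, rescaling before or after fitting should give the same slope. The base-R sketch below illustrates this under two assumptions: that the two-standard-deviation rescaling means centering and dividing the predictor by 2 SDs (as arm::rescale is documented to do for continuous inputs), and that the post-hoc version multiplies the coefficient by 2 SDs. The manual formulas are a stand-in for illustration, not either package's code.

```r
# Base-R sketch of two-standard-deviation rescaling for a numeric predictor.
fit_raw <- lm(mpg ~ wt, data = mtcars)

# Option 1: rescale the coefficient after fitting.
coef_scaled <- unname(coef(fit_raw)["wt"]) * 2 * sd(mtcars$wt)

# Option 2: rescale the predictor before fitting (center, divide by 2 SDs).
dat <- transform(mtcars, wt_2sd = (wt - mean(wt)) / (2 * sd(wt)))
fit_scaled <- lm(mpg ~ wt_2sd, data = dat)
coef_refit <- unname(coef(fit_scaled)["wt_2sd"])

# The two slopes agree up to floating-point tolerance.
all.equal(coef_scaled, coef_refit)
```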

LukasWallrich commented 2 years ago

@sammo3182 thanks for getting back to this - I should have included a reprex. I was referring to the by_2sd() function, where I ran into the problem when factor levels include a colon (:) - see below. dwplot() works as it does not refer back to the original data.

library(dotwhisker)
#> Loading required package: ggplot2
library(broom)
library(magrittr)

## Fails
df1 <- data.frame(faith = factor(c(rep("Christian: catholic", 5), rep("Christian: protestant", 5), rep("Muslim", 5))),
                    religiosity = rnorm(15),
                    prayer =  rnorm(15))

mod1 <- lm(prayer ~ religiosity + faith, df1)

tidy(mod1) %>% by_2sd(df1)
#> Error in `[[<-.data.frame`(`*tmp*`, paste0(first, ":", second), value = integer(0)): replacement has 0 rows, data has 15

## Works

df2 <- data.frame(faith = factor(c(rep("Christian catholic", 5), rep("Christian protestant", 5), rep("Muslim", 5))),
                  religiosity = rnorm(15),
                  prayer =  rnorm(15))
mod2 <- lm(prayer ~ religiosity + faith, df2)

tidy(mod2) %>% by_2sd(df2)
#> # A tibble: 4 x 6
#>   term                      estimate std.error statistic p.value by_2sd
#>   <chr>                        <dbl>     <dbl>     <dbl>   <dbl> <lgl> 
#> 1 (Intercept)                 0.0716     0.608     0.118   0.908 TRUE  
#> 2 religiosity                -0.242      0.738    -0.329   0.749 TRUE  
#> 3 faithChristian protestant  -0.505      0.870    -0.581   0.573 TRUE  
#> 4 faithMuslim                 0.591      0.847     0.698   0.500 TRUE

Created on 2021-07-26 by the reprex package (v2.0.0)

sammo3182 commented 2 years ago

Ah, I see. As I mentioned, factors are indeed trickier than numeric variables: handling them requires capturing the new variable names that are created. We'll definitely consider rewriting the function if using by_2sd on factor variables becomes a common request. Until then, you can use dwplot as a workaround, even if you don't want a plot and just need the numeric results. Here's an example:

library(dotwhisker)
library(broom)
library(dplyr)
#> 
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#> 
#>     filter, lag
#> The following objects are masked from 'package:base':
#> 
#>     intersect, setdiff, setequal, union

df1 <- data.frame(
  faith = factor(c(rep("Christian: catholic", 5), rep("Christian: protestant", 5), rep("Muslim", 5))),
  religiosity = rnorm(15),
  prayer = rnorm(15)
)

mod1 <- lm(prayer ~ religiosity + faith, df1)
output <- dwplot(mod1, show_intercept = TRUE, by_2sd = TRUE)$data
output
#>                         term   estimate std.error conf.level   conf.low
#> 1                (Intercept) 0.00000000 0.0000000       0.95  0.0000000
#> 2                religiosity 0.06739025 0.7580993       0.95 -1.6011751
#> 3 faithChristian: protestant 0.74293251 0.8884827       0.95 -1.2126048
#> 4                faithMuslim 0.69714253 0.6256442       0.95 -0.6798911
#>   conf.high   statistic df.error   p.value by_2sd model y_ind
#> 1  0.000000 -1.01159124       11 0.3334679   TRUE   one     4
#> 2  1.735956  0.08889369       11 0.9307642   TRUE   one     3
#> 3  2.698470  0.83618115       11 0.4208462   TRUE   one     2
#> 4  2.074176  1.11427950       11 0.2889136   TRUE   one     1
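Another possible workaround, sketched here in base R, is to strip the colons from the factor levels before fitting, so the term names no longer look like interactions. The replacement string " -" and the object names df_clean and mod_clean are arbitrary choices for illustration, and the seed is only there to make the sketch repeatable.

```r
set.seed(1)  # arbitrary seed, only for repeatability of this sketch
df1 <- data.frame(
  faith = factor(c(rep("Christian: catholic", 5), rep("Christian: protestant", 5), rep("Muslim", 5))),
  religiosity = rnorm(15),
  prayer = rnorm(15)
)

# Replace ":" in the level labels; the underlying data are unchanged.
df_clean <- df1
levels(df_clean$faith) <- gsub(":", " -", levels(df_clean$faith), fixed = TRUE)

mod_clean <- lm(prayer ~ religiosity + faith, df_clean)

# The coefficient names no longer contain a colon.
names(coef(mod_clean))
```

With the colons gone, tidy(mod_clean) %>% by_2sd(df_clean) should behave like the colon-free df2 case in the reprex earlier in this thread.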