edunford / tidysynth

A tidy implementation of the synthetic control method in R

Most of `grab_` and `plot_` functions don't work if `generate_placebos = FALSE` #20

Closed etiennebacher closed 2 years ago

etiennebacher commented 2 years ago

Hello, I think I found a bug that is present in the GitHub version but not in the CRAN version. When I use `generate_placebos = FALSE` in `synthetic_control()`, most of the `grab_` and `plot_` functions don't work:

library(tidysynth)

smoking_out <- smoking %>%
  synthetic_control(
    outcome = cigsale, 
    unit = state, 
    time = year,
    i_unit = "California", 
    i_time = 1988, 
    generate_placebos = FALSE 
  ) %>%

  generate_predictor(
    time_window = 1980:1988,
    ln_income = mean(lnincome, na.rm = TRUE)
  ) %>%
  generate_weights(optimization_window = 1970:1988) %>%
  generate_control()

smoking_out %>% grab_outcome()
#> # A tibble: 0 × 1
#> # … with 1 variable: .outcome <???>
#> # ℹ Use `colnames()` to see all variable names

smoking_out %>% grab_predictors()
#> # A tibble: 0 × 1
#> # … with 1 variable: .predictors <???>
#> # ℹ Use `colnames()` to see all variable names

smoking_out %>% grab_unit_weights()
#> # A tibble: 0 × 1
#> # … with 1 variable: .unit_weights <???>
#> # ℹ Use `colnames()` to see all variable names

smoking_out %>% grab_predictor_weights()
#> # A tibble: 0 × 1
#> # … with 1 variable: .predictor_weights <???>
#> # ℹ Use `colnames()` to see all variable names

smoking_out %>% grab_loss()
#> # A tibble: 1 × 4
#>   .id     .placebo variable_mspe control_unit_mspe
#>   <chr>      <dbl>         <dbl>             <dbl>
#> 1 Alabama        1          149.         0.0000257

smoking_out %>% grab_significance()
#> # A tibble: 1 × 8
#>   unit_name type  pre_mspe post_mspe mspe_ratio  rank fishers_exact_pv…¹ z_score
#>   <chr>     <chr>    <dbl>     <dbl>      <dbl> <int>              <dbl>   <dbl>
#> 1 Alabama   Donor     149.      11.1     0.0744     1                  1      NA
#> # … with abbreviated variable name ¹​fishers_exact_pvalue

smoking_out %>% grab_balance_table()
#> Error in `chr_as_locations()`:
#> ! Can't subset columns that don't exist.
#> ✖ Column `variable` doesn't exist.

smoking_out %>% grab_synthetic_control()
#> # A tibble: 0 × 1
#> # … with 1 variable: .synthetic_control <???>
#> # ℹ Use `colnames()` to see all variable names

smoking_out %>% plot_trends()
#> Error in `dplyr::filter()`:
#> ! Problem while computing `..1 = time_unit %in% time_window`.
#> Caused by error in `time_unit %in% time_window`:
#> ! object 'time_unit' not found

smoking_out %>% plot_differences()
#> Error in `dplyr::mutate()`:
#> ! Problem while computing `diff = real_y - synth_y`.
#> Caused by error in `mask$eval_all_mutate()`:
#> ! object 'real_y' not found

smoking_out %>% plot_weights()
#> Error in `chr_as_locations()`:
#> ! Can't rename columns that don't exist.
#> ✖ Column `variable` doesn't exist.

Created on 2022-08-31 by the reprex package (v2.0.1)

Everything works correctly with `generate_placebos = TRUE`, but that makes `synthetic_control()` slower, and placebos shouldn't be needed for these functions.


Besides this bug, even the output of `generate_control()` looks wrong when `generate_placebos = FALSE`: both rows have `.id` Alabama and `.placebo` 1, although the treated unit is California:

library(tidysynth)

smoking_out <- smoking %>%
  synthetic_control(
    outcome = cigsale, 
    unit = state, 
    time = year,
    i_unit = "California", 
    i_time = 1988, 
    generate_placebos = FALSE 
  ) %>%

  generate_predictor(
    time_window = 1980:1988,
    ln_income = mean(lnincome, na.rm = TRUE)
  ) %>%
  generate_weights(optimization_window = 1970:1988) %>%
  generate_control()

smoking_out
#> # A tibble: 2 × 11
#>   .id     .placebo .type   .outcome .predi…¹ .synth…² .unit_…³ .predi…⁴ .origi…⁵
#>   <chr>      <dbl> <chr>   <list>   <list>   <list>   <list>   <list>   <list>  
#> 1 Alabama        1 treated <tibble> <tibble> <tibble> <tibble> <tibble> <tibble>
#> 2 Alabama        1 contro… <tibble> <tibble> <tibble> <tibble> <tibble> <tibble>
#> # … with 2 more variables: .meta <list>, .loss <list>, and abbreviated variable
#> #   names ¹​.predictors, ²​.synthetic_control, ³​.unit_weights,
#> #   ⁴​.predictor_weights, ⁵​.original_data
#> # ℹ Use `colnames()` to see all variable names

Created on 2022-08-31 by the reprex package (v2.0.1)

Can you confirm whether this works as expected?

etiennebacher commented 2 years ago

I think the problem comes from the following lines in `synthetic_control.data.frame()`:

https://github.com/edunford/tidysynth/blob/afd112a57f9716bf649f98ebb1dab68d66e9da00/R/main.R#L222-L240

The problem is the `break` call in the `for` loop: it assumes that the treated unit comes first in the data. This is not true, because before the loop the data is arranged by unit and then by placebo value, whereas it should be arranged by placebo value first (so that the treated unit comes first) and then by unit. In the example above, Alabama sorts before California, which would explain why Alabama ends up in the output instead of California.

Bottom line, I think the fix is to replace this line:

https://github.com/edunford/tidysynth/blob/afd112a57f9716bf649f98ebb1dab68d66e9da00/R/main.R#L216

with:

dplyr::arrange(placebo, !!unit) %>%
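
To illustrate the ordering issue outside the package, here is a minimal sketch. It is not the actual tidysynth code: the toy data frame, its column names (`unit`, `placebo`), the convention that 0 marks the treated unit, and the helper `first_unit_processed()` are all assumptions made for this example. It only shows how a loop that breaks after the first row picks a different unit depending on the sort order.

```r
library(dplyr)

# Toy stand-in for the internal unit table (hypothetical column names).
# Assumption: placebo == 0 marks the treated unit, 1 marks donor units.
units <- data.frame(
  unit    = c("California", "Alabama", "Utah"),
  placebo = c(0, 1, 1),
  stringsAsFactors = FALSE
)

# Hypothetical helper that mimics a loop which stops after the first unit,
# as happens when no placebos are requested.
first_unit_processed <- function(tbl) {
  processed <- character(0)
  for (u in tbl$unit) {
    processed <- c(processed, u)
    break  # only the first row is ever handled
  }
  processed
}

# Sorting by unit first puts Alabama on top, so the loop never reaches
# the treated unit.
before_fix <- units %>%
  arrange(unit, placebo) %>%
  first_unit_processed()

# Sorting by placebo first puts the treated unit (placebo == 0) on top.
after_fix <- units %>%
  arrange(placebo, unit) %>%
  first_unit_processed()

print(before_fix)
print(after_fix)
```

With this toy data, `before_fix` is "Alabama" and `after_fix` is "California", which matches the symptom in the reprex above.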