edunford / tidysynth

A tidy implementation of the synthetic control method in R
Other
98 stars 14 forks source link

Incorrect "plot_trends()" coming from "real_y": observations correspond to reversed years in "time_unit" #19

Closed Isaiahjv closed 2 years ago

Isaiahjv commented 2 years ago

Hello, thank you for your work on this package, it provides great aid. The issue raised here is serious, and thus ought to be corrected urgently. The issue makes the comparison between synthetic and observed completely wrong for some operations including the very important plot_trends(). See reproduction at the bottom for quick understanding.

When

The issue occurs whenever the time variable in the supplied data is in descending order for the corresponding unit and outcome (see reproduction code at the end).

Where

The issue is created in generate_control(), specifically aux_gen_control():

As a result of tidyr::spread below, the data will be in ascending order of the time_unitcolumn.

    # coutcome time series for the control units
    outcome_controls <-
      data %>%
      dplyr::filter(.placebo == is_placebo,.type=="controls") %>%
      dplyr::select(.original_data)  %>%
      tibble::as_tibble() %>%
      tidyr::unnest(cols = c(.original_data)) %>%
      dplyr::select(.data[[unit_index]],.data[[time_index]],.data[[outcome_name]]) %>%
      tidyr::spread(.data[[unit_index]],.data[[outcome_name]])

However , outcome treatment will be in descending order of time if the original data was in descending order:

outcome_treatment <-
      data %>%
      dplyr::filter(.placebo == is_placebo,.type=="treated") %>%
      dplyr::select(.original_data)  %>%
      tibble::as_tibble() %>%
      tidyr::unnest(cols = c(.original_data)) %>%
      dplyr::select(y=.data[[outcome_name]])

This makes synthetic_output force the real_y into the reversed time units from outcome_controls's ascending order.

synthetic_output <-
      (Y_U  %*%  W) %>%
      tibble::as_tibble() %>%
      dplyr::transmute(time_unit = outcome_controls[[time_index]],
                       real_y = outcome_treatment$y,
                       synth_y = weight)

Effect

grab_synthetic_control() and plot_trends are incorrect, and likely any object that uses .synthetic_control.

Reproduction

Using the package example and data

First sort units in descending order of the time unit:

smoking.rev <- smoking %>% group_by(state) %>% arrange(desc(year)) %>% ungroup()

Then using new data below, the the observed outcome correspond to reversed years evidenced by plot_trends() grab_synthetic_control()

smoking_out <-

  smoking.rev %>%

  # initial the synthetic control object
  synthetic_control(outcome = cigsale, # outcome
                    unit = state, # unit index in the panel data
                    time = year, # time index in the panel data
                    i_unit = "California", # unit where the intervention occurred
                    i_time = 1988, # time period when the intervention occurred
                    generate_placebos=T # generate placebo synthetic controls (for inference)
                    ) %>%

  # Generate the aggregate predictors used to fit the weights

  # average log income, retail price of cigarettes, and proportion of the
  # population between 15 and 24 years of age from 1980 - 1988
  generate_predictor(time_window = 1980:1988,
                     ln_income = mean(lnincome, na.rm = T),
                     ret_price = mean(retprice, na.rm = T),
                     youth = mean(age15to24, na.rm = T)) %>%

  # average beer consumption in the donor pool from 1984 - 1988
  generate_predictor(time_window = 1984:1988,
                     beer_sales = mean(beer, na.rm = T)) %>%

  # Lagged cigarette sales 
  generate_predictor(time_window = 1975,
                     cigsale_1975 = cigsale) %>%
  generate_predictor(time_window = 1980,
                     cigsale_1980 = cigsale) %>%
  generate_predictor(time_window = 1988,
                     cigsale_1988 = cigsale) %>%

  # Generate the fitted weights for the synthetic control
  generate_weights(optimization_window = 1970:1988, # time to use in the optimization task
                   margin_ipop = .02,sigf_ipop = 7,bound_ipop = 6 # optimizer options
  ) %>%

  # Generate the synthetic control
  generate_control()

smoking_out %>% plot_trends()
smoking_out %>% grab_synthetic_control()
etiennebacher commented 2 years ago

@edunford This is indeed a very important bug. I just add below all the code necessary to reproduce it (the code to make smoking.rev was missing in the first post), as well as the outputs:

suppressPackageStartupMessages({
  library(tidysynth)
  library(dplyr)
})

smoking.rev <- smoking %>% 
  arrange(-year) 

### Code from the README -------------------------------

smoking_out <- smoking.rev %>%

  # initial the synthetic control object
  synthetic_control(outcome = cigsale, # outcome
                    unit = state, # unit index in the panel data
                    time = year, # time index in the panel data
                    i_unit = "California", # unit where the intervention occurred
                    i_time = 1988 # time period when the intervention occurred
  ) %>%

  # Generate the aggregate predictors used to fit the weights

  # average log income, retail price of cigarettes, and proportion of the
  # population between 15 and 24 years of age from 1980 - 1988
  generate_predictor(time_window = 1980:1988,
                     ln_income = mean(lnincome, na.rm = T),
                     ret_price = mean(retprice, na.rm = T),
                     youth = mean(age15to24, na.rm = T)) %>%

  # average beer consumption in the donor pool from 1984 - 1988
  generate_predictor(time_window = 1984:1988,
                     beer_sales = mean(beer, na.rm = T)) %>%

  # Lagged cigarette sales 
  generate_predictor(time_window = 1975,
                     cigsale_1975 = cigsale) %>%
  generate_predictor(time_window = 1980,
                     cigsale_1980 = cigsale) %>%
  generate_predictor(time_window = 1988,
                     cigsale_1988 = cigsale) %>%

  # Generate the fitted weights for the synthetic control
  generate_weights(optimization_window = 1970:1988, # time to use in the optimization task
                   margin_ipop = .02,sigf_ipop = 7,bound_ipop = 6 # optimizer options
  ) %>%

  # Generate the synthetic control
  generate_control()

smoking_out %>% grab_synthetic_control()
#> # A tibble: 31 × 3
#>    time_unit real_y synth_y
#>        <dbl>  <dbl>   <dbl>
#>  1      1970   41.6    146.
#>  2      1971   47.2    150.
#>  3      1972   52.3    151.
#>  4      1973   53.8    150.
#>  5      1974   54.5    151.
#>  6      1975   56.4    151.
#>  7      1976   58.6    157.
#>  8      1977   63.4    157.
#>  9      1978   67.5    156.
#> 10      1979   68.7    152.
#> # … with 21 more rows
#> # ℹ Use `print(n = ...)` to see more rows
smoking_out %>% plot_trends()

Created on 2022-08-30 by the reprex package (v2.0.1)

edunford commented 2 years ago

Fantastic work pointing this out! Thank you to you both @Isaiahjv and @etiennebacher. bddc3b9 should resolve this issue. There was no ordering assurances when the data was first ingested and stored, but now there are!