edunford / tidysynth

A tidy implementation of the synthetic control method in R

na.rm issue #5

Closed EdiTerlaak closed 3 years ago

EdiTerlaak commented 3 years ago

The tidysynth package works smoothly for me, except for one issue. For one particular variable, I get this error:

Error in aux_gen_pred(data = ., time_window = time_window, ...) : NA generated in specified predictors. Consider using rm.na= TRUE in aggregation function or specifying a larger/different time window

What is meant is na.rm=T, I think. I did specify this though. The thing is, it works for all the other variables. Just not for 'imports'. I have no idea why.

For a reproducible example:

The data set cannot be attached due to its format. It can be downloaded here, though: https://dataverse.harvard.edu/dataset.xhtml?persistentId=doi:10.7910/DVN/I42PPJ

library(tidysynth)
load("afripanel_wdk_final.RData")

subsah_out <-

  afripanel %>%

  # Initialize the synthetic control object
  synthetic_control(outcome = lngdpmad, # outcome
                    unit = WBCode, # unit index in the panel data
                    time = year, # time index in the panel data
                    i_unit = "MLI", # unit where the intervention occurred
                    i_time = 1991, # time period when the intervention occurred
                    generate_placebos=T # generate placebo synthetic controls (for inference)
  ) %>%

  # Generate the aggregate predictor used to fit the weights:
  # mean imports over 1975 to 1990
  generate_predictor(time_window = 1975:1990,
                     eximd = mean(imports, na.rm=T)
  ) %>%

  # Generate the fitted weights for the synthetic control
  generate_weights(optimization_window = 1975:1991, # time to use in the optimization task
                   margin_ipop = .02,sigf_ipop = 7,bound_ipop = 6 # optimizer options
  ) %>%

  # Generate the synthetic control
  generate_control()

edunford commented 3 years ago

The issue seems to be that for some units in your data (e.g. SDN and ETH for years 1975 through 1990), all values of the imports variable are missing. As a result, the corresponding cell in the optimization matrix is NA for those units, which throws the error (typo included! 🤦‍♂️).
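To illustrate why na.rm = TRUE does not help in this case, here is a minimal base-R sketch. The toy vectors are made up for illustration and are not taken from the data set. That the predictor check picks up the resulting NaN is an assumption about tidysynth internals, not something confirmed in this thread.

```r
# Toy vectors standing in for one unit's `imports` values in the time window
# (hypothetical values, not taken from afripanel).
partly_missing <- c(10, NA, 30)
all_missing    <- c(NA_real_, NA_real_, NA_real_)

# With some observed values, na.rm = TRUE averages whatever is left.
mean_partial <- mean(partly_missing, na.rm = TRUE)

# With nothing left after dropping the NAs, the mean of an empty vector
# is NaN, and is.na() treats NaN as missing.
mean_empty <- mean(all_missing, na.rm = TRUE)

print(mean_partial)
print(is.na(mean_empty))
```

So na.rm only helps when at least one value in the window is observed for every unit.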

Here you can see that you're missing data for those units.

afripanel %>% 
  filter(year >= 1975 & year <= 1990) %>% 
  group_by(WBCode) %>% 
  summarize(prop_missing = mean(is.na(imports))) %>% 
  arrange(desc(prop_missing))

One solution would be to drop those units or use another (more complete) variable when generating the predictor matrix.

library(tidysynth)
load("~/Desktop/afripanel_wdk_final.RData")

subsah_out <-

  afripanel %>% 

  filter(!(WBCode %in% c("SDN","ETH"))) %>% 

  synthetic_control(outcome = lngdpmad, # outcome
                    unit = WBCode, # unit index in the panel data
                    time = year, # time index in the panel data
                    i_unit = "MLI", # unit where the intervention occurred
                    i_time = 1991, # time period when the intervention occurred
                    generate_placebos=T # generate placebo synthetic controls (for inference)
  ) %>%

  generate_predictor(time_window = 1975:1990,
                     eximd = mean(imports, na.rm=T)

  ) %>%

  generate_weights(optimization_window = 1975:1991, # time to use in the optimization task
                   margin_ipop = .02,sigf_ipop = 7,bound_ipop = 6 # optimizer options
  ) %>%

  generate_control()
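
Rather than hard-coding the unit codes, the units with no observed values in the window could also be found programmatically. The following is a rough sketch on a made-up toy panel; `toy_panel`, `empty_units`, and `toy_clean` are hypothetical names, and only the column names mirror the real data.

```r
library(dplyr)

# Hypothetical toy panel: unit "B" has no observed `imports` in the window.
toy_panel <- data.frame(
  WBCode  = rep(c("A", "B", "C"), each = 3),
  year    = rep(1975:1977, times = 3),
  imports = c(1, 2, NA,  NA, NA, NA,  4, 5, 6),
  stringsAsFactors = FALSE
)

# Units whose `imports` values are all missing inside the predictor window.
empty_units <- toy_panel %>%
  filter(year >= 1975, year <= 1977) %>%
  group_by(WBCode) %>%
  summarize(all_missing = all(is.na(imports))) %>%
  filter(all_missing) %>%
  pull(WBCode)

# Drop those units before building the synthetic control.
toy_clean <- toy_panel %>%
  filter(!(WBCode %in% empty_units))

print(empty_units)
```

With the real data, the same filter would go before synthetic_control(). It is worth checking that the treated unit itself is not among the dropped units.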

EdiTerlaak commented 3 years ago

Thanks for the quick explanation!