ebenmichael / augsynth

Augmented Synthetic Control Method
MIT License

Incomplete panel data causes segfault and R to abort #56

Open williamlief opened 3 years ago

williamlief commented 3 years ago

We've found that incomplete panel data causes a segfault that aborts R. This is related to #53, where missing values in the panel cause a segfault. Here we have determined that panel data with missing rows (i.e., entire rows are absent) also causes a segfault. Below is a reproducible example of the behavior. I've also added an example of a simple function that could be called at the top of the relevant augsynth and multisynth functions so that an error is returned instead of crashing R.

library(augsynth)
library(dplyr)

kansas_2 <- kansas %>%
  filter(!(fips == 2 & year_qtr == 1995.25)) # drop an arbitrary row, creating incomplete panel data

# # This will cause a segfault and R to abort
# syn <- augsynth(lngdpcapita ~ treated, fips, year_qtr, kansas_2,
#                 progfunc = "None", scm = TRUE)

# Example function to check for completeness - could be expanded to check for missing 
# values or other issues. Catching and returning errors will make use much easier 
# than allowing the R session to abort.
check_data <- function(data, unit, time) {

  # Check whether there are omitted rows
  full_data <- data %>% tidyr::expand({{unit}}, {{time}})

  if(nrow(data) != nrow(full_data)) stop("There are missing rows in the input data set. Panel must be balanced.")

}

check_data(kansas, fips, year_qtr) # silent when no issue detected
check_data(kansas_2, fips, year_qtr) # stops with an error: the panel is missing a row
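
The comment above notes that the check could be expanded to cover missing values or other issues. One possible shape for that is sketched below. It is only an illustration, not part of augsynth: `check_panel` is a hypothetical helper name, it takes column names as strings rather than bare names, and it uses only base R. It also flags duplicated unit-time pairs, since a pure row-count comparison can pass when a duplicated row offsets a missing one. The toy data frame stands in for the kansas data.

```r
# Hypothetical helper (not part of augsynth): validate a panel before fitting.
# `unit`, `time`, and `outcome` are column names given as strings.
check_panel <- function(data, unit, time, outcome) {
  cols <- c(unit, time, outcome)
  absent <- setdiff(cols, names(data))
  if (length(absent) > 0) {
    stop("Columns not found in data: ", paste(absent, collapse = ", "))
  }

  # Missing values in the key or outcome columns (the case reported in #53)
  if (any(is.na(data[cols]))) {
    stop("Missing values found in the unit, time, or outcome columns.")
  }

  # Duplicated unit-time pairs would hide a missing row from a row-count check
  if (anyDuplicated(data[c(unit, time)]) > 0) {
    stop("Duplicated unit-time rows found.")
  }

  # Every unit must appear in every time period
  n_expected <- length(unique(data[[unit]])) * length(unique(data[[time]]))
  if (nrow(data) != n_expected) {
    stop("There are missing rows in the input data set. Panel must be balanced.")
  }

  invisible(TRUE)
}

# Toy balanced panel, for illustration only
panel <- expand.grid(fips = 1:3, year_qtr = c(1995, 1995.25, 1995.5))
panel$lngdpcapita <- seq_len(nrow(panel))

check_panel(panel, "fips", "year_qtr", "lngdpcapita") # silent when balanced

# Drop one row and capture the error message instead of stopping the script
unbalanced <- panel[!(panel$fips == 2 & panel$year_qtr == 1995.25), ]
result <- tryCatch(
  check_panel(unbalanced, "fips", "year_qtr", "lngdpcapita"),
  error = function(e) conditionMessage(e)
)
```

Running the checks in this order means the most specific problem is reported first, so a user with both missing values and missing rows sees the missing-value message before the balance message.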

ebenmichael commented 3 years ago

This looks great! I'll incorporate it. Or, if you'd like, you could add it to the single_augsynth function yourself, add a test for it in the test_format.R file, and make a pull request.
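
A test along those lines might look roughly like the sketch below. It assumes the testthat, dplyr, and tidyr packages are installed, repeats the `check_data` function proposed in this issue so that it runs on its own, and uses a small toy panel in place of the kansas data. How the check would actually be wired into single_augsynth is not settled here, so the test calls `check_data` directly; once the check is incorporated, the `expect_error` would presumably wrap the `augsynth()` call instead.

```r
library(testthat)
library(dplyr)

# check_data as proposed in this issue, repeated so the sketch is self-contained
check_data <- function(data, unit, time) {
  # Build every unit-time combination and compare row counts
  full_data <- data %>% tidyr::expand({{unit}}, {{time}})
  if (nrow(data) != nrow(full_data)) {
    stop("There are missing rows in the input data set. Panel must be balanced.")
  }
}

# Toy balanced panel standing in for the kansas data (illustration only)
panel <- tidyr::expand_grid(fips = 1:3, year_qtr = c(1995, 1995.25, 1995.5)) %>%
  mutate(lngdpcapita = rnorm(n()))

test_that("incomplete panel data raises an error instead of crashing R", {
  # Drop an arbitrary row to create an incomplete panel
  incomplete <- panel %>% filter(!(fips == 2 & year_qtr == 1995.25))

  expect_silent(check_data(panel, fips, year_qtr))
  expect_error(check_data(incomplete, fips, year_qtr), "Panel must be balanced")
})
```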

ebenmichael commented 2 years ago

Hi, sorry for the long delay on this. I'm hoping to get around to it soon!