bcallaway11 / did

Difference in Differences with Multiple Periods, website: https://bcallaway11.github.io/did

Issues with clustering #93

Closed anjoung closed 2 years ago

anjoung commented 2 years ago

I am using the did package on an unbalanced panel dataset and am running into problems when I try to cluster at the county level. I run the following code:

atts <- att_gt(yname = "beds", 
                 tname = "year", 
                 idname = "id", 
                 gname = "treat_year", 
                 data = cleaned_data, 
                 xformla = NULL, 
                 est_method = "dr", 
                 control_group = "nevertreated",  
                 clustervars = "county_id", 
                 base_period = "universal",
                 panel = TRUE,
                 allow_unbalanced_panel = TRUE,
                 print_details = FALSE) # if TRUE, print detailed results

When I run the above code I get the following error:

Error in rowsum.default(inf.func, cluster, reorder = TRUE) : 
  incorrect length for 'group'

A few notes. First, I originally ran this analysis in Stata, where I could cluster at the county level without issues; I am now redoing it in R. Second, when I set allow_unbalanced_panel = FALSE, I can cluster at the county level. Third, when I set allow_unbalanced_panel = TRUE but drop the clustervars argument (i.e., use the default of clustering only at the id level), everything runs fine.
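Since the behavior depends on whether the panel is balanced, it can help to see which units are observed in fewer periods than the rest. The following is a minimal base-R sketch on made-up data; the data frame `panel` and its values are hypothetical, and only the column names are borrowed from the call above. It does not use the did package.

```r
# Toy long-format panel (made-up values; column names mirror the call above)
panel <- data.frame(
  id        = c(1, 1, 1, 2, 2, 3, 3, 3),
  year      = c(2001, 2002, 2003, 2001, 2003, 2001, 2002, 2003),
  county_id = c(10, 10, 10, 10, 10, 20, 20, 20)
)

# Number of observed periods for each unit
obs_per_id <- table(panel$id)

# Total number of distinct periods in the data
n_periods <- length(unique(panel$year))

# Units observed in fewer periods than the panel has in total
incomplete_ids <- names(obs_per_id)[as.vector(obs_per_id) < n_periods]
incomplete_ids
```

Here unit 2 is missing year 2002, so it is the only id flagged as incomplete.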

Any thoughts on why this error appears would be appreciated! I would also appreciate a pointer to the section of code where the error originates, so I can try to figure it out myself.
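For context on the message itself: base R's `rowsum()` stops with "incorrect length for 'group'" whenever the grouping vector does not have one entry per row of the matrix being summed. The sketch below reproduces only that base-R behavior. The object names and sizes are hypothetical, and the mismatch shown (one row per unit against one cluster entry per observation) is an assumed illustration, not a description of the package's internals.

```r
# Hypothetical sizes: 3 units, but 8 unit-period observations in the long data
n_units <- 3
inf_func <- matrix(rnorm(n_units * 2), nrow = n_units)  # one row per unit

cluster_by_unit <- c(10, 10, 20)                        # one entry per unit
cluster_by_obs  <- c(10, 10, 10, 10, 10, 20, 20, 20)    # one entry per observation

# Lengths match: rows are summed within each cluster
by_cluster <- rowsum(inf_func, cluster_by_unit, reorder = TRUE)

# Lengths differ: rowsum() raises the error; capture its message
err <- tryCatch(
  rowsum(inf_func, cluster_by_obs, reorder = TRUE),
  error = function(e) conditionMessage(e)
)
err
```

Comparing `length(cluster)` with `nrow(inf.func)` at the point of the failing call is therefore a natural first check when tracing this error.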

bcallaway11 commented 2 years ago

Hi Andrew, I think this is a genuine bug, and I should be able to fix it soon. It seems to have been introduced in our new code (version 2.1); the same call worked fine in version 2.0. I'll keep you posted once I get this fixed.

Here is some code that can generate this error:

library(did)

# build simulated panel data
sp <- reset.sim()
data <- build_sim_dataset(sp)

# drop 1 observation to make unbalanced panel
data <- data[-3,]

# confirm error allowing for unbalanced panel
res1 <- att_gt(yname="Y",
              tname="period",
              idname="id",
              gname="G",
              xformla=~X,
              data=data,
              panel=TRUE,
              allow_unbalanced_panel=TRUE,
              clustervars="cluster")
#> Error in rowsum.default(inf.func, cluster, reorder = TRUE): incorrect length for 'group'

# runs (with a warning) when forcing a balanced panel
res2 <- att_gt(yname="Y",
              tname="period",
              idname="id",
              gname="G",
              xformla=~X,
              data=data,
              panel=TRUE,
              allow_unbalanced_panel=FALSE,
              clustervars="cluster")
#> Warning in pre_process_did(yname = yname, tname = tname, idname = idname, :
#> Dropped 1 observations while converting to balanced panel.

# try old version of code
# unload = TRUE also unloads the namespace, so the next library() call loads the old copy
detach("package:did", unload = TRUE)
library(did, lib.loc = "~/R/old_packages/")
res3 <- att_gt(yname="Y",
               tname="period",
               idname="id",
               gname="G",
               xformla=~X,
               data=data,
               panel=TRUE,
               allow_unbalanced_panel=TRUE,
               clustervars="cluster")

anjoung commented 2 years ago

Phew! That's a relief for me. Thanks for doing this amazing work! Look forward to the fix.

bcallaway11 commented 2 years ago

Hi Andrew,

I think this should be fixed now if you update to version 2.1.1 from GitHub. I'm pretty sure that this fixes everything, but keep me posted.

The following demonstrates that the previous example now works:

# devtools::install_github("bcallaway11/did")
library(did)
packageVersion("did")
#> [1] '2.1.1'

# build simulated panel data
sp <- reset.sim()
data <- build_sim_dataset(sp)

# drop 1 observation to make unbalanced panel
data <- data[-3,]

# same call as before; it now runs while allowing for an unbalanced panel
res1 <- att_gt(yname="Y",
              tname="period",
              idname="id",
              gname="G",
              xformla=~X,
              data=data,
              panel=TRUE,
              allow_unbalanced_panel=TRUE,
              clustervars="cluster")
res1
#> 
#> Call:
#> att_gt(yname = "Y", tname = "period", idname = "id", gname = "G", 
#>     xformla = ~X, data = data, panel = TRUE, allow_unbalanced_panel = TRUE, 
#>     clustervars = "cluster")
#> 
#> Reference: Callaway, Brantly and Pedro H.C. Sant'Anna.  "Difference-in-Differences with Multiple Time Periods." Journal of Econometrics, Vol. 225, No. 2, pp. 200-230, 2021. <https://doi.org/10.1016/j.jeconom.2020.12.001>, <https://arxiv.org/abs/1803.09015> 
#> 
#> Group-Time Average Treatment Effects:
#>  Group Time ATT(g,t) Std. Error [95% Simult.  Conf. Band]  
#>      2    2   0.9309     0.0781        0.7328      1.1290 *
#>      2    3   0.9904     0.0814        0.7840      1.1967 *
#>      2    4   0.9670     0.0728        0.7825      1.1514 *
#>      3    2   0.0090     0.0705       -0.1696      0.1875  
#>      3    3   1.0216     0.0726        0.8376      1.2056 *
#>      3    4   1.0270     0.0777        0.8301      1.2239 *
#>      4    2   0.0066     0.0751       -0.1837      0.1969  
#>      4    3   0.0057     0.0698       -0.1713      0.1827  
#>      4    4   1.0427     0.0818        0.8353      1.2500 *
#> ---
#> Signif. codes: `*' confidence band does not cover 0
#> 
#> P-value for pre-test of parallel trends assumption:  0.9977
#> Control Group:  Never Treated,  Anticipation Periods:  0
#> Estimation Method:  Doubly Robust

anjoung commented 2 years ago

Everything seems to be running smoothly now! Thanks a ton for the quick turnaround. Will ping again if something comes up.

bcallaway11 commented 2 years ago

Sounds good, I'm going to close this one, but just let me know if you run into anything.