bcallaway11 / did

Difference in Differences with Multiple Periods, website: https://bcallaway11.github.io/did

Difference in the ATT versions 2.0.0 and 2.1.1. ALLOW_UNBALANCED_PANEL not changing ATT? #124

Closed VeronicaCPerez closed 2 years ago

VeronicaCPerez commented 2 years ago

Hi,

I've been running a few did estimations over the past month using version 2.1.1. Someone else ran the same estimations last year, with the same database, but using version 2.0.0, and the results were different.

My results differ between versions 2.0.0 and 2.1.1:

After playing with the parameters a bit, I found that the difference comes from changing the parameter "allow_unbalanced_panel" in version 2.1.1. I therefore estimated the ATT with the 4 combinations of True and False for the parameters "panel" and "allow_unbalanced_panel". I obtained the following ATTs (for the same sample); all other parameters were the same.

Version 2.0.0 (ATT in the cells)

| | allow_unbalanced_panel = True | allow_unbalanced_panel = False |
| --- | --- | --- |
| Panel = True | 0.4464 | (error) |
| Panel = False | 0.4464 | 0.4464 |

Version 2.1.1 (ATT in the cells)

| | allow_unbalanced_panel = True | allow_unbalanced_panel = False |
| --- | --- | --- |
| Panel = True | 0.334 | 0.4333 |
| Panel = False | 0.4464 | 0.4464 |

What I understood after checking the code:

I checked the code in att_gt.R, pre_process_did.R, and compute.att_gt.R. I compared the source code of versions 2.1.1 and 2.0.0 to try to understand why the ATTs were different, and I describe here what I understood (however, I am not an expert in R, so apologies if I'm misunderstanding any of the code).

My understanding is that as soon as I set panel = False, allow_unbalanced_panel no longer matters, because the dataset is treated as a cross section. I also understand that panel = True with allow_unbalanced_panel = True switches panel to False internally (but with true_repeated_cross_sections = False, unlike a real cross section). This is from lines 171-179 of pre_process_did.R.

Now, this means that the data will also go through the part of the code that treats cross_sections (line 221 pre_process_did.R). Then we have 1. real cross sections and 2. unbalanced panels going through this section. The only moment where I find a different coding for these two types of dataset is in the following lines (237 and 244 pre_process_did.R) :

    # n-row data.frame to hold the influence function
    if (true_repeated_cross_sections) {
      data$.rowid <- seq(1:nrow(data))
      idname <- ".rowid"
    } else {
      # set rowid to idname for repeated cross section/unbalanced
      data$.rowid <- data[, idname]
    }

This makes me think that this is the reason why the ATT from panel = True and allow_unbalanced_panel = True would differ from the one with panel = False. However, this only happens in version 2.1.1 and not in version 2.0.0 (which has the same lines of code for the parameter true_repeated_cross_sections).
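
To make the branch quoted above concrete, here is a minimal base-R sketch on made-up data. The helper `assign_rowid` and the `toy` data frame are hypothetical names used only for illustration; this is not the package's function, and it only shows which identifier each row ends up with in the two cases.

```r
# Toy long-format data: unit 3 is missing in period 2 (unbalanced panel).
toy <- data.frame(
  id     = c(1, 1, 2, 2, 3),
  period = c(1, 2, 1, 2, 1),
  y      = c(0.5, 0.9, 0.2, 0.4, 0.7)
)

# Hypothetical helper mirroring the quoted branch (not part of the did package).
assign_rowid <- function(data, idname, true_repeated_cross_sections) {
  if (true_repeated_cross_sections) {
    # True repeated cross sections: every row is labelled as its own unit.
    data$.rowid <- seq_len(nrow(data))
  } else {
    # Unbalanced panel: rows that share an id keep the same label.
    data$.rowid <- data[, idname]
  }
  data
}

rcs <- assign_rowid(toy, "id", TRUE)
ubp <- assign_rowid(toy, "id", FALSE)

length(unique(rcs$.rowid))  # 5 distinct labels, one per row
length(unique(ubp$.rowid))  # 3 distinct labels, one per unit
```

In this sketch the outcome values are untouched by either branch; only the row labels differ.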

My question:

To sum up, I am not sure why the ATT would be the same across these settings in version 2.0.0 but not in 2.1.1. After checking the code, I understand that if I run panel = True and allow_unbalanced_panel = True, the difference from panel = False would be in whether true_repeated_cross_sections is TRUE or FALSE. But the lines of code that use this parameter are in both versions of the package.

So my question is: which version has the correct ATT? (My guess would be 2.1.1, but I'd like to be sure.)

Thank you for this incredible package, Verónica

bcallaway11 commented 2 years ago

Hi Veronica,

Thanks for the message. Do you get the same ATT(g,t)'s in each case?
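
One way to make that comparison is sketched below. It assumes the `did` package is installed and uses its bundled `mpdta` example data with three arbitrarily chosen rows dropped to create an unbalanced panel; the helper name `run_combo` is hypothetical, and the data in the original report are not available here, so this only illustrates the mechanics of comparing group-time and aggregated estimates.

```r
# Hypothetical helper: runs att_gt() for one (panel, allow_unbalanced_panel)
# setting and returns both the ATT(g,t) estimates and the simple aggregate.
run_combo <- function(data, panel, allow_unbalanced_panel) {
  gt <- did::att_gt(
    yname = "lemp", tname = "year", idname = "countyreal",
    gname = "first.treat", data = data,
    panel = panel, allow_unbalanced_panel = allow_unbalanced_panel,
    bstrap = FALSE, cband = FALSE  # skip the bootstrap to keep the run short
  )
  list(
    att_gt  = data.frame(group = gt$group, t = gt$t, att = gt$att),
    overall = did::aggte(gt, type = "simple")$overall.att
  )
}

if (requireNamespace("did", quietly = TRUE)) {
  data("mpdta", package = "did")
  # Assumption for illustration: drop a few rows so the panel is unbalanced.
  unbalanced <- mpdta[-c(1, 7, 13), ]

  a <- run_combo(unbalanced, panel = TRUE,  allow_unbalanced_panel = TRUE)
  b <- run_combo(unbalanced, panel = FALSE, allow_unbalanced_panel = FALSE)

  print(all.equal(a$att_gt, b$att_gt))  # do the ATT(g,t)'s agree?
  print(c(a$overall, b$overall))        # do the aggregated ATTs agree?
}
```

Running the same script under each installed package version and comparing the two printed results separates a change in the ATT(g,t)'s from a change in the aggregation step.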

I think between version 2.0.0 and 2.1.1, we changed how we calculated the relative group sizes (which affects the weights on aggregated parameters) but shouldn't change the ATT(g,t)'s.

For this part, in version 2.0.0, we were only calculating relative group sizes from their sizes in the first period, which makes sense if you have a balanced panel. After version 2.0.0, we switched to calculating group sizes using all available observations for unbalanced panels. In principle, neither approach is "wrong", but this change does cause estimates of aggregated parameters to differ across versions (presumably not by much). Using the full data (as in 2.1.1) should generally be the better choice.
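
A toy base-R illustration of why the two ways of measuring group sizes can move an aggregated estimate while leaving the group-level effects untouched. Everything here is made up for illustration: the data, the per-group ATT values, and the choice to count rows as "all available observations" are assumptions, and the package's actual aggregation code is not reproduced.

```r
# Toy unbalanced panel: 'g' is the period in which a unit is first treated.
# Unit 4 (group 3) is not observed in the first period.
toy <- data.frame(
  id     = c(1, 1, 1, 2, 2, 2, 3, 3, 4, 4),
  period = c(1, 2, 3, 1, 2, 3, 1, 2, 2, 3),
  g      = c(2, 2, 2, 2, 2, 2, 3, 3, 3, 3)
)

# Hypothetical group-level effects, held fixed in both calculations.
att_by_group <- c("2" = 0.30, "3" = 0.60)

# Relative group sizes from the first period only.
share_first <- prop.table(table(toy$g[toy$period == min(toy$period)]))
# Relative group sizes from all available observations (rows).
share_all <- prop.table(table(toy$g))

agg_first <- sum(att_by_group[names(share_first)] * as.numeric(share_first))
agg_all   <- sum(att_by_group[names(share_all)]   * as.numeric(share_all))

agg_first  # 0.40: weights 2/3 and 1/3
agg_all    # 0.42: weights 0.6 and 0.4
```

The same group-level effects give different aggregates only because the weights change, which is the pattern described above.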

All that said, if you are getting different ATT(g,t)'s, this would surprise me more, and I would have to dig into this more.

VeronicaCPerez commented 2 years ago

Hi Prof. Callaway,

Yes, the ATT(g,t)'s are the same in both versions. Thank you so much for your answer; the difference is clear now.