bernardodionisi / differences

difference-in-differences in Python
https://bernardodionisi.github.io/differences/latest/
GNU General Public License v3.0

Inconsistent cohort definition? #3

Open aayala15 opened 1 year ago

aayala15 commented 1 year ago

Hi there. I'm a bit confused about how the cohort variable is defined and later used within the ATTgt class. Let me provide an example to determine whether this is an actual bug or whether I'm just misunderstanding something. Let's start by using the simulate_data function to simulate some panel data.

from differences import ATTgt, simulate_data

panel_data = simulate_data(n_cohorts=5, intensity_by=1, tau=0.0, low=2.0, high=2.0)
panel_data.head()
                      y          x0         w  cat.0  cat.1  effect  cohort  strata  intensity
('e0', 1900)   0.069472    0.160245   3.72983      0      0       0    1903       0          2
('e0', 1901)    8.95894     -1.9334    1.6912      0      0       0    1903       0          2
('e0', 1902)    8.24309  -0.0568131   1.78633      0      0       0    1903       0          2
('e0', 1903)    5.05815    0.810118   1.74161      0      0       0    1903       0          2
('e0', 1904)    15.2352    0.101205  0.806004      0      0      20    1903       0          2
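
A quick way to see where cohort sits relative to the onset of the effect column (a rough check on the simulated frame; it assumes the (entity, time) MultiIndex shown above):

# first period, per entity, in which the simulated effect is non-zero
first_effect_period = (
    panel_data[panel_data["effect"] != 0]
    .groupby(level=0)
    .apply(lambda g: g.index.get_level_values(-1).min())
)

# each entity's cohort label
cohort_label = panel_data.groupby(level=0)["cohort"].first()

# for this example the effect column turns on one period after cohort (difference of 1)
print((first_effect_period - cohort_label).value_counts())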

The simulated data set the cohort variable to the last time step before the intervention (e.g. 1903 for unit e0, whose treatment effect kicks in in 1904). This cohort definition is non-standard, as cohort is usually defined as the start of treatment. I then proceed to fit the model and plot the estimated ATTs.

att = ATTgt(data=panel_data, cohort_name="cohort")
att.fit(formula="y", est_method="reg")
att.plot(color_by="cohort", shape_by="post")
[screenshot of the disaggregated ATT plot: colors by cohort, marker shapes by pre/post]

The resulting plot looks correct, with marker shapes correctly encoding pre- and post-intervention periods (e.g. for cohort = 1903, time step 1903 is labeled as pre, while 1904 is labeled as post).

att.aggregate("event")
att.plot("event")

However, there seems to be an inconsistency with the above cohort definition when aggregating the ATTs. For instance, when I run the aggregate method at the "cohort" level, I get the following results:

att.aggregate("cohort")
cohort      ATT  std_error    lower    upper  zero_not_in_cband
  1902  16.8192    1.18693  14.4929  19.1456                  *
  1903  15.6355    1.18963  13.3039  17.9671                  *
  1904  15.5679    1.14973  13.3145  17.8213                  *
  1905  13.5871    1.17521  11.2837  15.8904                  *

The average ATTs presented here follow the standard cohort definition (i.e. cohort = start of the intervention), as they include the time == cohort data points. This clearly biases the estimated ATTs, as it mixes in a pre-intervention time period. This is easy to see by manually aggregating the ATTs below: the second result, which excludes the time == cohort estimates, is clearly the right one.

att.results().query("time >= cohort").iloc[:, 0].groupby(level=0).mean()
cohort
1902    16.819243
1903    15.635493
1904    15.567925
1905    13.587051
Name: (ATTgtResult, , ATT), dtype: float64
att.results().query("time > cohort").iloc[:, 0].groupby(level=0).mean()
cohort
1902    20.142418
1903    19.561096
1904    20.113236
1905    19.812459
Name: (ATTgtResult, , ATT), dtype: float64

Hence, it seems like both the simulate_data function and the (disaggregated) plot method generate/expect a non-standard cohort definition which is inconsistent with other functions/methods in the package. Thoughts? Sorry if I'm missing something.

bernardodionisi commented 1 year ago

Hi Andres,

Thank you for noticing this and for raising the issue!

I think the confusion comes from the "post" label attached to the disaggregated plot, not from how cohort is defined: cohort is the date/time of treatment for a focal entity.

The disaggregated plot

There is indeed an inconsistency when one calls .plot() without any aggregation (the disaggregated plot). In the disaggregated plot, post is defined as time > cohort (attgt - utility.py - L1021), while in the aggregations, i.e. when one calls .aggregate(), post is defined as time >= cohort (attgt - utility.py - L334).

The post-treatment period should be defined as it is in the aggregations, time >= cohort, that is, a relative period >= 0, where 0 is the period when the treatment is administered.
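
For concreteness, here is the one-character difference, illustrated on the cohort = 1903 case from your example (illustrative code only, not the package's internals):

import pandas as pd

df = pd.DataFrame({"cohort": [1903, 1903], "time": [1903, 1904]})

post_plot = df["time"] > df["cohort"]   # current disaggregated-plot labeling: 1903 -> pre
post_agg = df["time"] >= df["cohort"]   # aggregation (intended) definition: 1903 -> post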

I will fix this in a new release; in the meantime you can still use the same method and get a plot that uses time >= cohort.

After running your example:

from differences import ATTgt, simulate_data

panel_data = simulate_data(n_cohorts=5, intensity_by=1, tau=0.0, low=2.0, high=2.0)

att = ATTgt(data=panel_data, cohort_name="cohort")

att.fit(formula="y", est_method="reg")

you can run:

import numpy as np
from differences.tools.utility import single_idx

# just to process the ATTgt output into a format that is easier to plot
_, res = single_idx(att.results())

res = (
    res
    .assign(
        post=lambda x: np.where(x["cohort"] <= x["time"], "post", "pre"),
        cohort=lambda x: x['cohort'].astype(str)  # a string plays nicer with the legend (which is interactive)
    )
)

# if you want the plot to look like the ones in this package
# you can still use the ATTgt method which has some preset parameters: just change the data for now
chart = att.plot(data=res, color_by="cohort", shape_by="post")

The simulated data

Regarding the simulate_data function, as well as some points you made, such as the correct estimate being the one with time > cohort: I think the reasoning is guided by knowing (in the generated fake data) the true effect on the outcome, which in the example above is an effect of 20 starting 1 period after the treatment.

To put it differently: suppose that, in the fake data generated with simulate_data, the effect on the outcome started 3 periods after the treatment instead of 1, and I had not told you when the effect started or how large it was. How would you define the post period?

The post period is defined as post relative to the treatment date (time >= cohort) and not post relative to when the outcome changes.

In a fake dataset we can manually define the size of the treatment effect on the outcome and the time when the effect starts kicking in, but neither is known in a non-simulated dataset. So I think the confusion was created by having access (in this example) to the "effect" column.


In this example, when calling simulate_data, you can change when the treatment starts having an effect on the outcome. So for example, if you run the following, you will get the outcome changing at time = cohort instead of at time = cohort + 1.

from differences import simulate_data

panel_data = simulate_data(
    n_cohorts=5,
    intensity_by="cohort",
    tau=-1,  # <--- change here
    alpha=1,  # and here
    low=2,
    high=2
)
                     y          x0         w  cat.0  cat.1  effect  cohort  intensity
('e0', 1900)  -5.34881    0.160245   3.72983      0      0       0    1904          2
('e0', 1901)   -6.3529     -1.9334    1.6912      0      0       0    1904          2
('e0', 1902)    -9.215  -0.0568131   1.78633      0      0       0    1904          2
('e0', 1903)  -1.37983    0.810118   1.74161      0      0       0    1904          2
('e0', 1904)   19.6909    0.101205  0.806004      0      0      20    1904          2
('e0', 1905)   17.6973    0.528641  0.398621      0      0      20    1904          2
('e0', 1906)   25.2076   0.0325095  0.905004      0      0      20    1904          2
('e0', 1907)   31.0072     1.74987    1.0075      0      0      20    1904          2
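
As a quick check on this new frame (again assuming the (entity, time) MultiIndex), the onset of the effect column now coincides with the cohort date:

first_effect_period = (
    panel_data[panel_data["effect"] != 0]
    .groupby(level=0)
    .apply(lambda g: g.index.get_level_values(-1).min())
)
cohort_label = panel_data.groupby(level=0)["cohort"].first()

# with tau=-1 the difference is now 0: the effect starts at time = cohort
print((first_effect_period - cohort_label).value_counts())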

You can also change a number of other things in simulate_data; however, I have not documented it yet because I have not finalized what the parameters should be called or which values should be passed to them.

I have also not checked or tested the simulate_data function to make sure it works the way one would expect; I am using it in the package's examples to give a sense of what kind of data to feed into the package's functionality, in order to get people started.

I am not sure whether other packages in R or Stata have a function to simulate data, but simulate_data would be different from what other packages provide. As soon as I find some time, I will make sure it outputs data that is more intuitive to understand and play around with.

Since posting the package I have had very little time to work on it, but I hope to be able to make some improvements and additions soon! Thank you very much for taking the time to raise this issue, and please let me know if I have misunderstood or not fully answered your points.

aayala15 commented 1 year ago

Hi Bernardo. Thanks for the detailed response. I had identified the same line (attgt - utility.py - L1021) as the source of the problem.

Regarding the simulate_data function, there is nothing inherently wrong with it. It is great that it supports anticipated/delayed responses, heterogeneity across time and cohorts, etc. I think my comment is more about the choice of default parameters. The did R package also allows one to simulate data; however, its default configuration is more plain vanilla, which makes it easier to understand the results when interacting with the package for the first time. Better documentation would have definitely helped.

Happy to help with improvements and new features.

bernardodionisi commented 1 year ago

Definitely agree that the defaults of simulate_data should be more intuitive.


About helping: that would be wonderful. Did you have something in mind?

Some things that I think would enhance the package significantly are:

Improvements to the documentation, especially the functions and classes related to the attgt and did modules; additionally, I think it would be ideal if the package explained what is happening when one calls ATTgt and its various methods, maybe in a notebook.

Tests also need a lot of improvement. Right now they just check that the results are consistent with the R did package; ideally the package needs much better code coverage (a rough sketch of the kind of additional consistency test I have in mind is below).
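
For example, something along these lines; the exact shape of the aggregate output (which column holds the ATT point estimates, and the row order) is an assumption on my part and may need adjusting:

import numpy as np

from differences import ATTgt, simulate_data


def test_cohort_aggregation_matches_post_period_mean():
    panel_data = simulate_data(n_cohorts=5, intensity_by=1, tau=0.0, low=2.0, high=2.0)

    att = ATTgt(data=panel_data, cohort_name="cohort")
    att.fit(formula="y", est_method="reg")

    # manual aggregation over the post periods (time >= cohort), as in the example above
    manual = (
        att.results()
        .query("time >= cohort")
        .iloc[:, 0]
        .groupby(level=0)
        .mean()
    )

    # built-in cohort aggregation; assumes the ATT point estimates sit in the first column
    # and that rows are ordered by cohort, like the manual series
    aggregated = att.aggregate("cohort").iloc[:, 0]

    np.testing.assert_allclose(np.asarray(aggregated), np.asarray(manual), rtol=1e-6)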

When I have some time, I will try to lay out some improvements to the code and features that I think could be implemented, but of course feel free to open issues about enhancements or anything else. I may not be able to respond quickly in the next couple of months, but I'll reply as soon as I can.