Time Varying Covariate

bcallaway11 commented 2 years ago

We should think some about adding options related to time-varying covariates.

From @jtorcasso #71, which I think is a representative example:

Would it be possible to include support for time-varying covariates as a new feature?

Let's say we are evaluating a small program that rolls out in different zips across the US. We expect that the program won't affect things like state population or the unemployment rate, but that these could confound our estimate of the treatment effect. So we'd like to control for these time-varying covariates.

Within the existing package, we could estimate the change in the unemployment rate between two fixed time points and include this as a time-invariant control, but if we have staggered rollout, a fixed time point may not line up well with each group's treatment timing. We could then include changes from other time points, but this doesn't seem very parsimonious.

Instead, if the software supported time-varying controls separate from time-invariant controls, each 2x2 DID could calculate the change in the time-varying control (between the reference period and current time period) and then account for this covariate similarly to how the software handles time-invariant controls.

Would this be possible/useful?

Current Functionality

We currently take the covariates from the base period for both the treated group and the untreated group. So, in the above example, we would condition on pre-treatment unemployment levels in different zip codes. In my view, this is probably better than traditional regression DIDs because (i) it probably allows for some forms of the treatment affecting the covariates, and (ii) it more obviously compares zip codes with similar unemployment rates. In particular, regressions are probably either comparing locations with the same change in unemployment (which seems awkward) or are likely to depend much more heavily on the functional form being correctly specified.

That being said, it is at least worth thinking about giving the user some control over this. I think it would be moderately difficult to implement, but we could let the user decide whether to condition on (i) pre-treatment covariates, (ii) changes in covariates over time, or (iii) both.

I'm open to any feedback/user-comments on how useful this would be. Thoughts @jtorcasso, @pedrohcgs?

jtorcasso commented 2 years ago

@bcallaway11: I agree with (i). In most cases I'm wary of controlling for covariates that change after treatment, in case some of the treatment effect operates through the covariates. In that case, I wouldn't want to control for the change.

But I'm thinking of a situation where you may have a treatment that could not affect the covariate. For instance, let's say I wanted to determine the effect of a nation-wide, county-level apprenticeship program for persons with disabilities. The program probably doesn't affect regional unemployment rates, but changes to the regional economy (unemployment rates) could affect the unemployment rate of persons with disabilities. I could mistake changes associated with the regional labor market for the impacts of the program. In this case, it may be useful to control for the change in regional unemployment, in case counties tended to adopt the program during times of economic expansion.

To your point (ii), I'm wondering if, within this example, controlling for changes also "seems awkward," or if you think there is a different solution within your current framework.

To add on to your proposal: I would second allowing the user to specify both types of controls for covariates--both changes and the (base period) level. May also be useful to allow controls for changes AND percent changes.

bcallaway11 commented 2 years ago

Yeah, I think this is really interesting. I totally agree with you that there are lots of cases where the treatment wouldn't affect the covariates.

I also think you have a good example about the unemployment rate. In that example, we would currently just control for pre-treatment unemployment rates. What seems weird to me about only controlling for the change in unemployment rates is that regions whose unemployment rate went from, say, 9% to 7% could serve as comparison units for regions that went from 3% to 1%, since both have the same change. That said, I think you could make a strong case that it would make sense to control for both the pre-treatment level and the change over time.

I don't really think that there is a big conceptual hurdle to implementing this either. It would just amount to allowing users to include \Delta X as a covariate (not manually, but us doing this behind the scenes if they say that they want this).
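To make the \Delta X idea concrete, here is a rough sketch of computing the base-period level and the change for one 2x2 comparison. It is plain Python for illustration only, not the package's code; the function name `covariate_change`, the column names, and the toy data are all made up. It also reproduces the 9% to 7% versus 3% to 1% example: both units end up with the same change but very different levels.

```python
# Hypothetical long-format panel: one row per (unit, period).
rows = [
    {"id": 1, "period": 2018, "unemp": 9.0},
    {"id": 1, "period": 2020, "unemp": 7.0},
    {"id": 2, "period": 2018, "unemp": 3.0},
    {"id": 2, "period": 2020, "unemp": 1.0},
]

def covariate_change(rows, covariate, base_period, current_period):
    """Return {unit id: (base-period level, change from base to current period)}.

    Illustrative helper only; the name and signature are assumptions.
    """
    base = {r["id"]: r[covariate] for r in rows if r["period"] == base_period}
    current = {r["id"]: r[covariate] for r in rows if r["period"] == current_period}
    # The change is only defined for units observed in both periods.
    return {i: (base[i], current[i] - base[i]) for i in base if i in current}

result = covariate_change(rows, "unemp", base_period=2018, current_period=2020)
for unit, (level, delta) in sorted(result.items()):
    print(unit, level, delta)
```

Because the base and current periods depend on g and t, a helper like this would have to be called inside each 2x2 comparison rather than once up front.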

I'm leaning towards implementing this, but I'll probably need some time....

jtorcasso commented 2 years ago

Sounds good. Do you think a good UX would be two separate x formulas? I'd offer to help, but I'm not sure how localized the change would be, or if it would require changing many functions. The biggest difficulty I foresee is that this new X, "delta X", would have to be defined at the time of estimation, since it depends on g and t. So it's not like you can just define new Xs and carry them around business as usual.

bcallaway11 commented 2 years ago

Yes, getting the right interface might require some thought. I'm not dead-set on this, but I'm kind of disinclined to have a second x formula for time-varying covariates. Perhaps we could add an extra argument called time_varying_covs that can take the following values:

"pre" - and just go with the existing behavior
"change" - only use the change in covariates over time
"both" - include both the pre-treatment level and change
This doesn't give tight control over specifying different combinations (for example, levels for some covariates and changes for others), though I don't expect that to be a common case. We could also allow users to pass covariates in by name here.
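A minimal sketch of how the three proposed values could be dispatched, assuming the base-period levels and the changes have already been computed for the current 2x2 comparison. This is plain Python for illustration, not the package's code: `time_varying_covs` and its values come from the proposal above, while `select_covariates` and the `d_` prefix for change columns are hypothetical.

```python
def select_covariates(base_level, change, time_varying_covs="pre"):
    """Pick the covariate columns for one 2x2 comparison.

    base_level and change map covariate name -> list of per-unit values.
    Hypothetical helper; only the option names come from the proposal.
    """
    # Prefix change columns so they cannot collide with the level columns.
    change_cols = {f"d_{name}": values for name, values in change.items()}
    if time_varying_covs == "pre":
        return dict(base_level)            # existing behavior
    if time_varying_covs == "change":
        return change_cols                 # only the change over time
    if time_varying_covs == "both":
        return {**base_level, **change_cols}
    raise ValueError(
        "time_varying_covs must be 'pre', 'change', or 'both', "
        f"got {time_varying_covs!r}"
    )

levels = {"unemp": [9.0, 3.0]}
changes = {"unemp": [-2.0, -2.0]}
both = select_covariates(levels, changes, time_varying_covs="both")
print(both)
```

Defaulting to "pre" keeps the current behavior unless the user opts in.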

I think the only place where the code would need to be adjusted is here. At that line, disdat is two periods of panel data. You would only need to figure out which covariates vary over time and include them in "x" going forward.
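Figuring out which covariates vary over time in a two-period dataset could look roughly like the sketch below. It is plain Python for illustration and does not reflect how disdat is actually structured; the function name, the `id` key, and the toy rows are assumptions.

```python
def time_varying_columns(rows, covariates, id_key="id"):
    """Return the covariates whose value differs across periods for any unit."""
    first_seen = {}   # (unit, covariate) -> first value observed
    varying = set()
    for row in rows:
        for cov in covariates:
            key = (row[id_key], cov)
            if key in first_seen and first_seen[key] != row[cov]:
                varying.add(cov)
            first_seen.setdefault(key, row[cov])
    # Preserve the order in which the covariates were supplied.
    return [cov for cov in covariates if cov in varying]

# Hypothetical two-period data: unemp changes within unit, region does not.
two_periods = [
    {"id": 1, "period": 2018, "unemp": 9.0, "region": "south"},
    {"id": 1, "period": 2020, "unemp": 7.0, "region": "south"},
    {"id": 2, "period": 2018, "unemp": 3.0, "region": "west"},
    {"id": 2, "period": 2020, "unemp": 1.0, "region": "west"},
]
varying = time_varying_columns(two_periods, ["unemp", "region"])
print(varying)
```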

Some more things that are worth thinking about though:

None of this is feasible with repeated cross sections, since computing a change requires observing the same unit in both periods
The solution above is not going to work with unbalanced panel data either (this might be a much harder case).
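
For the unbalanced panel case, one thing that can at least be checked is which units are missing one of the two periods and therefore have no well-defined change. A rough sketch in plain Python (illustration only, with made-up names and data; how such units should then be handled is left open):

```python
def units_without_change(rows, base_period, current_period,
                         id_key="id", time_key="period"):
    """Return the sorted ids of units not observed in both periods."""
    periods_by_unit = {}
    for row in rows:
        periods_by_unit.setdefault(row[id_key], set()).add(row[time_key])
    needed = {base_period, current_period}
    # A unit has a computable change only if both periods are present.
    return sorted(u for u, seen in periods_by_unit.items() if not needed <= seen)

# Hypothetical unbalanced panel: unit 2 is missing in 2020.
panel = [
    {"id": 1, "period": 2018, "unemp": 9.0},
    {"id": 1, "period": 2020, "unemp": 7.0},
    {"id": 2, "period": 2018, "unemp": 3.0},
]
missing = units_without_change(panel, base_period=2018, current_period=2020)
print(missing)
```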

bcallaway11 / did

Time Varying Covariate #96