JMSLab / eventstudyr

Other
24 stars 1 forks source link

Prepare EventStudyFHS() #11

Closed nateschor closed 1 year ago

nateschor commented 2 years ago

In this issue we'll code an initial version of the EventStudyFHS helper function outlined here. We'll also update EventStudy and EventStudyPlot to handle IV.

Next steps and person who will work on them (can be modified):

Add FHS examples/remove external package calls:

Write FHS Tests for:

Code Review

rcalvo12 commented 2 years ago

Per RA Call: @ew487 is continuing to work on implementation. Will add updates soon.

ew487 commented 2 years ago

Update: EventStudy(), EventStudyFHS(), and PrepareModelFormula() functions modified to handle IV. To do: update EventStudyPlot()

ew487 commented 2 years ago

Hi @jmshapir, would it be possible to make me have write access to this repository?

jmshapir commented 2 years ago

Hi @jmshapir, would it be possible to make me have write access to this repository?

@ew487 done, thanks!

ew487 commented 2 years ago

Got it, thanks @jmshapir!

@nateschor Sorry for the delay, I added the functions!

nateschor commented 2 years ago

Thanks @ew487! I'm running the example code in EventStudyFHS() and getting the error below--is this happening on your end too?

model_formula <-  PrepareModelFormula(
estimator = "FHS", outcomevar = "y_base",
str_policy_fd = c("z_fd", "z_fd_lead2", "z_fd_lead3", "z_fd_lag1", "z_fd_lag2"),
str_policy_lead = "z_lead3",
str_policy_lag = "z_lag3",
controls = "x_r",
proxy = "proxy"
)

image

ew487 commented 2 years ago

Hi @nateschor thanks for pointing this out! I tried this code and I get the same error, and if I specify a value for proxyIV then it works. Does that work for you as well?

Related to this, maybe we can consider implementing a case where if proxy is specified but not proxyIV, then we automatically use the default like in xtevent, what do you think about that?

nateschor commented 2 years ago

Hi @nateschor thanks for pointing this out! I tried this code and I get the same error, and if I specify a value for proxyIV then it works. Does that work for you as well?

@ew487 would you be able to push to GitHub the change you made that solved the error so that we can make sure we're on the same page?

Related to this, maybe we can consider implementing a case where if proxy is specified but not proxyIV, then we automatically use the default like in xtevent, what do you think about that?

We've been trying to write our package so that eventstudyr and xtevent agree (as much as possible), so that sounds great to me! Maybe adding a message or warning making the user aware of this behavior could help too (in addition to putting it in our documentation)?

ew487 commented 2 years ago

@nateschor I just pushed a script with examples in it. It should work when run from the main directory but please let me know if it doesn't! Later I will update the documentation and data to reflect these examples (and delete the script).

Ok, sounds good & thanks again!

ew487 commented 2 years ago

Unrelated update: I'm noticing another potential discrepancy between stata and R regarding including the intercept and will look into this soon.

ew487 commented 2 years ago

@nateschor The examples should run now, hopefully!

nateschor commented 2 years ago

@nateschor The examples should run now, hopefully!

The examples are running for me, thanks so much @ew487!

nateschor commented 2 years ago

@ew487 @rcalvo12 when I run the FHS code from the EventStudy documentation and try using it with EventStudyPlot (below):

library(dplyr)
library(magrittr)

data <- df_sample_dynamic %>% select(y_base, z, id, t, x_r, eta_m)

results <- EventStudy(estimator = "FHS", data = data, outcomevar = "y_base", policyvar = "z",
idvar = "id", timevar = "t", controls = "x_r", proxy = "eta_m", proxyIV = "z_fd_lead3",
FE = TRUE, TFE = TRUE, post = 3, overidpost = 1, pre = 0, overidpre = 3, normalize = -1,
cluster = TRUE)

EventStudyPlot(estimates = results,
xtitle = "Event time",
ytitle = "Coefficient",
ybreaks = c(-1.5, -.5, 0, .5, 1.5),
conf_level = .95,
Supt = .95,
num_sim = 1000,
seed = 1234,
Addmean = FALSE,
Preeventcoeffs = TRUE,
Posteventcoeffs = TRUE,
Nozeroline = FALSE,
Smpath = NULL)

I get this error message:

image

After doing some poking around, it looks like the cause is the linear hypothesis test on pretrends performed here, since linearHypothesis() takes pretrends_hyp as an argument and it includes "z_fd_lead3=0", but z_fd_lead3 isn't present in our estimated model

image

image


Are you both getting the same error? And if so, it seems like the solution is either:

I'm not sure though that we want to be performing the hypothesis test on our proxy variables? z_fd_lead3 is our provyIV in this example

Thank you!

ew487 commented 2 years ago

@nateschor Thanks for pointing this out! I'll try to run it on my computer to get the same error. Do you happen to know how the stata package handles this?

nateschor commented 2 years ago

@nateschor Thanks for pointing this out! I'll try to run it on my computer to get the same error. Do you happen to know how the stata package handles this?

@ew487 anytime! No I do not, we can always ask Jorge or Constantino if we're not able to find it in the xtevent source code

rcalvo12 commented 2 years ago

@nateschor @ew487 Fix to issue reported in https://github.com/JMSLab/eventstudyr/issues/11#issuecomment-1230475948 pushed in https://github.com/JMSLab/eventstudyr/commit/ca288545a53ecacfad4d6769aee88505500c085d.

nateschor commented 1 year ago

The eventstudyr plot for FHS now gets generated without errors, which is nice! However, there is some disagreement with the results from xtevent.

For R, I ran:

library(dplyr)
library(magrittr)
data <- df_sample_dynamic %>% select(y_base, z, id, t, x_r, eta_m)
results <- EventStudy(estimator = "FHS", data = data, outcomevar = "y_base", policyvar = "z",
idvar = "id", timevar = "t", controls = "x_r", proxy = "eta_m", proxyIV = "z_fd_lead3",
FE = TRUE, TFE = TRUE, post = 2, overidpost = 2, pre = 2, overidpre = 2, normalize = -2,
cluster = TRUE)

EventStudyPlot(estimates = results,
xtitle = "Event time",
ytitle = "Coefficient",
ybreaks = c(-15,-10, -5, 0, 5, 10, 15),
conf_level = .95,
Supt = .95,
num_sim = 1000,
seed = 1234,
Addmean = TRUE,
Preeventcoeffs = TRUE,
Posteventcoeffs = TRUE,
Nozeroline = FALSE,
Smpath = NULL)

image

For Stata, I grabbed simulation_data_dynamic.dta from here and ran:

xtevent y_base x_r, pol(z) panelvar(id) timevar(t) post(2) overidpost(2) pre(2) overidpre(2) norm(-2) proxy(eta_m) plot

image

Some takeaways:

In R, we use proxyIV = "z_fd_lead3" in eventstudyr, but I wasn't sure how to do that in xtevent (I'm not sure if that would impact the CIs and supt bands?). Other than that, it looks to me that we have bugs in AddSuptBand and AddCIs.

@ew487 @rcalvo12 please let me know what you think, thanks!

jmshapir commented 1 year ago

Thanks @nateschor!

@SimonFreyaldenhoven would you have time to sit down and look at this with @nateschor?

If even the pointwise CIs are differing by 2X between R and Stata, there's probably something basic we're doing differently between the two. I'm guessing a short conversation might surface what it is.

jorpppp commented 1 year ago

@nateschor bear in mind that xtevent does not cluster by default. I think you are clustering by unit in R. So the equivalent command in Stata should be xtevent y_base x_r, pol(z) panelvar(id) timevar(t) post(2) overidpost(2) pre(2) overidpre(2) norm(-2) proxy(eta_m) cluster(id) plot

@jmshapir @SimonFreyaldenhoven FYI

SimonFreyaldenhoven commented 1 year ago

@nateschor Thanks!

I agree with @jmshapir that it might make sense to sit down and take a look together what exactly we are estimating. I'll try to put something on our calendar (though it may have to be after the weekend).

One immediate issue would be that I am not sure we are estimating the same model. It looks like the excluded coefficients are at -2 and -3 in R, and -2 and -4 in Stata. If that's the case, I'd expect the CI's to indeed be wider in R.

Maybe we can use the standard -1 as the normalization for now, and then make sure we use the same lead as instrument. I believe one can choose the lead to be used as an IV explicitly in Stata too. It should be in the helpfile (or @jorpppp probably also knows)

A couple of other small points you could look at in the meantime if the problem persists (we might have to meet after the weekend), though maybe they are already addressed in a different issue:

jorpppp commented 1 year ago

@SimonFreyaldenhoven I'd be happy to join the meeting with @nateschor next week if it helps.

nateschor commented 1 year ago

@nateschor bear in mind that xtevent does not cluster by default. I think you are clustering by unit in R. So the equivalent command in Stata should be xtevent y_base x_r, pol(z) panelvar(id) timevar(t) post(2) overidpost(2) pre(2) overidpre(2) norm(-2) proxy(eta_m) cluster(id) plot

Thanks @jorpppp! I tried running xtevent y_base x_r, pol(z) panelvar(id) timevar(t) post(2) overidpost(2) pre(2) overidpre(2) norm(-2) proxy(eta_m) cluster(id) plot and get the error:

option cluster() not allowed

Is this the expected behavior?

image

jorpppp commented 1 year ago

@nateschor It's not, looks like a bug. Try:

xtevent y_base x_r, pol(z) panelvar(id) timevar(t) post(2) overidpost(2) pre(2) overidpre(2) norm(-2) proxy(eta_m) vce(cluster id) plot

jorpppp commented 1 year ago

@jmshapir Can we add Constantino to this repository, please? I'd like to link this issue to an issue on xtevent about this clustering thing. It seems that both cluster and vce(cluster) options are enabled for OLS estimation and for IV with reghdfe , but they do not seem to be active for IV estimation without reghdfe.

jmshapir commented 1 year ago

@jmshapir Can we add Constantino to this repository, please? I'd like to link this issue to an issue on xtevent about this clustering thing. It seems that both cluster and vce(cluster) options are enabled for OLS estimation and for IV with reghdfe , but they do not seem to be active for IV estimation without reghdfe.

@jorpppp done! When we post the issue in https://github.com/JMSLab/xtevent, I'd prefer that we make it self-contained rather than pointing back to https://github.com/JMSLab/eventstudyr, just because the latter is (for now) not public.

nateschor commented 1 year ago

@nateschor It's not, looks like a bug. Try:

xtevent y_base x_r, pol(z) panelvar(id) timevar(t) post(2) overidpost(2) pre(2) overidpre(2) norm(-2) proxy(eta_m) vce(cluster id) plot

That worked, thanks @jorpppp! Still seeing the discrepancy in CIs and Supt between xtevent and eventstudyr and will continue to investigate

jorpppp commented 1 year ago

@jmshapir Can we add Constantino to this repository, please? I'd like to link this issue to an issue on xtevent about this clustering thing. It seems that both cluster and vce(cluster) options are enabled for OLS estimation and for IV with reghdfe , but they do not seem to be active for IV estimation without reghdfe.

@jorpppp done! When we post the issue in https://github.com/JMSLab/xtevent, I'd prefer that we make it self-contained rather than pointing back to https://github.com/JMSLab/eventstudyr, just because the latter is (for now) not public.

Thanks @jmshapir! @Constantino-Carreto-Romero, can you check the above issue with cluster and vce, please? I remember we worked on this in issue 12 of xtevent but it seems there's still a syntax that does not work.

Constantino-Carreto-Romero commented 1 year ago

Thanks @jmshapir! @Constantino-Carreto-Romero, can you check the above issue with cluster and vce, please? I remember we worked on this in issue 12 of xtevent but it seems there's still a syntax that does not work.

@jorpppp sure, I will start to review it.

nateschor commented 1 year ago

Following up on @SimonFreyaldenhoven's https://github.com/JMSLab/eventstudyr/issues/11#issuecomment-1240070359, I used -1 for the normalization below

eventstudyr:

library(dplyr)
library(magrittr)
data <- df_sample_dynamic %>% select(y_base, z, id, t, x_r, eta_m)
results <- EventStudy(estimator = "FHS", data = data, outcomevar = "y_base", policyvar = "z",
idvar = "id", timevar = "t", controls = "x_r", proxy = "eta_m", proxyIV = "z_fd_lead3",
FE = TRUE, TFE = TRUE, post = 2, overidpost = 2, pre = 2, overidpre = 2, normalize = -1,
cluster = TRUE)

EventStudyPlot(estimates = results,
               xtitle = "Event time",
               ytitle = "Coefficient",
               ybreaks = c(-15, -10, -5, 0, 5, 10, 15),
               conf_level = .95,
               Supt = .95,
               num_sim = 1000,
               seed = 1234,
               Addmean = TRUE,
               Preeventcoeffs = TRUE,
               Posteventcoeffs = TRUE,
               Nozeroline = FALSE,
               Smpath = NULL)

image

My attempt at running the same model with xtevent:

xtevent y_base x_r, pol(z) panelvar(id) timevar(t) post(2) overidpost(2) pre(2) overidpre(2) norm(-1) proxy(eta_m) proxyiv(4) vce(cluster id) plot

image

@jorpppp @Constantino-Carreto-Romero is proxyiv(4) the correct way to select the 4th first-differenced lead as the proxy variable?

In addition to still having the disagreement on confidence intervals and supt band that was present in https://github.com/JMSLab/eventstudyr/issues/11#issuecomment-1239834389, we also have disagreements here in:

  1. which lead gets normalized (1 in eventstudyr vs. 3 in xtevent)
  2. the lags are all positive in eventstudyr vs. all negative in xtevent
  3. p-values
  4. mean of the dependent variable in the normalization period (which makes sense, given 1.)

@rcalvo12, @ew487 and I still need to do more FHS testing, so I'm not as concerned about 2 and 3 right now, but am curious about 1.

@jmshapir FYI

SimonFreyaldenhoven commented 1 year ago

@nateschor I agree we should first figure out 1). Changing the normalization in the IV estimates will affect everything else.

Try running

xtevent y_base x_r, pol(z) panelvar(id) timevar(t) post(2) pre(0) norm(-1) proxy(eta_m) proxyiv(4) vce(cluster id) plot

my guess is that the pre(int) option specifies the number of anticipation periods in both packages, but that the norm(int) option defines

Constantino-Carreto-Romero commented 1 year ago

@jorpppp @Constantino-Carreto-Romero is proxyiv(4) the correct way to select the 4th first-differenced lead as the proxy variable?

@nateschor that's correct.

nateschor commented 1 year ago

@SimonFreyaldenhoven and I met and were able to reconcile the differences between xtevent and eventstudyr! The issue was that the model I intented to run in xtevent was different from what I was actually running.

xtevent:

xtevent y_base, pol(z) panelvar(id) timevar(t) window(3) proxy(eta_m) proxyiv(3) vce(cluster id) plot

eventstudyr:

library(dplyr)
library(magrittr)
data <- df_sample_dynamic %>% select(y_base, z, id, t, x_r, eta_m)
results <- EventStudy(estimator = "FHS", data = data, outcomevar = "y_base", policyvar = "z",
idvar = "id", timevar = "t", proxy = "eta_m", proxyIV = "z_fd_lead3",
FE = TRUE, TFE = TRUE, post = 3, overidpost = 1, pre = 0, overidpre = 3, normalize = -1,
cluster = TRUE)

EventStudyPlot(estimates = results,
                xtitle = "Event time",
                ytitle = "Coefficient",
                ybreaks = seq(-5, 10, 5),
                conf_level = .95,
                Supt = .95,
                num_sim = 1000,
                seed = 1234,
                Addmean = TRUE,
                Preeventcoeffs = TRUE,
                Posteventcoeffs = TRUE,
                Nozeroline = FALSE,
                Smpath = NULL) + geom_hline(yintercept = c(-5, 5, 10), color = "gray")

The point estimates, standard errors, and lower & upper bounds on the confidence intervals now match (note that in eventstudyr we currently only compute confidence intervals for eventstudy coefficients, which is why the proxy eta_m has NA for its bounds):

image image

The accompanying plots look similar as well:

image image


We also noticed below that despite R and Stata giving the same coefficient estimate and standard error, they give different test statistics. 0.92212593 / 1.0615173 = 0.8686867 so R's is incorrect. We compute our own confidence intervals in eventstudyr so this should not be an issue, but flagging it here just in case.

Thank you!

image image

SimonFreyaldenhoven commented 1 year ago

@nateschor thanks! Looks great!

I think having point estimates and se's correct is a big step in the right direction.

One thing I noted in the figures above is that the "baseline value" (the shown value on the y-axis), and both p-values are all similar but different between Stata and R. I'm guessing the p-value difference might be related to the point you flagged above - that the z/t -statistics appear slightly off in R.

nateschor commented 1 year ago

@ew487 @rcalvo12 in https://github.com/JMSLab/eventstudyr/commit/5230a8f02fb04d967d508603d983256a7e3ea2a9 I pushed all of the changes used to create the tables and plots in https://github.com/JMSLab/eventstudyr/issues/11#issuecomment-1244109103

I modified https://github.com/JMSLab/eventstudyr/issues/11#issue-1309859657 with what I could think of as the remaining tasks for this issue, as well as who it might make the most sense to complete it. Please add to and edit the list!

Thank you!

ew487 commented 1 year ago

Hi @nateschor @SimonFreyaldenhoven, I think this issue

We also noticed below that despite R and Stata giving the same coefficient estimate and standard error, they give different test statistics. 0.92212593 / 1.0615173 = 0.8686867 so R's is incorrect. We compute our own confidence intervals in eventstudyr so this should not be an issue, but flagging it here just in case.

might originate in the EventStudyFHS() function that I wrote. For certain cases I implemented a degrees of freedom correction on the original R output to get standard errors matching Stata's, but I only edited that one part of the output. Maybe if I change the code to overwrite the t values also, then this will solve the problem.

jmshapir commented 1 year ago

For certain cases I implemented a degrees of freedom correction on the original R output to get standard errors matching Stata's, but I only edited that one part of the output. Maybe if I change the code to overwrite the t values also, then this will solve the problem.

@ew487 @nateschor @SimonFreyaldenhoven this is probably worth discussing "live," but my guess is that we want both t-statistics and F-statistics to be functions of the (corrected) variance-covariance matrix for the coefficients. (F-statistics matter here too because they are probably the basis for the tests we're reporting.)

nateschor commented 1 year ago

Per call:

@ew487 will change this to if (estimator == "FHS") and will modify FHS to allow multiple proxyIV variables and to default to using the strongest lead as an instrument when estimator == "FHS" and proxy is specified. We were unsure about how to read the final line below from the design document, but converged on the plan above.

After changes to FHS have stabilized, we will continue with the modifications here

image

jmshapir commented 1 year ago

@nateschor thanks!

Is the question what we mean by "strongest lead"?

nateschor commented 1 year ago

@jmshapir I think we are good on what is meant by "strongest lead"!

We were confused about the scenario where the user does not specify a value for proxyIV but is doing FHS. Just like stevent does, in this case we wanted to default to using the strongest lead. However, since the user is not explicitly "specifying" what to use for proxyIV (they are using a default), we were unsure if this behavior was in conflict with the language in the design document.

We converged on defaulting to using the strongest lead as an instrument when estimator == "FHS" and proxy is specified, but please let us know if you disagree!

jmshapir commented 1 year ago

Thanks @nateschor!

My thinking based on the original design doc is something like this:

I think (?) this is in line with your summary but let me know if you see it differently, thanks!

ew487 commented 1 year ago

@jmshapir I was the source of the disagreement here so I thought I should chime in. I had interpreted {should specify proxyIV iff estimator specified as FHS} to mean whenever FHS is specified, proxyIV must be specified also and vice versa--which leaves no room for a default. But at the end we decided this is not the interpretation to go with and to my understanding we decided to implement what you are saying here.

@nateschor please let me know if I've missed something.

nateschor commented 1 year ago

@ew487 just want to make sure we're on the same page--I'm planning on waiting until there are tests for EventStudyFHS before I write more code (examples and testing) that relies on FHS just in case your testing leads to changes in the main FHS function. Is that what you were thinking too?

Thanks!

ew487 commented 1 year ago

@nateschor Thanks for checking. Yes, that's what I was thinking too! What is your timeline for this? Are you looking to get started soon?

nateschor commented 1 year ago

@ew487 I have some extra time now so being able to start soon would be great!

ew487 commented 1 year ago

@nateschor Ok, I see! I have no more time to work on this this week but I can try to see if I can fit something in over the weekend. Alternatively, I could let you know what I am working on if you want to give it a go? Sorry for the holdup. Let me know what you want to do!

nateschor commented 1 year ago

@ew487 thanks!

Alternatively, I could let you know what I am working on if you want to give it a go?

At least so far, we've followed the convention of whoever wrote the initial function also writes the tests for that function. I can see an argument for having one person write the function and another person test it, but having one person specialize in each function seems like it has made it easier to fix bugs when they arise. As long as it's okay with you, my preference is to have you write the accompanying tests for FHS (but happy to discuss!)

I have no more time to work on this this week but I can try to see if I can fit something in over the weekend.

I definitely don't want you to have to work over the weekend for this if working over the weekend is something you don't normally do! I just wanted to make sure we were on the same page in terms of next steps and to keep the eventstudyr momentum going, especially since we're getting close to beginning the CRAN submission process:)

ew487 commented 1 year ago

@nateschor Got it. In that case I will do my best to pick up the pace here. Thanks for keeping this on track!

ew487 commented 1 year ago

Hi @jmshapir, I have an update and a question I was hoping you could help me with. It seems like this is what is going on. The unadjusted estimatr:iv_robust output exactly matches the results from ivregress 2sls in stata:

<html xmlns:o="urn:schemas-microsoft-com:office:office" xmlns:x="urn:schemas-microsoft-com:office:excel" xmlns="http://www.w3.org/TR/REC-html40">

  | base | estimatr -- | -- | -- FE,  TFE, !cluster | ivregress 2sls | unadjusted FE,  TFE,  cluster | ivregress 2sls | unadjusted FE, !TFE, !cluster | ivregress 2sls | unadjusted FE, !TFE,  cluster | ivregress 2sls | unadjusted !FE,  TFE, !cluster | ivregress 2sls | unadjusted !FE,  TFE,  cluster | ivregress 2sls | unadjusted !FE, !TFE, !cluster | ivregress 2sls | unadjusted !FE, !TFE,  cluster | ivregress 2sls | unadjusted

In xtevent, we are using xtivreg instead of ivregress 2sls for cases with FE and ivregress 2sls for cases without. When I overwrite the estimatr::iv_robust output with DF adjustment in cases with FE, this is to get the output to match with xtivreg:

<html xmlns:o="urn:schemas-microsoft-com:office:office" xmlns:x="urn:schemas-microsoft-com:office:excel" xmlns="http://www.w3.org/TR/REC-html40">

  | xtevent | estimatr -- | -- | -- FE,  TFE, !cluster | DNE | ??? FE,  TFE,  cluster | xtivreg | adjusted FE, !TFE, !cluster | DNE | ??? FE, !TFE,  cluster | xtivreg | adjusted !FE,  TFE, !cluster | ivregress 2sls | unadjusted !FE,  TFE,  cluster | ivregress 2sls | unadjusted !FE, !TFE, !cluster | ivregress 2sls | unadjusted !FE, !TFE,  cluster | ivregress 2sls | unadjusted

And now I am wondering whether we want to use adjusted or unadjusted for the cases with ???. Would you let me know what you think? Thanks!

jmshapir commented 1 year ago

Thanks @ew487 for a very helpful summary!

To clarify:

ew487 commented 1 year ago

Thanks @jmshapir, here is what I think is happening:

Does xtivreg permit the pathaways denoted as DNE? For example, in xtivreg can the user have time and unit fixed effects without clustering standard errors by unit? And, if so, do the standard errors match the "adjusted or "unadjusted" flavor?

I don't think it is possible to get non-cluster robust standard errors with xtivreg, fe. If you specify vce(robust) in a regression with fixed effects then xtivreg will automatically converted to vce(cluster id). It seems like this was designed this way deliberately because with fixed effects, non-cluster robust standard errors may be inconsistent.

Does DNE denote a pathway that is currently impossible under xtevent? For example, in xtevent if the user wants both time and unit fixed effects, is the user forced to cluster standard errors by unit?

Based on the code here I think xtevent implements fixed effects regressions via xtivreg, so based on the above it should not be possible to get non-cluster robust standard errors. In your example, the only other option is to use non-robust errors.