MIDS-at-Duke / unifying-data-science-final-project-pandemics-unemployment

unifying-data-science-final-project-epidemics-economy created by GitHub Classroom

Unemployment trends are the same across all states (with and without stay-at-home orders) #4

Closed josemoscoso-duke closed 4 years ago

josemoscoso-duke commented 4 years ago

@nickeubank, we have the following questions. While doing EDA on our data, we realized that even the states without a stay-at-home order have had a huge increase in unemployment claims in the last few weeks. We can’t do diff-in-diff at the state level because we would be violating the SUTVA assumption. We looked for county-level unemployment data, but we only found data aggregated by month, and we believe we need more data points after the order was placed to have a meaningful diff-in-diff. Lastly, even if we find more granular unemployment data at the county level, we are afraid that claims will have increased in ALL counties, even those without stay-at-home orders (the same behavior we observed at the state level). We would appreciate any suggestions you might have.

In the meantime, we will continue our analysis as it is and try to derive other interesting insights from the datasets we collected.

you-juli commented 4 years ago

In the regression analysis, we regressed weekly unemployment filings per 10,000 people on a stay-at-home order indicator (binary variable) and all other demographic control variables at the state level. The stay-at-home order is significant, with a very small p-value and a positive coefficient. (Please see the jupyter notebook in Regression_Analysis folder https://github.com/MIDS-at-Duke/unifying-data-science-final-project-pandemics-unemployment.) Can we say that although we cannot quantify the effect of the stay-at-home order from the diff-in-diff graph, we can see the significant impact of the order and quantify it through Regression Analysis? Also, I guess the reason we cannot find what we are looking for in the diff-in-diff graph is that COVID-19 is a global pandemic and all the states were directly or indirectly impacted even before the order.

nickeubank commented 4 years ago

Comments tomorrow!

abbarcenasj commented 4 years ago

Thanks, @nickeubank!

nickeubank commented 4 years ago

While doing EDA on our data, we realized that even the states without a stay-at-home order have had a huge increase in unemployment claims in the last few weeks. We can’t do diff-in-diff at the state level because we would be violating the SUTVA assumption.

My guess is that while you were focused on statewide orders, lots of cities were imposing stay at home orders before the state level orders came out.

We looked for county-level unemployment data, but we only found data aggregated by month, and we believe we need more data points after the order was placed to have a meaningful diff-in-diff.

A difference-in-differences estimate can be computed with only four data points: the treated group before and after treatment, and the untreated group before and after treatment.
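To make the four-data-point case concrete, here is a minimal sketch of the 2x2 calculation. The numbers are made up for illustration and are not taken from the project data; the function name is also just a label chosen here.

```python
def diff_in_diff(pre_treated, post_treated, pre_control, post_control):
    """Difference-in-differences from four group-by-period averages."""
    treated_change = post_treated - pre_treated
    control_change = post_control - pre_control
    # The control group's change stands in for what would have happened
    # to the treated group without treatment.
    return treated_change - control_change


# Hypothetical claims per 10,000 people (illustrative values only).
estimate = diff_in_diff(
    pre_treated=20.0, post_treated=150.0,
    pre_control=18.0, post_control=120.0,
)
print(estimate)
```

Both groups rise sharply in this example, which is the pattern described in the question; the estimate only picks up how much more the treated group rose.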

Lastly, even if we find more granular unemployment data at the county level, we are afraid that claims will have increased in ALL counties, even those without stay-at-home orders (the same behavior we observed at the state level).

Given why I think you are seeing the effects you are seeing above, I don’t think that would happen at the same level. With that said, it would still be interesting to see if government ordered stay at home directives don’t have a marginal effect above and beyond what happens when people choose to social distance as individuals.

nickeubank commented 4 years ago

In the regression analysis, we regressed weekly unemployment filings per 10,000 people on a stay-at-home order indicator (binary variable) and all other demographic control variables at the state level. The stay-at-home order is significant, with a very small p-value and a positive coefficient. (Please see the jupyter notebook in Regression_Analysis folder https://github.com/MIDS-at-Duke/unifying-data-science-final-project-pandemics-unemployment.) Can we say that although we cannot quantify the effect of the stay-at-home order from the diff-in-diff graph, we can see the significant impact of the order and quantify it through Regression Analysis?

No. My guess is that in that specification, your stay-at-home indicator variable is just proxying for being a later date (further into the economic crisis). Put differently, the observations under a stay-at-home order tend to occur later in time, and so have very different potential outcomes from the observations without stay-at-home orders, which happened earlier in your time period.
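The proxying problem can be illustrated with a small constructed example. In the sketch below, every state follows the same nationwide claims path, so the order has no effect by construction, yet a comparison that ignores time still shows a large gap. All states, weeks, and values are hypothetical.

```python
from statistics import mean

# Hypothetical data: claims per 10,000 people follow the same nationwide
# path in every state, so the true effect of an order is zero by construction.
weekly_claims = [10, 12, 40, 90, 120, 130]
# Week in which each (made-up) state's order starts; None means no order.
order_start_week = {"A": 2, "B": 3, "C": 4, "D": None}

rows = []
for state, start in order_start_week.items():
    for week, claims in enumerate(weekly_claims):
        under_order = start is not None and week >= start
        rows.append({"state": state, "week": week,
                     "order": under_order, "claims": claims})

# Pooled comparison that ignores time, which is roughly what a regression
# without any control for the date ends up measuring.
naive_gap = (mean(r["claims"] for r in rows if r["order"])
             - mean(r["claims"] for r in rows if not r["order"]))

# The same comparison made within each week where both groups exist.
within_week_gaps = []
for week in range(len(weekly_claims)):
    treated = [r["claims"] for r in rows if r["week"] == week and r["order"]]
    control = [r["claims"] for r in rows if r["week"] == week and not r["order"]]
    if treated and control:
        within_week_gaps.append(mean(treated) - mean(control))

print(f"naive gap: {naive_gap:.1f}")
print(f"within-week gaps: {within_week_gaps}")
```

The pooled gap is large and positive only because observations under an order come from later weeks; once the comparison is made within a week, the gap disappears.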

abbarcenasj commented 4 years ago

Ok, that makes sense. Thanks for the quick reply, @nickeubank. This is what I understood as the next steps:

1) Pick one state in which some counties have placed stay-at-home orders and some others haven't. Look at the monthly unemployment RATE (instead of unemployment claims as proposed in the beginning). Find two counties (control and treatment) in the same state that have parallel trends and perform (1) pre-post analysis and (2) diff-in-diff analysis. Am I right?

2) Do not report the linear regression results because we are not capturing the stay-at-home order effect that we want.

As a suggestion, we could try to find as many pairs of counties as possible to perform diff-in-diff and maybe average the result to have a more general understanding of the effect of the stay-at-home order. Does that make sense, @nickeubank?

nickeubank commented 4 years ago

Pick one state in which some counties have placed stay-at-home orders and some others haven't. Look at the monthly unemployment RATE (instead of unemployment claims as proposed in the beginning). Find two counties (control and treatment) in the same state that have parallel trends and perform (1) pre-post analysis and (2) diff-in-diff analysis. Am I right?

That would definitely work, but you probably can’t do a parallel trends analysis since you won’t have much data for the pre-period, will you?

But you also don’t have to limit yourself to two counties (and you shouldn’t!). Use all the counties in a given state. You could even do this with all the counties in the US; the challenge is finding data on when stay-at-home orders went into effect at the county level.

The other strategy is to estimate which counties probably have stay-at-home orders in place based on changes in behavior evident in the SafeGraph data. Basically, use the degree of change in mobility as your treatment variable. You are no longer exactly measuring the effects of stay-at-home orders (since some reductions in mobility are the result of people just adopting social distancing as individuals), but you have all the data.
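One way to read "degree of change in mobility" is as a fractional drop relative to a pre-pandemic baseline. The sketch below assumes a generic per-county mobility index; the county names, the index, and its values are hypothetical and do not correspond to actual SafeGraph fields.

```python
def mobility_reduction(baseline, current):
    """Fractional drop in mobility relative to a pre-pandemic baseline."""
    if baseline <= 0:
        raise ValueError("baseline must be positive")
    return (baseline - current) / baseline


# Hypothetical mobility index per county (illustrative values only).
baseline_index = {"county_a": 100.0, "county_b": 80.0}
current_index = {"county_a": 60.0, "county_b": 76.0}

# Continuous treatment variable: 0 means no change, 1 means mobility fell to zero.
treatment = {
    county: mobility_reduction(baseline_index[county], current_index[county])
    for county in baseline_index
}
print(treatment)
```

A continuous measure like this replaces the binary order indicator, at the cost noted above: it mixes government orders with voluntary distancing.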

abbarcenasj commented 4 years ago

I believe we can get many pre-period observations of the unemployment rate at the county level.

One last question: If we use many counties for the diff-in-diff, we have to compare them pair-wise (a control county vs. a treatment county). What should the result look like? The difference in unemployment for EACH of the pairs we compare? Is it OK to take the average of the differences as a concluding remark on the effect in each state and/or in the USA?

Is my question clear? @nickeubank

nickeubank commented 4 years ago

If we use many counties for the diff-in-diff, we have to compare them pair-wise (a control county vs. a treatment county)

No, not at all – you just put them in a regression! Include fixed effects for each county and for each month, and you’re good to go!

The problem with the regression that Juli mentioned isn’t that you did a regression, it’s that without controlling for time somehow, you had a strong omitted variable bias, since stay-at-home orders are strongly correlated with being later in the epidemic.
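Written out, the regression described here is commonly expressed as a two-way fixed effects model. The notation below is an assumption introduced for illustration and does not come from the thread:

```latex
y_{ct} = \alpha_c + \gamma_t + \beta \, D_{ct} + \varepsilon_{ct}
```

Here $y_{ct}$ would be the unemployment rate in county $c$ in month $t$, $\alpha_c$ a county fixed effect, $\gamma_t$ a month fixed effect, and $D_{ct}$ an indicator equal to 1 when a stay-at-home order is in effect. Under the usual parallel-trends assumption, $\beta$ plays the role of the diff-in-diff estimate, so no pair-wise matching of counties is needed.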

josemoscoso-duke commented 4 years ago

Thank you Nick

In the Safe Graph data we have two variables to measure the reduction of mobility of people:

Are there any criteria for choosing one over the other?

Best, Jose Luis

nickeubank commented 4 years ago

That one’s on you to think through I think :)

you-juli commented 4 years ago

@nickeubank Sorry, although I read the comments a few times, I still don't really understand the exact problem with the regression. Are you suggesting there is a time series problem when you mention "controlling for time" and "stay at home orders are strongly correlated with later in the epidemic"? Before the regression analysis, I did ensure there is no multicollinearity between features, which is one of the assumptions for regression, but I don't really know how to check the correlation with later time. Could you please further enlighten us? I am also free to have a call with you any time tomorrow if it's easier to explain through conversation.

nickeubank commented 4 years ago

What a great learning opportunity! @josemoscoso-duke or @abbarcenasj, why don’t we start with one of you trying to answer @you-juli's question?

abbarcenasj commented 4 years ago

Sure, @nickeubank. What I understood is that for the states without an order, we are averaging the last year of claims, while for the states with an order, we are averaging the claims after the order was placed (around the last 5 weeks or so). This means that our "claims" variable for the states without an order will always be lower, because we are including information from the last 52 weeks (when the economic crisis did not yet exist).

We have already explored averaging the last 5 weeks or so for those states without order, and the coefficient is still positive but less significant, as you suggested.

We are collecting data on a county level and we will run the same regression to get rid of the problem of states with different stay-at-home order starting dates. We are also looking at mobility data to define an "individual isolating measure" and look at the effect of individual isolating measures on unemployment.

nickeubank commented 4 years ago

@abbarcenasj Exactly!

But you don't have to reduce your sample to fix this -- you just need to control for the week of the data! If you have fixed effects for each week of data, that will control for overall trends in unemployment across the country (due to things like people choosing to social-distance on their own).

(Also, you can add county fixed effects! That will end up giving you a (continuous) difference in difference -- you're comparing variation within counties over time (due to the county fixed effects), while also controlling for changes over time that affect everyone (due to the time fixed effects))
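The combination of county and week fixed effects can be sketched without any regression library. For a balanced panel with a single regressor, subtracting county means and week means from both variables and then fitting a simple slope gives the same coefficient as including a dummy for every county and week. The panel below is constructed with a known effect, and every name and number in it is hypothetical.

```python
from statistics import mean


def two_way_demean(panel):
    """Remove unit and period means from a balanced panel.

    panel maps (unit, period) -> value.
    """
    units = sorted({u for u, _ in panel})
    periods = sorted({t for _, t in panel})
    unit_mean = {u: mean(panel[(u, t)] for t in periods) for u in units}
    period_mean = {t: mean(panel[(u, t)] for u in units) for t in periods}
    grand_mean = mean(panel.values())
    return {(u, t): v - unit_mean[u] - period_mean[t] + grand_mean
            for (u, t), v in panel.items()}


def slope(x, y):
    """OLS slope of y on x (with an intercept) for dicts sharing keys."""
    x_bar, y_bar = mean(x.values()), mean(y.values())
    num = sum((x[k] - x_bar) * (y[k] - y_bar) for k in x)
    den = sum((x[k] - x_bar) ** 2 for k in x)
    return num / den


# Hypothetical panel: 3 counties observed for 4 weeks.
TRUE_EFFECT = 30.0
county_effect = {"A": 4.0, "B": 6.0, "C": 5.0}   # fixed county differences
week_effect = [0.0, 5.0, 60.0, 100.0]            # nationwide shock by week
mobility_drop = {                                # continuous treatment
    "A": [0.0, 0.1, 0.5, 0.6],
    "B": [0.0, 0.0, 0.2, 0.5],
    "C": [0.0, 0.1, 0.1, 0.2],
}

treatment = {(c, w): mobility_drop[c][w]
             for c in mobility_drop for w in range(len(week_effect))}
claims = {(c, w): county_effect[c] + week_effect[w] + TRUE_EFFECT * treatment[(c, w)]
          for (c, w) in treatment}

pooled_slope = slope(treatment, claims)  # ignores county and week
fe_slope = slope(two_way_demean(treatment), two_way_demean(claims))

print(f"pooled slope: {pooled_slope:.1f}")
print(f"two-way fixed effects slope: {fe_slope:.1f}")
```

The pooled slope is inflated because the treatment rises in the same weeks as the nationwide shock, while the demeaned slope recovers the effect built into the data. On real, unbalanced data a regression package with county and week dummies would be the more practical route.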

abbarcenasj commented 4 years ago

@nickeubank, so far, for that regression we only have one row per state, and the other covariates we're controlling for reflect the most recent date available. If, instead of averaging the claims variable, we include several weeks of claims data per state (meaning several rows per state), is it OK to simply repeat the other covariates' values in each row? I ask because I don't think we can get weekly data for each of our additional covariates.

nickeubank commented 4 years ago

So if you use county (or state) fixed effects, you don't need the other covariates. State fixed effects account for anything non-time-varying.
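A quick way to see why the repeated covariates become unnecessary: a covariate that takes the same value on every row for a state has no within-state variation left once state means are removed, which is what state fixed effects do. The states and values below are hypothetical.

```python
from statistics import mean

# Hypothetical state-level covariate repeated on every weekly row.
rows = (
    [{"state": "NC", "week": w, "median_age": 38.9} for w in range(4)]
    + [{"state": "UT", "week": w, "median_age": 31.0} for w in range(4)]
)

# State means of the covariate, as absorbed by state fixed effects.
state_mean = {
    state: mean(r["median_age"] for r in rows if r["state"] == state)
    for state in {r["state"] for r in rows}
}

# Within-state variation remaining after removing the state mean.
within = [r["median_age"] - state_mean[r["state"]] for r in rows]
print(within)
```

Because nothing remains after demeaning, the covariate is perfectly collinear with the state dummies and its coefficient cannot be estimated separately.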