DS4PS / cpp-523-fall-2020

http://ds4ps.org/cpp-523-fall-2020/

Omitted variable bias and Multicollinearity #8

Open ASU-RASHWAN opened 3 years ago

ASU-RASHWAN commented 3 years ago

I have the following concerns:

1- Do omitted variable bias and multicollinearity have the same effect on the regression model?

2- What are the most effective ways to detect omitted variable bias and multicollinearity in our regression model?

3- How costly is it to omit a variable from our regression model that is correlated with both the DV and the IV? Can we exclude it if it does not help explain any more of the unexplained error?

4- Can we still consider omitting a variable if our policy slope is significant and then becomes insignificant once the variable is added? Can we still treat our model as meaningful even if it is biased?

Schlinkert commented 3 years ago

We talked a lot about multicollinearity and omitted variables in this week's review. Please take a look at the first page of this week's review guide (also pasted below), and review this week's video recording because I think it will give you a good idea about how to think about these topics. If you still have questions, please post them, and I will get back to you.

**Multicollinearity**

There are two types of multicollinearity to address. The first is perfect multicollinearity, which happens when we use dummy variables (we have to omit one category); we will address this further next week. The other is when variables are highly correlated. Lots of textbooks refer to this as the multicollinearity problem, but that suggests you should just avoid including variables that are highly correlated, which is why it can be a misleading concept. The important question is when the high correlation will cause inferential problems. That happens mostly when you include redundant measures, which can occur frequently without realizing it since so many measures can be proxies of other measures. Zip codes can proxy for race, for example, or socioeconomic status for education, etc.

There is no such thing as multicollinearity in the dependent variable. There is only the strength of the correlation between the outcome of interest and the explanatory variables in the model. Multicollinearity specifically refers to the correlation between independent variables.

Multicollinearity leads to inflated standard errors. This is because it removes a lot of the independent variation in X, making the denominator of the standard error smaller and the whole term bigger.

We don't care about multicollinearity if it occurs between two control variables. It might change their standard errors, but we don't necessarily pay attention to the significance level of the controls, just of the policy variable. Highly correlated control variables will still do their job: accounting for extra variance in the dependent variable (and thus making the standard error of the policy variable smaller) and also making sure that there is no omitted variable bias in the policy slope. The problem occurs when our policy variable is highly correlated with a control. In this case including the control could inflate the standard error of the policy variable, and we could see a statistically non-significant slope.

The takeaway: it is more important to refrain from including redundant measures of the same concept than to focus on the level of correlation and whether it qualifies as multicollinearity. When you decide to leave a control variable out of your model, you need to be able to justify why you left it out and explain the impact that decision will have on the final results of your analysis or prediction. A way to see this in your model is that when you add another control, your standard errors go way up.
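A minimal simulation sketch of these two points, not from the course materials and with hypothetical variable names (`policy`, `ses`, `ses_proxy`), using Python with numpy and statsmodels. It shows that omitting a confounder correlated with both the policy variable and the outcome biases the policy slope, while adding a nearly redundant control leaves slopes roughly unbiased but inflates standard errors.

```python
# Hypothetical illustration, assuming numpy and statsmodels are installed.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)
n = 1000

# A confounder (e.g. socioeconomic status) that drives both the policy
# variable and the outcome; the true policy effect is 2.0.
ses = rng.normal(size=n)
policy = 0.7 * ses + rng.normal(size=n)
y = 2.0 * policy + 3.0 * ses + rng.normal(size=n)

# (1) Omitted variable bias: leave ses out and the policy slope is inflated.
m_short = sm.OLS(y, sm.add_constant(policy)).fit()
m_full = sm.OLS(y, sm.add_constant(np.column_stack([policy, ses]))).fit()
print("policy slope, ses omitted :", round(m_short.params[1], 2))  # biased, well above 2
print("policy slope, ses included:", round(m_full.params[1], 2))   # close to 2

# (2) Multicollinearity: add a nearly redundant proxy of ses as a second
# control; the standard errors of the collinear controls blow up.
ses_proxy = ses + rng.normal(scale=0.05, size=n)
m_redund = sm.OLS(
    y, sm.add_constant(np.column_stack([policy, ses, ses_proxy]))
).fit()
print("SE of ses, one measure :", round(m_full.bse[2], 3))
print("SE of ses, with proxy  :", round(m_redund.bse[2], 3))  # much larger
```

The same logic applies in R or any other regression software; the point is only that the bias shows up in the slope when the confounder is omitted, while the redundancy shows up in the standard errors when both measures are included.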