DS4PS / cpp-523-fall-2020

http://ds4ps.org/cpp-523-fall-2020/

Omitted variable bias and Multicollinearity #8

Open ASU-RASHWAN opened 3 years ago

ASU-RASHWAN commented 3 years ago

I have the following concerns:

1- Do omitted variable bias and multicollinearity have the same effect on the regression model?

2- What are the most effective ways to detect omitted variable bias and multicollinearity in our regression model?

3- How costly is it to omit a variable from our regression model that is correlated with both the DV and the IV? Can we exclude it if it does not help explain any more of the unexplained error?

4- Can we still consider omitting a variable if our policy slope is significant and then becomes insignificant once the variable is added? Can we still treat our model as meaningful even if it is biased?

Schlinkert commented 3 years ago

We talked a lot about multicollinearity and omitted variables in this week's review. Please take a look at the first page of this week's review guide (also pasted below), and review this week's video recording because I think it will give you a good idea about how to think about these topics. If you still have questions, please post them, and I will get back to you.

**Multicollinearity**

There are two types of multicollinearity to address. The first is perfect multicollinearity, which happens when we use dummy variables (we have to omit one category); we will address this further next week. The other is when variables are highly correlated. Lots of textbooks refer to this as the multicollinearity problem, but that suggests you should just avoid including variables that are highly correlated, which is why it can be a misleading concept. The important question is when the high correlation will cause inferential problems. That happens mostly when you include redundant measures, which can occur frequently without realizing it since so many measures can be proxies of other measures. Zip codes can proxy for race, for example, or socioeconomic status for education, etc.

There is no such thing as multicollinearity in the dependent variable. There is only the strength of the correlation between the outcome of interest and the explanatory variables in the model. Multicollinearity specifically refers to the correlation between independent variables.

Multicollinearity leads to inflated standard errors. This is because it removes a lot of the independent variation in X, making the denominator of the standard error smaller and the whole term bigger.

We don't care about multicollinearity if it occurs between two control variables. It might change their standard errors, but we don't necessarily pay attention to the significance level of the controls, just of the policy variable. Highly correlated control variables will still do their job: accounting for extra variance in the dependent variable (and thus making the standard error of the policy variable smaller) and also making sure that there is no omitted variable bias in the policy slope. The problem occurs when our policy variable is highly correlated with a control. In this case including the control could inflate the standard error of the policy variable, and we could see a statistically non-significant slope.

The takeaway: it is more important to refrain from including redundant measures of the same concept than to focus on the level of correlation and whether it qualifies as multicollinearity. When you decide to leave a control variable out of your model, you need to be able to justify why you left it out and explain the impact that decision will have on the final results of your analysis or prediction. A way to see this in your model is that when you add another control, your standard errors go way up.
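A minimal simulation sketch of these two points, not from the course materials and with hypothetical variable names (`policy`, `ses`, `ses_proxy`), using Python with numpy and statsmodels. It shows that omitting a confounder correlated with both the policy variable and the outcome biases the policy slope, while adding a nearly redundant control leaves slopes roughly unbiased but inflates standard errors.

```python
# Hypothetical illustration, assuming numpy and statsmodels are installed.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)
n = 1000

# A confounder (e.g. socioeconomic status) that drives both the policy
# variable and the outcome; the true policy effect is 2.0.
ses = rng.normal(size=n)
policy = 0.7 * ses + rng.normal(size=n)
y = 2.0 * policy + 3.0 * ses + rng.normal(size=n)

# (1) Omitted variable bias: leave ses out and the policy slope is inflated.
m_short = sm.OLS(y, sm.add_constant(policy)).fit()
m_full = sm.OLS(y, sm.add_constant(np.column_stack([policy, ses]))).fit()
print("policy slope, ses omitted :", round(m_short.params[1], 2))  # biased, well above 2
print("policy slope, ses included:", round(m_full.params[1], 2))   # close to 2

# (2) Multicollinearity: add a nearly redundant proxy of ses as a second
# control; the standard errors of the collinear controls blow up.
ses_proxy = ses + rng.normal(scale=0.05, size=n)
m_redund = sm.OLS(
    y, sm.add_constant(np.column_stack([policy, ses, ses_proxy]))
).fit()
print("SE of ses, one measure :", round(m_full.bse[2], 3))
print("SE of ses, with proxy  :", round(m_redund.bse[2], 3))  # much larger
```

The same logic applies in R or any other regression software; the point is only that the bias shows up in the slope when the confounder is omitted, while the redundancy shows up in the standard errors when both measures are included.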