karinalungo opened this issue 4 years ago (status: Open)
Good question. For the purposes of this class: if two independent variables are perfectly correlated (-1 or 1), we do not want to include both in our model. If they are not perfectly correlated, we want to leave both in so we can try to isolate the independent effect of each. If you take one of these variables out, you have to be able to explain why you did so (the methodology of your model), and it is important to note that removing a variable will change the slope of the one that remains. For more reading on multicollinearity, and a few ways to think about what is happening when you have two highly correlated independent variables, see these two sources. If you have additional questions, please post them.
https://www.statisticshowto.com/multicollinearity/
https://statisticsbyjim.com/regression/multicollinearity-in-regression-analysis/
In general, when two covariates are highly correlated (high multicollinearity), they will essentially cancel each other out in the model: the shared variance gets split between them, and the individual slope estimates become unstable.
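Here is what that canceling out looks like in a minimal simulation (assumed setup, not from the course materials; variable names are mine). Two nearly collinear predictors share almost all of their variance, so the model can only pin down the sum of their slopes, and the variance inflation factor (VIF) for each is huge:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200
x1 = rng.normal(size=n)
x2 = x1 + rng.normal(scale=0.05, size=n)      # nearly identical to x1
y = 1.0 * x1 + 1.0 * x2 + rng.normal(size=n)  # true slopes are 1 and 1

# OLS with both collinear predictors included
X = np.column_stack([np.ones(n), x1, x2])
beta = np.linalg.lstsq(X, y, rcond=None)[0]

# VIF for x2 = 1 / (1 - R^2) from regressing x2 on x1
r = np.corrcoef(x1, x2)[0, 1]
vif = 1.0 / (1.0 - r**2)
print("individual slopes:", beta[1], beta[2])
print("sum of slopes:", beta[1] + beta[2], "VIF:", vif)
```

Rerunning with different seeds tends to show the two individual slopes swinging widely while their sum stays near 2, which is exactly the canceling-out behavior described above.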
One important consideration is when multicollinearity is a good thing and when it's a bad thing.
BAD:
When you have a variable that is important for your study (you want a good estimate of its slope) and a control variable that is another measure of the same construct, including both will prevent you from interpreting the results accurately. For example, income and years of education will be highly correlated, and thus will mostly cancel each other out. Another example might be including high school GPA and SAT scores in a model predicting academic performance in college: those two scores will be highly correlated and essentially measure the same thing.
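A quick diagnostic for the GPA/SAT case (simulated numbers, purely illustrative) is to check the pairwise correlation before deciding whether two variables are redundant measures of the same construct:

```python
import numpy as np

rng = np.random.default_rng(2)
n = 500
gpa = rng.normal(3.0, 0.4, size=n)
sat = 300 * gpa + rng.normal(scale=30, size=n)  # SAT as a noisy proxy for the same construct

r = np.corrcoef(gpa, sat)[0, 1]
print("corr(GPA, SAT):", r)  # close to 1 is a red flag for redundancy
```

A correlation this close to 1 suggests the two variables carry nearly the same information, so interpreting both slopes in one model would be misleading.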
OK:
If these are both included as control variables to improve the model, but you are not directly interpreting either, then it's not a problem to include both.
GOOD:
A competing hypothesis is specifically an alternative explanation to your policy variables. In this case, if your measure of the competing hypothesis is highly correlated with your policy variable, you definitely want to include it in the model, because you are trying to eliminate all of the alternative explanations. It is actually a good thing if your policy effect disappears once you include the competing hypothesis, because it might mean the other explanation is causing the outcome, not your policy. Your policy slope will only be significant if it can explain some portion of the DV independent of all other explanations.
For example, adding SES in the education model makes classroom size no longer significant (until you control for TQ). In the crack babies study, once you account for nutrition the impact of drug use almost disappears.
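The SES pattern can be sketched with simulated data (hypothetical variable names, not the actual class data): a competing explanation z drives the outcome and is correlated with the policy variable x, so x looks significant alone but its slope collapses once z enters the model.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 200
z = rng.normal(size=n)                         # competing explanation (e.g., SES)
x = 0.9 * z + rng.normal(scale=0.4, size=n)    # policy variable, correlated with z
y = 2.0 * z + rng.normal(size=n)               # outcome driven by z, not by x

# Model 1: policy variable alone -- its slope soaks up z's effect
b_alone = np.linalg.lstsq(np.column_stack([np.ones(n), x]), y, rcond=None)[0][1]

# Model 2: add the competing hypothesis -- the policy slope collapses toward zero
b_both = np.linalg.lstsq(np.column_stack([np.ones(n), x, z]), y, rcond=None)[0][1]

print("policy slope alone:", b_alone)
print("policy slope with competing hypothesis:", b_both)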
Multicollinearity is mostly a problem when you have two variables that are measuring the same thing, and thus they will be highly correlated, and thus cancel each other out. Again, not a problem if they are just control variables. But if you plan to interpret either slope you would want to make sure you are not accidentally minimizing their influence (minimizing the slope) by including redundant measures.
A competing hypothesis is not a redundant measure. It is an alternative explanation.
In looking at the answers to this week's assignment, I see a comment about a competing hypothesis... I pasted the comment below. My question is: is it best practice to include competing / highly correlated variables in a model? My intuition was that if two variables are highly correlated then they could be considered "the same," and therefore I could choose either one of them but not both, since I would be "duplicating" influencing variables in the model... can you talk more about the best practice for handling variables that are highly correlated?
"This is a COMPETING HYPOTHESIS because we see high levels of coffee consumption and high levels of stress together, so it is hard to say which is actually causing increased heart rates. We need to include both in the same model to isolate the independent effects of each. Or similarly, in this case if we remove stress from the model we would expect that the slope associated with caffeine will change."