General question about cluster analysis

mtwelker commented 2 years ago

Prof. @AntJam-Howell, when doing cluster analysis, is it generally best to have as many relevant variables as possible for your inputs, or do you need to be cautious about using correlated variables? We learned that in OLS regression it generally isn't a good idea to include highly correlated variables in your model (for example, percent with a high-school diploma or less and percent with a college degree or higher, or percent married and percent unmarried), but we threw a ton of correlated variables into this cluster analysis for Phoenix. Is that typically how cluster analysis is done? Thank you! Michelle Welker

AntJam-Howell commented 2 years ago

Hi @mtwelker, Great question. As with many concepts, there are varying degrees of (dis)agreement about the answer to this question and how to handle it. In the simplest scenario, let's assume that we enter two perfectly correlated variables into a cluster analysis. The two variables essentially measure the exact same concept (say income and some other perfectly correlated variable, both meant to measure poverty), so that concept (poverty) receives twice the weight when clustering compared to variables that represent different concepts. This would be OK if we believe that poverty is twice as important as the other concepts, but if we want all concepts to carry equal weight in determining clusters, then high collinearity is problematic. We would rely on theory to drive the selection of variables to help resolve this issue. In practice, it is also good to run the clustering iteratively, adding (or removing) variables and checking how sensitive the cluster outputs are.
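To make the double-weighting point concrete, here is a minimal sketch in plain Python. It is not the course's lab code: the three tracts, their values, and the variable names (poverty, age) are made up for illustration, and it assumes the variables are already standardized and that clustering is based on Euclidean distance. It shows how adding a perfectly correlated copy of one variable can change which observation is considered "closest".

```python
import math

def euclidean(p, q):
    """Euclidean distance between two equal-length tuples."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

def nearest(name, points):
    """Return the name of the other point closest to `name`."""
    others = [k for k in points if k != name]
    return min(others, key=lambda k: euclidean(points[name], points[k]))

# Hypothetical standardized values for three tracts: (poverty, age)
tracts = {"A": (0.0, 0.0), "B": (1.0, 0.0), "C": (0.0, 1.2)}

# Same tracts, plus a second variable perfectly correlated with poverty
# (here simply a copy), so poverty enters the distance twice.
tracts_dup = {k: (pov, pov, age) for k, (pov, age) in tracts.items()}

print(nearest("A", tracts))      # B: a poverty gap of 1.0 beats an age gap of 1.2
print(nearest("A", tracts_dup))  # C: the poverty gap now counts twice
```

With the duplicated variable, the squared distance between A and B goes from 1.0 to 2.0 while the distance between A and C is unchanged, so A is pulled toward C instead. Rerunning a comparison like this with variables added or removed is a small-scale version of the sensitivity check described above.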

Hope that helps. best,

mtwelker commented 2 years ago

That makes sense. Thank you!

Watts-College / cpp-529-fall-2021
