102207429 / Expedia

Data Science and Big Data Analytics

How to make the correlation or collinearity of variables low #9

Open GregoryWu opened 7 years ago

GregoryWu commented 7 years ago

If we can find a way to bring down the correlation between variables, the variance of the trees should come down as well, which should lead to better performance.

How do you think we can lower the correlation between variables? The only way I can think of is to delete variables that are correlated with others and keep just one. (e.g. if variables A, B, C, and D are highly correlated, we can keep one of them.)
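A minimal sketch of this deletion idea, in pure Python (the threshold and variable names here are hypothetical, not from the thread): walk through the features and drop any column whose Pearson correlation with an already-kept column exceeds the threshold, so each correlated group keeps exactly one representative.

```python
import math

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

def drop_correlated(features, threshold=0.9):
    """Greedily keep one representative from each highly correlated group.

    `features` maps a column name to its list of values; a column is
    dropped if it correlates above `threshold` (in absolute value) with
    a column that has already been kept.
    """
    kept = {}
    for name, col in features.items():
        if all(abs(pearson(col, other)) < threshold for other in kept.values()):
            kept[name] = col
    return list(kept)

# B is an exact multiple of A, so only one of the pair survives:
cols = {"A": [1, 2, 3, 4], "B": [2, 4, 6, 8], "C": [1, 0, 1, 0]}
print(drop_correlated(cols))  # → ['A', 'C']
```

Which member of each group survives depends on iteration order; a fancier version could keep the column most correlated with the target instead of the first one seen.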

102207429 commented 7 years ago

Sorry for the late reply.

We can use PCA (principal component analysis) or factor analysis to remove the features that contribute little. Or we can use ridge regression, which accepts some bias in exchange for lower variance. But I would prefer PCA, since it has performed better in my experience.
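A short sketch of the PCA step with plain NumPy (a sketch only; the thread does not specify a library): center the data, eigendecompose the covariance matrix, and project onto the top components, which are uncorrelated by construction.

```python
import numpy as np

def pca(X, n_components):
    """Project rows of X onto the top principal components.

    Returns the transformed data and the fraction of total variance
    the kept components explain.
    """
    Xc = X - X.mean(axis=0)                 # center each feature
    cov = np.cov(Xc, rowvar=False)          # feature covariance matrix
    eigvals, eigvecs = np.linalg.eigh(cov)  # eigenvalues in ascending order
    order = np.argsort(eigvals)[::-1]       # re-sort descending
    components = eigvecs[:, order[:n_components]]
    explained = eigvals[order[:n_components]].sum() / eigvals.sum()
    return Xc @ components, explained

# Two nearly collinear features collapse onto one component:
rng = np.random.default_rng(0)
x = rng.normal(size=100)
X = np.column_stack([x, 2 * x + rng.normal(scale=0.01, size=100)])
Z, frac = pca(X, n_components=1)
```

With almost perfectly correlated columns like these, the single kept component explains essentially all of the variance.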

Actually, your idea of deleting variables is good (we can use stepwise regression). Maybe we can try both PCA and stepwise regression.
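The stepwise idea can be sketched as forward selection (one common variant; the thread does not say which direction): repeatedly add whichever remaining feature most reduces the residual sum of squares of an ordinary least-squares fit.

```python
import numpy as np

def forward_stepwise(X, y, max_features):
    """Greedy forward selection of column indices of X.

    At each step, fit OLS with the already-selected columns plus one
    candidate, and keep the candidate that gives the lowest residual
    sum of squares.
    """
    selected = []
    remaining = list(range(X.shape[1]))
    for _ in range(max_features):
        best_rss, best_j = None, None
        for j in remaining:
            cols = selected + [j]
            A = np.column_stack([np.ones(len(y)), X[:, cols]])  # intercept + features
            coef, *_ = np.linalg.lstsq(A, y, rcond=None)
            rss = np.sum((y - A @ coef) ** 2)
            if best_rss is None or rss < best_rss:
                best_rss, best_j = rss, j
        selected.append(best_j)
        remaining.remove(best_j)
    return selected

# With y driven almost entirely by column 2, that column is picked first:
rng = np.random.default_rng(1)
X = rng.normal(size=(200, 4))
y = 3 * X[:, 2] + rng.normal(scale=0.1, size=200)
print(forward_stepwise(X, y, 2))
```

A production version would stop on an information criterion (AIC/BIC) or cross-validated error rather than a fixed feature count.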

GregoryWu commented 7 years ago

So do you mean that PCA can automatically delete the useless variables? As far as I know, the function of PCA is to lower the complexity of the model. I think we can find the variables that are correlated with each other and treat them as a group, then use PCA to compress each group into one feature... what do you think? Also, since we are using gradient boosting, its parameters can be tuned to do something like ridge regression or lasso during training.
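The ridge/lasso analogy for gradient boosting maps onto explicit regularization knobs; the names below follow XGBoost's scikit-learn API (the thread never names a library, so the values and the choice of XGBoost are illustrative only):

```python
# Hypothetical hyperparameter settings; names follow XGBoost's sklearn API.
boosting_params = {
    "reg_lambda": 1.0,        # L2 penalty on leaf weights -- ridge-like shrinkage
    "reg_alpha": 0.5,         # L1 penalty on leaf weights -- lasso-like sparsity
    "learning_rate": 0.1,     # shrinks each tree's contribution
    "subsample": 0.8,         # row subsampling per tree
    "colsample_bytree": 0.8,  # feature subsampling, which also decorrelates trees
}
# e.g. model = xgboost.XGBRegressor(**boosting_params)  -- not run here
```

The L1 penalty can drive some leaf weights to exactly zero, which is the lasso-like behaviour mentioned above; the column subsampling attacks the same correlated-tree problem this thread is about.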

102207429 commented 7 years ago

Agree with your opinion. I've just used the correlation coefficient first, since it's easier and reliable; please take a look at the newest issue. Thanks.