GregoryWu opened this issue 7 years ago
Sorry for the late reply.
We can use PCA (principal component analysis) or factor analysis to remove the features that have low contribution, or we can use ridge regression. Ridge regression introduces a small amount of bias to reduce variance. I would prefer PCA, since it has performed better in my experience.
Actually, your way of deleting variables is good (we could use stepwise regression). Maybe we can try both PCA and stepwise regression.
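A minimal scikit-learn sketch of the two options above. The data here is synthetic and the parameter values (`n_components=0.95`, `alpha=1.0`) are arbitrary choices for illustration, not values from this project:

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.linear_model import Ridge

rng = np.random.default_rng(0)

# Toy data: 3 independent features plus two near-copies of the first,
# so two of the five columns contribute almost nothing new.
X = rng.normal(size=(100, 3))
X = np.hstack([
    X,
    X[:, :1] + 0.01 * rng.normal(size=(100, 1)),
    X[:, :1] + 0.01 * rng.normal(size=(100, 1)),
])
y = X[:, 0] + 0.5 * X[:, 1] + rng.normal(scale=0.1, size=100)

# Option 1 -- PCA: keep only the components explaining 95% of the
# variance, which drops the redundant directions.
pca = PCA(n_components=0.95)
X_reduced = pca.fit_transform(X)
print(X_reduced.shape)

# Option 2 -- ridge: keep all features but shrink the coefficients,
# trading a little bias for lower variance.
ridge = Ridge(alpha=1.0).fit(X, y)
print(ridge.coef_)
```

With PCA the redundant columns collapse into fewer components; with ridge all five coefficients survive but are shrunk.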
So do you mean that PCA can automatically delete the useless variables? As far as I know, PCA reduces the dimensionality of the data rather than selecting variables. I think we could find out which variables are correlated with each other and treat them as a group, then use PCA to compress each group into one feature... what do you think? Also, since we are using gradient boosting, its parameters can be tuned to regularize the model during training, similar to what ridge regression or lasso does.
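The grouping idea above could be sketched like this: take a group of correlated features, compress it with a one-component PCA, and keep the independent features as they are. The data is synthetic and the 0.05 noise scale is an arbitrary assumption:

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(1)

# Toy data: features 0-2 form one tightly correlated group
# (noisy copies of the same signal); feature 3 is independent.
base = rng.normal(size=(200, 1))
group = base + 0.05 * rng.normal(size=(200, 3))
other = rng.normal(size=(200, 1))
X = np.hstack([group, other])

# Compress the correlated group into a single PCA component,
# then stack it next to the untouched independent feature.
pca = PCA(n_components=1)
group_feature = pca.fit_transform(X[:, :3])   # shape (200, 1)
X_compressed = np.hstack([group_feature, X[:, 3:]])

print(X_compressed.shape)
print(pca.explained_variance_ratio_[0])
```

For a tight group like this, the single component retains nearly all of the group's variance, so little information is lost by the compression.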
I agree with your opinion. I've just used the correlation coefficient first, since it's easier and more reliable; please take a look at the newest issue. Thanks.
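For reference, the correlation-coefficient check could look something like this. The data is made up, and the 0.9 threshold is an arbitrary cut-off for illustration, not a value from the project:

```python
import numpy as np

rng = np.random.default_rng(2)

# Toy data: feature 2 is a noisy copy of feature 0.
X = rng.normal(size=(300, 3))
X[:, 2] = X[:, 0] + 0.05 * rng.normal(size=300)

# Pairwise Pearson correlation matrix (features as columns).
corr = np.corrcoef(X, rowvar=False)
print(np.round(corr, 2))

# Flag the feature pairs whose absolute correlation exceeds the threshold.
threshold = 0.9
high = [(i, j) for i in range(corr.shape[0])
               for j in range(i + 1, corr.shape[1])
               if abs(corr[i, j]) > threshold]
print(high)   # with this seed, only the (0, 2) pair is flagged
```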
If we can find a way to bring down the correlation between variables, the variance of the trees should come down as well, which should lead to better performance.
How do you think we can lower the correlation between variables? The only way I can think of is to delete variables that are correlated with others and keep only one. (e.g. if variables A, B, C, D are highly correlated, we can keep just one of them.)
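The keep-one-out idea could be sketched as a greedy filter: walk the columns in order and drop any column that correlates strongly with one we have already kept. Synthetic data again, and the 0.9 cut-off is an assumed value:

```python
import numpy as np

rng = np.random.default_rng(3)

# Toy data: columns 0-2 are near-duplicates (like A, B, C);
# column 3 stands alone.
base = rng.normal(size=300)
X = np.column_stack(
    [base + 0.05 * rng.normal(size=300) for _ in range(3)]
    + [rng.normal(size=300)]
)

# Greedy filter: keep a column only if it is not highly correlated
# with any column already kept.
corr = np.abs(np.corrcoef(X, rowvar=False))
threshold = 0.9
keep = []
for j in range(X.shape[1]):
    if all(corr[j, k] <= threshold for k in keep):
        keep.append(j)

X_filtered = X[:, keep]
print(keep)   # with this seed: one survivor per correlated group
```

One caveat of the greedy order: which member of a correlated group survives depends on column order, so it may be worth choosing the survivor by some criterion (e.g. correlation with the target) instead.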