This is a GSoC 2012 fork of 'Mothur'. We are trying to implement a number of 'Feature Selection' algorithms for microbial ecology data and incorporate them into mothur's main codebase.
@kdiverson I've come across a paper called "On Feature Selection, Bias-Variance and Bagging" by M. Arthur Munson and Rich Caruana, in which the authors state that performing feature selection before running any bagging algorithm negatively impacts accuracy.
In our case, we've noticed a similar phenomenon when implementing standardDeviationThreshold, where we do a very crude form of feature selection by pruning garbage/null features. The error rate increased when I set standardDeviationThreshold to anything above 0.1. Remember that Random Forest is an instance of a bagging procedure.
The authors claim that bagging algorithms perform better with noisy data. On the other hand, as we've seen, there are a lot of semi-null features being discarded by standardDeviationThreshold, and these can be considered noisy data as well.
On the flip side of the coin, discarding these semi-null features might help reduce over-fitting and avoid local optima. Sometimes superfluous leaves are created because of these semi-null features, which may be an indication of over-fitting.
As such, I think the best approach is a tradeoff: setting standardDeviationThreshold to 0.1 or less, which is a very low value, so that we get the best of both worlds. A minimal sketch of the kind of crude filter I mean is below.
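For reference, here's a minimal C++ sketch of that crude filter, assuming a column-major feature layout; the function names and data layout are made up for illustration and are not the actual mothur code:

```cpp
// Hypothetical sketch (not the actual mothur API): prune features whose
// per-feature standard deviation falls below a threshold.
#include <cmath>
#include <iostream>
#include <vector>

// Compute the (population) standard deviation of one feature column.
double standardDeviation(const std::vector<double>& values) {
    double mean = 0.0;
    for (double v : values) mean += v;
    mean /= values.size();
    double variance = 0.0;
    for (double v : values) variance += (v - mean) * (v - mean);
    return std::sqrt(variance / values.size());
}

// Return the indices of features whose standard deviation exceeds the threshold.
// Features at or below the threshold are treated as garbage/null and discarded.
std::vector<int> selectFeatures(const std::vector<std::vector<double>>& featureColumns,
                                double standardDeviationThreshold) {
    std::vector<int> kept;
    for (int i = 0; i < (int)featureColumns.size(); i++) {
        if (standardDeviation(featureColumns[i]) > standardDeviationThreshold) {
            kept.push_back(i);
        }
    }
    return kept;
}

int main() {
    // Toy data: three features (columns) observed over four samples.
    std::vector<std::vector<double>> featureColumns = {
        {0.0, 0.0, 0.0, 0.0},   // null feature, stddev = 0
        {1.0, 1.0, 1.1, 1.0},   // semi-null feature, very low stddev
        {0.2, 3.5, 1.7, 4.9}    // informative feature
    };
    // A low threshold (<= 0.1) drops only truly null/near-null features,
    // keeping the milder noise that bagging can still exploit.
    std::vector<int> kept = selectFeatures(featureColumns, 0.1);
    for (int i : kept) std::cout << "keeping feature " << i << "\n";
    return 0;
}
```

With a threshold of 0.1 this keeps only the last feature in the toy data; raising the threshold starts discarding the semi-null features as well, which is where the bagging accuracy concern kicks in.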
P.S. I've added the paper to the FeatureSelectionResources/FeatureSelectionAndBagging folder in Dropbox.