azmfaridee / mothur

This is a GSoC 2012 fork of 'Mothur'. We are trying to implement a number of 'Feature Selection' algorithms for microbial ecology data and incorporate them into mothur's main codebase.
https://github.com/mothur/mothur
GNU General Public License v3.0

On Negative Impact of Feature Selection on Bagging #26

Open azmfaridee opened 11 years ago

azmfaridee commented 11 years ago

@kdiverson I've come across a paper called On Feature Selection, Bias-Variance and Bagging by M. Arthur Munson and Rich Caruana, where the authors state that performing feature selection before running a bagging algorithm negatively impacts accuracy.

In our case, we've also noticed a similar phenomenon when implementing standardDeviationThreshold, where we are doing a very crude feature selection by pruning garbage/null features. I've noticed that the error rate increased when I set standardDeviationThreshold to anything above 0.1. Remember that Random Forest is an instance of a bagging procedure.

The authors claim that bagging algorithms perform better with noisy data. On the other hand, as we've seen, there are a lot of semi-null features that we are discarding with standardDeviationThreshold, and these can be considered noisy data as well.

On the flip side of the coin, discarding these semi-null features might help reduce over-fitting and avoid local optima. Sometimes superfluous leaves are created from these semi-null features, which might be an indication of over-fitting.

As such, I think the best approach is a tradeoff: set standardDeviationThreshold to 0.1 or less, which is a very low value, so that we can get the best of both worlds.
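For reference, the pruning step being discussed can be sketched as follows. This is a minimal illustration, not the actual mothur implementation; the function name `pruneLowVarianceFeatures` and the data layout are assumptions for the example.

```cpp
#include <cassert>
#include <cmath>
#include <vector>

// Hypothetical sketch: keep only features whose per-feature standard
// deviation exceeds a threshold. data[i][j] is the value of feature j
// in sample i. Returns the indices of the surviving features.
std::vector<size_t> pruneLowVarianceFeatures(
        const std::vector<std::vector<double>>& data,
        double stdDevThreshold) {
    std::vector<size_t> kept;
    if (data.empty()) return kept;
    size_t numSamples = data.size();
    size_t numFeatures = data[0].size();
    for (size_t j = 0; j < numFeatures; j++) {
        double mean = 0.0;
        for (size_t i = 0; i < numSamples; i++) mean += data[i][j];
        mean /= numSamples;
        double variance = 0.0;
        for (size_t i = 0; i < numSamples; i++) {
            double d = data[i][j] - mean;
            variance += d * d;
        }
        variance /= numSamples;  // population variance
        if (std::sqrt(variance) > stdDevThreshold) kept.push_back(j);
    }
    return kept;
}
```

With a threshold of 0.1, a constant (null) feature is dropped while even a mildly varying feature survives, which is the "crude" pruning described above.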

P.S. I've added the paper to the FeatureSelectionResources/FeatureSelectionAndBagging folder in Dropbox.

azmfaridee commented 11 years ago

This particular issue can be very important for Issue #34. Does computing an F-score beforehand help bagging, or does it hurt it?
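For context, the F-score in question scores a single feature for a two-class problem: the numerator measures how far each class mean sits from the overall mean, and the denominator sums the within-class variances. Below is a minimal sketch under that definition; the function name `fScore` and the input layout are assumptions, not code from Issue #34.

```cpp
#include <cassert>
#include <cmath>
#include <vector>

// Hypothetical sketch of a two-class F-score for one feature.
// pos and neg hold the feature's values in the positive and negative
// class; a larger score means the feature separates the classes better.
double fScore(const std::vector<double>& pos, const std::vector<double>& neg) {
    auto mean = [](const std::vector<double>& v) {
        double s = 0.0;
        for (double x : v) s += x;
        return s / v.size();
    };
    double mPos = mean(pos), mNeg = mean(neg);
    // Overall mean across both classes combined.
    double mAll = (mPos * pos.size() + mNeg * neg.size())
                  / (pos.size() + neg.size());
    auto sampleVar = [](const std::vector<double>& v, double m) {
        double s = 0.0;
        for (double x : v) s += (x - m) * (x - m);
        return s / (v.size() - 1);  // sample variance
    };
    double numerator = (mPos - mAll) * (mPos - mAll)
                     + (mNeg - mAll) * (mNeg - mAll);
    double denominator = sampleVar(pos, mPos) + sampleVar(neg, mNeg);
    return numerator / denominator;
}
```

Ranking features by this score and keeping only the top ones would be the "prior F-score" filter whose interaction with bagging is the open question here.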