redshiftzero opened this issue 8 years ago
The goal of this is to reduce computational time, not to increase accuracy, correct? I would assume that removing any features would have a negative effect on accuracy (although for features with extremely low variance, to use one metric of usefulness, I assume the effect would be negligible).
You're certainly right that it will make things faster, but feature selection is primarily for improving our classifier's results (on our metrics: AUC, etc.). If we add a lot of not-very-useful features, we are adding a lot of noise, which makes the learning problem significantly harder. It means we'll need more data to explore a much larger feature space, and we also run the risk of the classifier fitting to that noise and overfitting.
Many of our features are not very useful. We should include a first step of feature selection before passing the feature matrix to the classifier. This could be something simple, e.g. a variance threshold, or something more complex. See a reference here in scikit-learn for how we can do this (no wheel invention necessary) — a minimal sketch is below.
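For concreteness, here's a rough sketch of the variance-threshold option using scikit-learn's `VarianceThreshold`. The toy matrix `X` and the threshold value are illustrative placeholders, not values from this project:

```python
import numpy as np
from sklearn.feature_selection import VarianceThreshold

# Toy feature matrix: 6 samples x 3 features.
# The middle column is constant, so it carries no information.
X = np.array([
    [0.0, 1.0, 3.2],
    [1.0, 1.0, 2.9],
    [0.0, 1.0, 3.1],
    [1.0, 1.0, 3.0],
    [0.0, 1.0, 2.8],
    [1.0, 1.0, 3.3],
])

# Drop features whose variance falls below the threshold.
# The threshold here is arbitrary; it would need tuning for our data.
selector = VarianceThreshold(threshold=0.01)
X_reduced = selector.fit_transform(X)

print(X_reduced.shape)          # (6, 2): the constant column is removed
print(selector.get_support())   # boolean mask over the original features
```

In practice this could be the first step of a scikit-learn `Pipeline`, so the selector is fit only on training data and the same feature mask is applied at prediction time.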