freedomofpress / fingerprint-securedrop

A machine learning data analysis pipeline for analyzing website fingerprinting attacks and defenses.
GNU Affero General Public License v3.0

Feature Selection #63

Open redshiftzero opened 8 years ago

redshiftzero commented 8 years ago

Many of our features are not very useful. We should include a first step of feature selection before passing the features matrix to the classifier. This could be something simple, e.g. a variance threshold, or something more complex. See a reference here in scikit-learn for how we can do this (no wheel invention necessary).
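
For concreteness, a minimal sketch of the variance-threshold option using scikit-learn's `VarianceThreshold` (the threshold value and the placeholder feature matrix are just for illustration, not our real data):

```python
import numpy as np
from sklearn.feature_selection import VarianceThreshold

# Placeholder standing in for our real features matrix.
rng = np.random.RandomState(0)
X = rng.rand(100, 50)
X[:, 10] = 0.5  # a near-constant column that the threshold should drop

# Drop features whose variance falls below an (illustrative) threshold
# before the matrix is passed to the classifier.
selector = VarianceThreshold(threshold=0.01)
X_reduced = selector.fit_transform(X)

# Boolean mask of the surviving columns, handy for mapping back to feature names.
kept = selector.get_support()
print(X.shape, "->", X_reduced.shape)
```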

psivesely commented 8 years ago

The goal of this is to reduce computational time, not increase accuracy, correct? I would assume removing any features would have a negative effect on accuracy, although for features with extremely low variance (to use one metric of usefulness) the effect would presumably be negligible.

redshiftzero commented 8 years ago

You're certainly right that it will make things faster, but feature selection is primarily about improving our classifier's results on our metrics (AUC, etc.). If we add a lot of not-very-useful features, we are adding a lot of noise, which makes the learning problem significantly harder: we'll need more data to cover a much larger feature space, and we run the risk of the classifier picking up on noise and overfitting.
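
To make that concrete, here is a rough sketch of how a selection step could be evaluated on cross-validated AUC rather than assumed to help. The `SelectKBest`/ANOVA scorer, the random forest, and the synthetic data are stand-ins for whatever selector, classifier, and feature matrix we actually end up using:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline

# Synthetic data with many uninformative features, mimicking a noisy feature matrix.
X, y = make_classification(n_samples=500, n_features=100, n_informative=10,
                           n_redundant=5, random_state=0)

# Baseline: classifier on the full feature matrix.
clf = RandomForestClassifier(n_estimators=100, random_state=0)
baseline_auc = cross_val_score(clf, X, y, cv=5, scoring="roc_auc").mean()

# Selection inside the pipeline so it is re-fit on each training fold
# (fitting it on the full data first would leak information into the CV score).
pipe = Pipeline([
    ("select", SelectKBest(f_classif, k=20)),
    ("clf", RandomForestClassifier(n_estimators=100, random_state=0)),
])
selected_auc = cross_val_score(pipe, X, y, cv=5, scoring="roc_auc").mean()

print("AUC without selection: %.3f" % baseline_auc)
print("AUC with selection:    %.3f" % selected_auc)
```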