Investigate rebalancing the training set

redshiftzero commented 7 years ago

We have a very imbalanced machine learning problem: there are far fewer SecureDrop users than non-SecureDrop users. There are many ways of handling this situation, including oversampling the minority class or undersampling the majority class. Some of the techniques used for machine learning with very skewed classes are implemented in this library: https://github.com/scikit-learn-contrib/imbalanced-learn, so we could give some of these a try.
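
For reference, here is a minimal sketch of what random oversampling and random undersampling do, written in plain Python so it runs without extra dependencies. The function names, the integer trace IDs, and the 10-vs-100 toy split are assumptions made for illustration; they are not taken from the fpsd code, and the samplers in imbalanced-learn are the more complete option.

```python
import random
from collections import Counter


def _group_by_class(samples, labels):
    """Map each label to the list of samples carrying it."""
    by_class = {}
    for sample, label in zip(samples, labels):
        by_class.setdefault(label, []).append(sample)
    return by_class


def random_oversample(samples, labels, seed=0):
    """Duplicate randomly chosen samples until every class matches the largest one."""
    rng = random.Random(seed)
    by_class = _group_by_class(samples, labels)
    target = max(len(group) for group in by_class.values())
    out_samples, out_labels = [], []
    for label, group in by_class.items():
        # Keep every original sample, then pad with random duplicates.
        extra = [rng.choice(group) for _ in range(target - len(group))]
        out_samples.extend(group + extra)
        out_labels.extend([label] * target)
    return out_samples, out_labels


def random_undersample(samples, labels, seed=0):
    """Randomly discard samples until every class matches the smallest one."""
    rng = random.Random(seed)
    by_class = _group_by_class(samples, labels)
    target = min(len(group) for group in by_class.values())
    out_samples, out_labels = [], []
    for label, group in by_class.items():
        out_samples.extend(rng.sample(group, target))
        out_labels.extend([label] * target)
    return out_samples, out_labels


if __name__ == "__main__":
    # Hypothetical toy data: 10 SecureDrop traces (label 1), 100 others (label 0).
    traces = list(range(110))
    labels = [1] * 10 + [0] * 100

    over_x, over_y = random_oversample(traces, labels)
    under_x, under_y = random_undersample(traces, labels)
    print("oversampled class counts: ", Counter(over_y))
    print("undersampled class counts:", Counter(under_y))

    # Oversampling balances the counts but adds no new information:
    # the minority class still contains only its original distinct traces.
    distinct = {x for x, y in zip(over_x, over_y) if y == 1}
    print("distinct minority traces after oversampling:", len(distinct))
```

Note that the oversampled minority class ends up with the same distinct traces it started with, only repeated, while undersampling throws away most of the majority class.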

psivesely commented 7 years ago

@redshiftzero and I briefly discussed this in person, including whether we should increase the monitored_nonmonitored_ratio in fpsd/config.ini. We decided to leave it for now, but if we later realize we want more SD data, it might be better to bump that from 10 to 100, which would give us roughly a 50:50 class split in terms of frontpage_traces. That's not to say the linked library has nothing useful; we should still see what we can get out of some of its functionality. The conclusion was that getting more raw data will give more accurate results than oversampling from the same dataset, where you are essentially replicating traces. Let me know if I missed anything here, @redshiftzero.

psivesely commented 7 years ago

Matthews correlation coefficient (sklearn.metrics.matthews_corrcoef) "is used in machine learning as a measure of the quality of binary (two-class) classifications... generally regarded as a balanced measure which can be used even if the classes are of very different sizes."
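
To illustrate why this matters for skewed classes, here is a small sketch that computes accuracy and the Matthews correlation coefficient from confusion-matrix counts in plain Python. The helper names and the 10-in-1000 toy labels are assumptions for illustration only; in practice sklearn.metrics.matthews_corrcoef would be the function to call. Returning 0 when the denominator is zero is a common convention and is assumed here.

```python
import math


def confusion_counts(y_true, y_pred):
    """Return (tp, tn, fp, fn) for binary labels where 1 is the positive class."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    return tp, tn, fp, fn


def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true label."""
    tp, tn, fp, fn = confusion_counts(y_true, y_pred)
    return (tp + tn) / (tp + tn + fp + fn)


def mcc(y_true, y_pred):
    """Matthews correlation coefficient; 0.0 when the denominator is zero (assumed convention)."""
    tp, tn, fp, fn = confusion_counts(y_true, y_pred)
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    if denom == 0:
        return 0.0
    return (tp * tn - fp * fn) / denom


if __name__ == "__main__":
    # Hypothetical skewed test set: 10 positives out of 1000 samples.
    y_true = [1] * 10 + [0] * 990

    # A classifier that always predicts the majority class.
    always_negative = [0] * 1000
    print("always-negative accuracy:", accuracy(y_true, always_negative))
    print("always-negative MCC:     ", mcc(y_true, always_negative))

    # A perfect classifier, for comparison.
    print("perfect accuracy:", accuracy(y_true, y_true))
    print("perfect MCC:     ", mcc(y_true, y_true))
```

The always-negative classifier scores high on accuracy while never identifying a positive sample; its MCC does not reward that behavior.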

freedomofpress / fingerprint-securedrop

Investigate rebalancing the training set #61