Identify methods to deal with imbalanced dataset

Review of methodologies from recent literature:

Campbell et al. (2020) trained and tested a random forest classifier that returns a binary output (1 for predicted outbreak, 0 for no outbreak) on a highly imbalanced dataset (77 outbreaks and 8,504 non-outbreak data points in a monthly time series for 40 coastal districts). The authors applied the Synthetic Minority Oversampling Technique (SMOTE) in the pre-processing stage. SMOTE generates new minority-class examples by interpolating along lines drawn in feature space between an existing minority example and one of its k nearest minority-class neighbors. Following a sensitivity analysis, the authors identified a real-world ratio of 1:10 (outbreaks vs. non-outbreaks).
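To make the interpolation step concrete, here is a minimal standard-library sketch of the SMOTE idea described above. It is not the code used by Campbell et al.; the function name `smote_sketch`, the toy `outbreaks` points, and the parameter values are all assumptions made for illustration.

```python
import math
import random


def smote_sketch(minority, n_synthetic, k=5, seed=0):
    """Create n_synthetic points by interpolating between minority neighbors."""
    if len(minority) < 2:
        raise ValueError("need at least two minority samples to interpolate")
    rng = random.Random(seed)
    k = min(k, len(minority) - 1)
    synthetic = []
    for _ in range(n_synthetic):
        # Pick a random minority sample as the base point.
        i = rng.randrange(len(minority))
        base = minority[i]
        # Find its k nearest minority neighbors (excluding itself).
        others = [p for j, p in enumerate(minority) if j != i]
        neighbors = sorted(others, key=lambda p: math.dist(base, p))[:k]
        neighbor = rng.choice(neighbors)
        # Place the new point at a random position on the segment
        # between the base point and the chosen neighbor.
        gap = rng.random()
        synthetic.append(
            tuple(b + gap * (n - b) for b, n in zip(base, neighbor))
        )
    return synthetic


# Hypothetical 2-D minority class (e.g. two scaled features per outbreak).
outbreaks = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0)]
new_points = smote_sketch(outbreaks, n_synthetic=6, k=2)
```

Every synthetic point lies on a segment between two real minority points, so SMOTE never extrapolates outside the region the minority class already occupies.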

Leo et al. (2019) used the Adaptive Synthetic Sampling Approach (ADASYN), an improved version of SMOTE, to restore the sampling balance.

A gentle introduction to the family of SMOTE options can be found here.

A note from the above introduction on the key difference between ADASYN and SMOTE:

ADASYN uses a density distribution as a criterion to automatically decide the number of synthetic samples that must be generated for each minority sample, adaptively changing the weights of the different minority samples to compensate for the skewed distributions.
SMOTE generates the same number of synthetic samples for each original minority sample.
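
The difference in how the two methods allocate synthetic samples can be sketched as follows. This is a simplified, standard-library illustration and not the imblearn implementation: the helper names `smote_counts` and `adasyn_counts`, the toy points, and the uniform fallback when no minority sample has majority neighbors are all assumptions.

```python
import math


def smote_counts(minority, n_synthetic):
    """SMOTE-style allocation: the same count for every minority sample."""
    return [n_synthetic // len(minority)] * len(minority)


def adasyn_counts(minority, majority, n_synthetic, k=2):
    """ADASYN-style allocation: weight each minority sample by the share
    of majority-class points among its k nearest neighbors."""
    labelled = [(p, 0) for p in majority] + [(p, 1) for p in minority]
    ratios = []
    for i, x in enumerate(minority):
        self_index = len(majority) + i
        others = [item for j, item in enumerate(labelled) if j != self_index]
        nearest = sorted(others, key=lambda item: math.dist(x, item[0]))[:k]
        ratios.append(sum(1 for _, label in nearest if label == 0) / k)
    total = sum(ratios)
    if total == 0:
        # Assumed fallback: no sample is near the majority class.
        return smote_counts(minority, n_synthetic)
    # Rounding means the counts may not sum exactly to n_synthetic.
    return [round(n_synthetic * r / total) for r in ratios]


# Hypothetical data: three minority points far from the majority class,
# one minority point surrounded by majority points.
minority = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1), (5.0, 5.0)]
majority = [(5.0, 5.1), (5.1, 5.0), (4.9, 5.0)]

uniform = smote_counts(minority, n_synthetic=8)              # [2, 2, 2, 2]
adaptive = adasyn_counts(minority, majority, n_synthetic=8)  # [0, 0, 0, 8]
```

In this toy case ADASYN puts all of the new samples around the one minority point that sits inside majority territory, while SMOTE spreads them evenly.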

A tutorial using the imblearn python package can be found here.

Proposed next steps:

Incorporate SMOTE, since we only have a binary classification problem and do not have to account for multiple minority classes.

developmentseed / geospatial-ds-cholera-lab

Identify methods to deal with imbalanced dataset #19