developmentseed / geospatial-ds-cholera-lab

A repo dedicated to developing a geospatial data science prototype (see issue: https://github.com/developmentseed/labs/issues/292)

Identify methods to deal with imbalanced dataset #19

Open kathrynberger opened 1 year ago

kathrynberger commented 1 year ago

The dataset is heavily imbalanced (there are far more "non-outbreak" occurrences than "outbreaks"), so we'll need to account for this before training the ML model while still representing true disease dynamics.

The literature suggests using a 1:10 ratio (outbreaks to non-outbreaks) and SMOTE for imbalanced classification problems.

But there are many other options; this issue is for identifying a list of potentially appropriate ones.

kathrynberger commented 1 year ago

Review of methodologies from recent literature:

Campbell et al. (2020) trained and tested a random forest classifier that returns a binary output (1 for predicted outbreak, 0 for no outbreak) on a highly imbalanced dataset (77 outbreaks and 8,504 non-outbreak data points in a monthly time series for 40 coastal districts). The authors applied the Synthetic Minority Oversampling Technique (SMOTE) in the pre-processing stage, which generates new minority-class examples along lines drawn between an existing example and one of its k nearest neighbors in feature space. Following a sensitivity analysis, the authors identified a real-world ratio of 1:10 (outbreaks vs. non-outbreaks).
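To make the SMOTE step concrete, here is a minimal pure-Python sketch of the interpolation idea described above. It is not the authors' code: the function names `smote` and `n_synthetic_for_ratio` are hypothetical, the base sample is picked at random for each synthetic point (a simplification), and treating the 1:10 ratio as the oversampling target is an assumption made for illustration.

```python
import math
import random


def smote(minority, n_synthetic, k=5, seed=0):
    """Create n_synthetic points by interpolating between minority samples.

    minority: list of feature vectors (lists of floats) from the minority class.
    """
    rng = random.Random(seed)
    k = min(k, len(minority) - 1)
    if k < 1:
        raise ValueError("need at least two minority samples")

    # For every minority sample, find its k nearest minority neighbours.
    neighbours = []
    for i, x in enumerate(minority):
        others = [j for j in range(len(minority)) if j != i]
        others.sort(key=lambda j: math.dist(x, minority[j]))
        neighbours.append(others[:k])

    synthetic = []
    for _ in range(n_synthetic):
        i = rng.randrange(len(minority))   # random base sample
        j = rng.choice(neighbours[i])      # one of its k nearest neighbours
        gap = rng.random()                 # position along the segment i -> j
        synthetic.append(
            [a + gap * (b - a) for a, b in zip(minority[i], minority[j])]
        )
    return synthetic


def n_synthetic_for_ratio(n_minority, n_majority, majority_per_minority=10):
    """Synthetic minority samples needed to reach a 1:majority_per_minority ratio."""
    target_minority = n_majority // majority_per_minority
    return max(0, target_minority - n_minority)


# Class counts reported above: 77 outbreaks, 8,504 non-outbreaks.
print(n_synthetic_for_ratio(77, 8504))  # 773
```

Under these assumptions, reaching 1:10 from 77 outbreaks and 8,504 non-outbreaks means adding 773 synthetic outbreak rows (850 in total). Oversampling should be applied to the training split only, so that synthetic points never leak into the test data.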

Leo et al. (2019) used the Adaptive Synthetic Sampling Approach (ADASYN), an improved version of SMOTE, to restore the sampling balance.
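For comparison, a rough pure-Python sketch of the ADASYN idea follows. It is not taken from Leo et al.: the function name `adasyn` is hypothetical, and sampling base points in proportion to their difficulty is a simplification of the per-sample counts used in the original algorithm. The point it illustrates is that minority samples surrounded by majority-class neighbours receive more synthetic points than those in "easy" regions.

```python
import math
import random


def adasyn(minority, majority, n_synthetic, k=5, seed=0):
    """Return (synthetic_points, difficulty) for a two-class dataset."""
    if len(minority) < 2:
        raise ValueError("need at least two minority samples")
    rng = random.Random(seed)
    # Minority rows come first, so index i in `labelled` is minority[i].
    labelled = [(x, 1) for x in minority] + [(x, 0) for x in majority]

    # Step 1: difficulty of a minority sample = share of majority-class
    # points among its k nearest neighbours in the full dataset.
    difficulty = []
    for i, x in enumerate(minority):
        others = [p for n, p in enumerate(labelled) if n != i]
        others.sort(key=lambda p: math.dist(x, p[0]))
        nearest = others[:k]
        difficulty.append(sum(1 for _, lab in nearest if lab == 0) / len(nearest))

    # If no minority sample has majority neighbours, fall back to uniform
    # weights, which reduces to the SMOTE-style behaviour.
    weights = difficulty if sum(difficulty) > 0 else [1.0] * len(minority)

    # Step 2: pick base samples in proportion to difficulty, then
    # interpolate toward a nearby minority sample, as SMOTE does.
    k_min = min(k, len(minority) - 1)
    synthetic = []
    for i in rng.choices(range(len(minority)), weights=weights, k=n_synthetic):
        x = minority[i]
        others = sorted(
            (j for j in range(len(minority)) if j != i),
            key=lambda j: math.dist(x, minority[j]),
        )
        j = rng.choice(others[:k_min])
        gap = rng.random()
        synthetic.append([a + gap * (b - a) for a, b in zip(x, minority[j])])
    return synthetic, difficulty
```

In this sketch the returned `difficulty` list can also be inspected on its own to see which outbreak records sit closest to the non-outbreak class.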

A gentle introduction to the family of SMOTE options can be found here.

A note from the above introduction on the key difference between ADASYN and SMOTE:

A tutorial using the imblearn python package can be found here.

Proposed next steps: