As of the end of September 2023, the full PoC, including the methodology in the checklist below, has been completed. ✅
[x] Data collection and spatial exploratory data analysis. We’ll explore what patterns, over both space and time, can be observed from the cholera outbreaks themselves. We’ll also explore the literature to understand which remotely sensed environmental factors (e.g., precipitation, temperature) have been suggested as drivers of disease spread.
[x] Development of pre-processing pipeline for remotely sensed EO data. We’ll develop a pre-processing pipeline to ensure our satellite data is assembled and aggregated at the same level as our outbreak data (i.e., monthly values for each district) and is ready to be ingested into an ML model.
[x] ML model exploration. We’ll explore a number of ML approaches (e.g., Random Forest, SVMs) to understand the patterns between cholera outbreaks and the environmental drivers we have identified.
[ ] Visualize model results and share findings. We’ll provide visuals of our model results and share our findings in a collection of Jupyter notebooks.
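As a rough illustration of the pre-processing step in the checklist above, the sketch below aggregates observations to one row per district per month and joins them to an outbreak table. Everything in it is an assumption made for illustration: the column names (`district`, `date`, `precip_mm`, `temp_c`, `outbreak`), the made-up values, and the choice to sum precipitation and average temperature. The project's actual pipeline may differ.

```python
import pandas as pd

# Hypothetical EO observations per district (made-up values, not project data).
eo = pd.DataFrame({
    "district": ["A", "A", "A", "B", "B", "B"],
    "date": pd.to_datetime([
        "2020-01-05", "2020-01-20", "2020-02-03",
        "2020-01-10", "2020-02-11", "2020-02-25",
    ]),
    "precip_mm": [10.0, 30.0, 5.0, 0.0, 12.0, 8.0],
    "temp_c": [25.0, 27.0, 26.0, 30.0, 31.0, 29.0],
})

# Hypothetical outbreak table, already at district-month level.
outbreaks = pd.DataFrame({
    "district": ["A", "A", "B", "B"],
    "month": ["2020-01", "2020-02", "2020-01", "2020-02"],
    "outbreak": [0, 1, 0, 0],
})

# Derive a year-month key so both tables share the same temporal unit.
eo["month"] = eo["date"].dt.strftime("%Y-%m")

# Aggregate to one row per district per month.
# Assumption: precipitation is summed and temperature is averaged.
monthly = eo.groupby(["district", "month"], as_index=False).agg(
    precip_mm=("precip_mm", "sum"),
    temp_c=("temp_c", "mean"),
)

# Join the features to the outbreak labels on the shared keys.
features = monthly.merge(outbreaks, on=["district", "month"], how="inner")
print(features)
```

The resulting `features` table has one row per district-month, which is the shape a scikit-learn style model expects.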
The hypothesis, "Environmental factors alone won’t unravel this very complex relationship, but they can help identify spatio-temporal patterns that could help in allocating resources and support," has been tested, and there is reason to support it. That said, the classifier's results could be improved: its high accuracy score currently reflects the majority class only.
Further work on the treatment of the imbalanced dataset is needed. SMOTE, ADASYN and Tomek links have been applied with a variety of sampling strategies, with varying degrees of success. A sampling strategy of 0.1 (a 1:10 ratio of outbreak to non-outbreak events), as suggested by similar work in the literature, has not proven as successful. A 50:50 ratio improves model performance but does not reflect real-world scenarios.
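To make the majority-class effect concrete, here is a minimal sketch on made-up labels (950 non-outbreak rows, 50 outbreak rows; not the project data). A "model" that always predicts non-outbreak scores high on accuracy while never detecting an outbreak. The helper `minority_count_for_ratio` is hypothetical and only shows the arithmetic behind a sampling ratio, assuming the ratio means minority count divided by majority count after resampling, which is my reading of the 1:10 description above.

```python
import numpy as np
from sklearn.metrics import accuracy_score, balanced_accuracy_score, f1_score

# Made-up labels: 950 non-outbreak (0) and 50 outbreak (1) rows.
y_true = np.array([0] * 950 + [1] * 50)

# A trivial baseline that always predicts the majority class.
y_pred = np.zeros_like(y_true)

accuracy = accuracy_score(y_true, y_pred)
f1 = f1_score(y_true, y_pred, zero_division=0)
balanced_acc = balanced_accuracy_score(y_true, y_pred)

print(f"accuracy={accuracy:.2f}, f1={f1:.2f}, balanced accuracy={balanced_acc:.2f}")


def minority_count_for_ratio(n_majority: int, ratio: float) -> int:
    """Hypothetical helper: minority rows needed so minority/majority == ratio."""
    return round(n_majority * ratio)


# A ratio of 0.1 needs 95 outbreak rows (45 more than the 50 we have);
# a 50:50 split (ratio 1.0) needs 950.
print(minority_count_for_ratio(950, 0.1), minority_count_for_ratio(950, 1.0))
```

This is why accuracy alone is a weak signal here: F1 and balanced accuracy expose the fact that the minority class is never predicted.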
That being said, there is some fine-tuning and further exploration I would recommend:
[ ] The train/test splits are currently created by random sampling. However, stratified splitting could (and should) be explored, especially as we are dealing with a highly imbalanced dataset. Stratified splitting preserves the relative proportions of each class (outbreak=1, outbreak=0) across splits. See documentation here and here for enabling stratification in the train/test split.
[ ] Further explore methods for dealing with imbalanced datasets. Compare options for handling imbalanced data while keeping the model (e.g., RandomForest) constant. In this way, you can run a sensitivity analysis of how the model's performance changes as a result of how you've handled the imbalanced dataset problem.
[ ] Once you are happy that the baseline Random Forest model performs reasonably on both outbreak and non-outbreak data as a result of your handling of the imbalanced dataset, go ahead and compare the binary classification models. We explored Random Forest, as the literature suggested this approach was best suited to this kind of analysis, as well as SVMs, but there are a variety of others we didn't have time to explore. Use a combination of accuracy, F1 and ROC AUC scores to evaluate your classifier's performance.
[ ] After you've explored these a bit further and feel happy with the results, go ahead and clean up the notebook - walk through the problem and determine the best visuals/comms for sharing the output of this work. Well done! 💫
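The stratified split and the three evaluation metrics recommended above could be wired together roughly as follows. This is a sketch under stated assumptions, not the notebook's actual code: the data is synthetic (generated with `make_classification` at roughly 5% positives as a stand-in for the district-month table), and the feature count, test size and Random Forest settings are arbitrary choices.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, f1_score, roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic, imbalanced stand-in data (NOT the project data): ~5% outbreak=1.
X, y = make_classification(
    n_samples=2000, n_features=6, n_informative=4, n_redundant=0,
    weights=[0.95, 0.05], random_state=42,
)

# stratify=y keeps the outbreak / non-outbreak proportions equal across splits.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=42
)
train_rate = y_train.mean()
test_rate = y_test.mean()
print(f"outbreak rate: train={train_rate:.3f}, test={test_rate:.3f}")

# Baseline model, held constant while imbalance treatments are varied.
model = RandomForestClassifier(n_estimators=100, random_state=42)
model.fit(X_train, y_train)

y_pred = model.predict(X_test)
y_score = model.predict_proba(X_test)[:, 1]  # probability of outbreak=1

# Report accuracy, F1 and ROC AUC together rather than accuracy alone.
scores = {
    "accuracy": accuracy_score(y_test, y_pred),
    "f1": f1_score(y_test, y_pred, zero_division=0),
    "roc_auc": roc_auc_score(y_test, y_score),
}
print(scores)
```

If a resampling method is added, it should be applied to the training split only, so the test split keeps the real-world class balance.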