Geospatial Cholera Lab

A repository dedicated to developing a geospatial data science prototype (see issue: https://github.com/developmentseed/labs/issues/292).

The Objective

To explore machine learning (ML) techniques on publicly available, open-source datasets and demonstrate their potential to predict cholera in endemic regions of the world. The goal is a proof of concept (PoC), built only on open-source data, that could be developed further into a public health planning and decision-making tool for humanitarian organizations, and that provides more context on cholera patterns than case counts alone.

Literature Support

In cholera-endemic countries, the literature supports the existence of environmental signatures linked to seasonal outbreaks, which could be explored and used to develop a framework for an early warning system. See also "The seasonality of cholera in sub-Saharan Africa: a statistical modelling study" for supporting work in this area.

The Challenge

Proposed open-source, available datasets

Focus on an area where cholera has been identified as a major issue and where subnational, sub-annual surveillance data is available: Sub-Saharan Africa. Data availability over the study period also lets us take advantage of a number of remotely sensed variables captured over the same time frame.

Cholera outbreak data

Environmental drivers

Below is a list of potential indicator datasets for inclusion in the Cholera Lab study, based on literature support (Gwenzi & Sanganyado 2019; Lessler et al. 2018; Perez-Saez et al. 2022; Moore et al. 2017; and others outlined more specifically below).

Based on the indicators available for both the spatial and temporal extent of our area of interest (AOI: Sub-Saharan Africa, 2010-2019), we will extract the following environmental parameters for our investigation.

| Variable | Temporal Resolution | Spatial Resolution | Data Availability | Data Source |
| --- | --- | --- | --- | --- |
| Land Surface Temperature | monthly | 1.11 km | 1995-2020 | CEDA |
| Precipitation | monthly | 5 km | 1981 to near present | CHIRPS, with multiple access points, including UCSB Storage and SERVIR GLOBAL |
| Soil Moisture | daily | 0.25 degrees (approx. 27-28 km) | 1991-2021 | ESA Climate Data Dashboard |
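
The approximate 27-28 km figure for the 0.25 degree soil moisture grid follows from the ground length of one degree. The snippet below is a minimal sketch of that arithmetic, assuming a spherical Earth with a mean radius of 6371 km (an assumption made here for illustration, not a value taken from the datasets):

```python
import math

EARTH_RADIUS_KM = 6371.0  # assumed mean Earth radius (spherical approximation)


def cell_size_km(resolution_deg: float, latitude_deg: float = 0.0) -> tuple[float, float]:
    """Return the approximate (north-south, east-west) size of a grid cell in km."""
    km_per_degree = 2 * math.pi * EARTH_RADIUS_KM / 360.0
    north_south = resolution_deg * km_per_degree
    # East-west extent shrinks with the cosine of the latitude.
    east_west = north_south * math.cos(math.radians(latitude_deg))
    return north_south, east_west


ns, ew = cell_size_km(0.25)
print(f"{ns:.1f} km x {ew:.1f} km at the equator")
```

At the equator this gives roughly 27.8 km per side. The east-west extent is smaller away from the equator, which is why the cell size is best quoted as a range.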

Proposed Methodology

  1. Data collection and spatial exploratory data analysis. We’ll explore what patterns, over both space and time, can be observed from the cholera outbreaks themselves. We’ll also explore the literature to understand which remotely sensed environmental factors (e.g., precipitation, temperature) have been suggested as drivers of disease spread.
  2. Development of a pre-processing pipeline for remotely sensed EO data. We’ll develop a pre-processing pipeline to ensure our satellite data is assembled and aggregated at the same level as our outbreak data (i.e., monthly values for each district) and is ready to be ingested into an ML model.
  3. ML model exploration. We’ll explore a number of ML approaches (e.g., Random Forest, SVMs) to understand the patterns between cholera outbreaks and the environmental drivers we have identified.
  4. Visualize model results and share findings. We’ll provide visuals of our model results and share our findings in a collection of Jupyter notebooks.
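
The aggregation described in step 2 can be illustrated with a small, standard-library-only sketch. It assumes the satellite values have already been reduced to one value per district per day; the record layout, district names, and values are hypothetical and are not the repository's actual pipeline:

```python
from collections import defaultdict
from datetime import date
from statistics import mean


def monthly_district_means(records):
    """Average daily values into one value per (district, year, month).

    `records` is an iterable of (district, day, value) tuples, where `day`
    is a datetime.date. This layout is hypothetical, chosen for illustration.
    """
    groups = defaultdict(list)
    for district, day, value in records:
        groups[(district, day.year, day.month)].append(value)
    return {key: mean(values) for key, values in sorted(groups.items())}


# Made-up daily soil moisture values for two hypothetical districts.
daily_soil_moisture = [
    ("district_a", date(2015, 1, 1), 0.20),
    ("district_a", date(2015, 1, 2), 0.30),
    ("district_a", date(2015, 2, 1), 0.40),
    ("district_b", date(2015, 1, 1), 0.10),
]

for key, value in monthly_district_means(daily_soil_moisture).items():
    print(key, round(value, 3))
```

Keying the result by (district, year, month) puts the environmental values at the same level as monthly, district-level outbreak data, so the two can be joined on that key.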

Hypothesis

Environmental factors alone won’t unravel this very complex relationship, but they can help identify spatio-temporal patterns that could assist in allocating resources and support.

Setting up your local environment

If you are running macOS, consider installing Homebrew if you have not already done so, as some of the macOS-specific instructions below use Homebrew to simplify the setup process.

Install Git Large File Storage

This repository contains files larger than 50 MB, and thus requires the use of Git Large File Storage (LFS) for managing them. In order to obtain these large files during repository cloning, you must [install Git Large File Storage].

On macOS, the easiest way to install Git LFS is via Homebrew:

brew install git-lfs

Once installed, initialize it:

git lfs install

To track new types of large files (larger than 50 MB), you must tell Git LFS to track them, typically by extension. For example, to track all Shapefiles:

git lfs track "*.shp"

You can then add and commit such files like any other file in the repository.

Note that the git lfs track command will modify the .gitattributes file when given a new pattern to track. When this occurs, be sure to add .gitattributes to your commit, along with the newly tracked large files.
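
After running git lfs track "*.shp", the .gitattributes file typically gains a line such as `*.shp filter=lfs diff=lfs merge=lfs -text`. As a quick check of which patterns are currently routed through Git LFS, the following sketch parses the contents of a .gitattributes file. The helper name is hypothetical and the parsing is deliberately simple:

```python
def lfs_patterns(gitattributes_text: str) -> list[str]:
    """Return the patterns that a .gitattributes file routes through Git LFS."""
    patterns = []
    for line in gitattributes_text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        pattern, *attributes = line.split()
        if "filter=lfs" in attributes:
            patterns.append(pattern)
    return patterns


example = "*.shp filter=lfs diff=lfs merge=lfs -text\n*.py text\n"
print(lfs_patterns(example))
```

To inspect a real checkout, pass in the file contents, for example `lfs_patterns(pathlib.Path(".gitattributes").read_text())`.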

[Install Git Large File Storage]: https://docs.github.com/en/repositories/working-with-files/managing-large-files/installing-git-large-file-storage

Install conda and create conda environment

Install conda. The recommended way to do this is by installing miniforge:

brew install miniforge
conda init

Then, close your terminal and open a new terminal session.

Once conda is installed, run the following commands in your terminal from the root of this repository to create the environment used for this repository:

conda env create
conda activate geo-ds-cholera

Whenever you modify the environment.yml file, run the following command to update your conda environment:

conda env update

If you haven't already done so, create a .env file at the root of this repository (it is ignored by git). You can do this by making a copy of .env-example, like so:

# This copies .env-example to .env, unless .env already exists
cp -n .env-example .env

Edit your .env file, setting values as appropriate for yourself. This file is not committed to git, and thus is not shared with others, because it is intended to contain sensitive, user-specific values. Some parts of the code in this repository load values from your .env file, and thus may either fail to run or skip certain parts of their logic if your .env file does not contain properly configured values.
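
To illustrate how code might read such values and skip a step when one is missing, here is a minimal standard-library sketch. It is not the repository's actual loading code (which may rely on a dedicated library), and the variable names `EXAMPLE_API_KEY` and `EMPTY_EXAMPLE_VALUE` are hypothetical:

```python
import os
from typing import Optional


def read_env_file(text: str) -> dict[str, str]:
    """Parse simple KEY=VALUE lines, ignoring blank lines and comments."""
    values = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        values[key.strip()] = value.strip().strip("'\"")
    return values


def get_setting(values: dict[str, str], name: str) -> Optional[str]:
    """Look up a setting in the parsed file, then in the process environment.

    Missing and empty values are both reported as None.
    """
    return values.get(name) or os.environ.get(name) or None


# Hypothetical .env contents, inlined here so the example is self-contained.
settings = read_env_file("# example\nEXAMPLE_API_KEY='abc123'\nEMPTY_EXAMPLE_VALUE=\n")

if get_setting(settings, "EMPTY_EXAMPLE_VALUE") is None:
    print("EMPTY_EXAMPLE_VALUE is not configured; skipping the step that needs it")
```

Treating an empty value the same as a missing one mirrors the behavior described above, where unconfigured values cause a step to fail or be skipped.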

In order to allow notebooks in this repository to import modules in this repository, you must perform a local, editable pip install:

pip install -e .
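
To confirm that the editable install worked, you can check from a notebook cell whether Python can locate the package. This is a small sketch; the package name `cholera` is an assumption inferred from the src/cholera path mentioned later in this README:

```python
import importlib.util


def is_importable(module_name: str) -> bool:
    """Return True if Python can locate the named top-level module."""
    return importlib.util.find_spec(module_name) is not None


# "cholera" is an assumed package name, inferred from the src/cholera path.
print(is_importable("cholera"))
```

If this prints False, re-run the pip command above from the root of the repository with the conda environment activated.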

Install pre-commit and pre-commit hooks

To aid development, this repository uses the pre-commit tool, which is installed into the conda environment created above. To install the pre-commit hooks defined in .pre-commit-config.yaml, you must run the following command from the root of your cloned repository working directory:

pre-commit install --install-hooks

To check your changes before committing them, you can run the pre-commit hooks manually with the command below. Note that the hooks ignore files that are untracked by git, so if there are untracked files that you wish to check, you must first stage them with git add:

pre-commit run -a

Reproducing the Results

After setting up your local environment (see above), you may reproduce our results as follows:

  1. Run exploration/zonal-means.ipynb to reproduce the individual zonal means CSV files under the data directory. The inputs to this notebook are the outbreaks.csv file and the shapefile found under the src/cholera/resources path.
  2. Run exploration/aggregate-zonal-means.ipynb to reproduce the aggregate zonal means CSV file under the data directory. The inputs are the individual zonal means produced by the previous step.
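
The two steps above can also be run non-interactively. The following is a sketch only, assuming the `jupyter` command and nbconvert are available in the activated conda environment (the source does not state this) and that the script is run from the root of the repository. The helper names are hypothetical:

```python
import subprocess

# Notebooks listed in the order given above.
NOTEBOOKS = [
    "exploration/zonal-means.ipynb",
    "exploration/aggregate-zonal-means.ipynb",
]


def build_command(notebook: str) -> list[str]:
    """Build an nbconvert command that executes a notebook in place."""
    return [
        "jupyter", "nbconvert",
        "--to", "notebook",
        "--execute",
        "--inplace",
        notebook,
    ]


def run_all(notebooks=NOTEBOOKS) -> None:
    """Execute each notebook in order, stopping at the first failure."""
    for notebook in notebooks:
        subprocess.run(build_command(notebook), check=True)
```

Calling `run_all()` executes the notebooks sequentially. Because `check=True` raises on a non-zero exit code, the aggregate step does not run if the individual zonal means step fails.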