The Python environment for the repository can be created using either conda or virtualenv, by running from the root of the repo:
Using conda:
conda create --name=ml-fuel python=3.8
conda activate ml-fuel
Using virtualenv:
python3 -m venv env
source env/bin/activate
Then install the dependencies:
pip install -U pip
pip install -r requirements.txt
This includes all the packages required for running the code in the repository, except for the notebooks in the folder notebooks/ecmwf (see notebooks/ecmwf/README.md for the additional dependencies to install).
The content of this repository is split into 2 types of experiments:
Seven years of global historical data, from 2010 to 2016, will be used for developing the machine learning models. All data used in this project is proprietary and NOT meant for public release. Xarray, NumPy and netCDF libraries are used for working with the multi-dimensional geospatial data.
The data split into training, testing and validation is currently:
To change the split, modify data_split() in src/utils/generate_io_arrays.py, and the month list in src/test.py used during inference.
Raw data should first be processed using notebooks in notebooks/preprocess/*.
Entry point for the pre-processing script for the ML pipeline is src/pre-processing.py.
Args description:
* `--data_path`: Path to the data files.
src/utils/data_paths.py defines the file paths for the features used in training and the path of fuel_load.nc, which will be created.
Output:
* fuel_load.nc file for Fuel Load Data (Burned Area * Above Ground Biomass).
* Saves the following files for the Tropics & Mid-Latitudes regions respectively, where {type} is 'tropics' or 'midlats'.
Save Directory root_path/{type}
* {type}_train.csv
* {type}_val.csv
* {type}_test.csv
Save Directory root_path/infer_{type}
* {type}_infers_July.csv
* {type}_infers_Aug.csv
* {type}_infers_Sept.csv
* {type}_infers_Oct.csv
* {type}_infers_Nov.csv
* {type}_infers_Dec.csv
Here root_path is the root save path provided to pre-processing.py.
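The naming scheme above can be sketched with pathlib. This is only an illustration of the documented layout: `expected_outputs` is a hypothetical helper that is not part of the repository, and the `"data"` root in the usage example is a placeholder for whatever save path was given to pre-processing.py.

```python
from pathlib import Path

# Names taken from the file listing above.
REGION_TYPES = ("tropics", "midlats")
SPLITS = ("train", "val", "test")
INFER_MONTHS = ("July", "Aug", "Sept", "Oct", "Nov", "Dec")


def expected_outputs(root_path, region_type):
    """Return the CSV paths the pre-processing step is documented to write.

    Hypothetical helper: it only builds paths, it does not touch the disk.
    """
    if region_type not in REGION_TYPES:
        raise ValueError(f"region_type must be one of {REGION_TYPES}")
    root = Path(root_path)
    # root_path/{type}/{type}_{split}.csv
    split_files = [root / region_type / f"{region_type}_{s}.csv" for s in SPLITS]
    # root_path/infer_{type}/{type}_infers_{month}.csv
    infer_files = [
        root / f"infer_{region_type}" / f"{region_type}_infers_{m}.csv"
        for m in INFER_MONTHS
    ]
    return split_files + infer_files


if __name__ == "__main__":
    for path in expected_outputs("data", "tropics"):
        print(path)
```

A check such as `all(p.exists() for p in expected_outputs(root, "midlats"))` could then be used to confirm that pre-processing finished before training starts.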
Entry-point for training is src/train.py
Args description:
* `--model_name`: Name of the model to be trained ("CatBoost" or "LightGBM").
* `--data_path`: Data directory where all the input (train, val, test) .csv files are stored.
* `--exp_name`: Name of the training experiment used for logging.
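As a rough sketch, the training arguments above could be assembled into a command line from Python. The flag names come from the list above; `build_train_command`, the `data/tropics` directory and the experiment name are assumptions made for illustration only.

```python
import shlex
import sys

# Model names accepted by --model_name, per the argument description above.
SUPPORTED_MODELS = ("CatBoost", "LightGBM")


def build_train_command(model_name, data_path, exp_name):
    """Build the argument list for src/train.py (hypothetical helper)."""
    if model_name not in SUPPORTED_MODELS:
        raise ValueError(f"model_name must be one of {SUPPORTED_MODELS}")
    return [
        sys.executable, "src/train.py",
        "--model_name", model_name,
        "--data_path", data_path,
        "--exp_name", exp_name,
    ]


if __name__ == "__main__":
    # Placeholder values; substitute your own data directory and experiment name.
    cmd = build_train_command("LightGBM", "data/tropics", "tropics-baseline")
    print(shlex.join(cmd))
    # To launch training from the repository root, one could then run:
    # import subprocess; subprocess.run(cmd, check=True)
```

Building the command as a list rather than a single string avoids quoting problems when paths contain spaces.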
Entry-point for inference is src/test.py
Args description:
* `--model_name`: Name of the model to be used for inference ("CatBoost" or "LightGBM").
* `--model_path`: Path to the pre-trained model.
* `--data_path`: Valid data directory where all the test .csv files are stored.
* `--results_path`: Directory where the result inference .csv files and .html visualizations are going to be stored.
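After inference, the contents of the results directory can be inspected with a few lines of Python. This is a minimal sketch that assumes the .csv and .html files sit directly inside the directory passed as `--results_path`; `collect_results` and the `"results"` directory name are hypothetical.

```python
from pathlib import Path


def collect_results(results_path):
    """Group files in results_path into CSV outputs and HTML visualizations."""
    root = Path(results_path)
    return {
        "csv": sorted(root.glob("*.csv")),
        "html": sorted(root.glob("*.html")),
    }


if __name__ == "__main__":
    # "results" is a placeholder for the directory given as --results_path.
    found = collect_results("results")
    for kind, files in found.items():
        print(f"{kind}: {len(files)} file(s)")
        for f in files:
            print("  ", f.name)
```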
Pre-trained models are available at:
Notebooks for training and inference:
* CatBoost for Mid-Latitudes
* LightGBM for Tropics
.nc format, containing data from 2010-16 and in 0.25x0.25 grid cell resolution.
notebooks/EDA_pre-processed_data.ipynb.
src/utils/data_paths.py. Further, the path variable needs to be added to either the time-dependent or the time-independent list (depending on which category it belongs to) inside export_feature_paths().
Documentation is available at: https://ml-fuel.readthedocs.io/en/latest/index.html.
We employ an AutoML approach to predict dry matter using the H2O.ai AutoML framework. Please refer to notebooks/ecmwf/README.md for a description of this experiment, instructions to install additional dependencies and the notebooks with the steps to perform the experiment.
This repository was developed by Anurag Saha Roy (@lazyoracle) and Roshni Biswas (@roshni-b) for the ESA-SMOS-2020 project. Contact email: info@wikilimo.co. The repository is now maintained by the Wildfire Danger Forecasting team at the European Centre for Medium-Range Weather Forecasts.