EsperanzaCuartero commented 1 year ago

Challenge 23 - FloodMule: a machine learning emulator of the LISFLOOD hydrological model

Stream 2 - Machine Learning for Earth Science

Goal

Emulate LISFLOOD to significantly reduce the model's running time for a given configuration

Mentors and skills

Mentors: Corentin Carton, Cinzia Mazzetti, Matthew Chantry, Juan Pereira Colonese, Francesca Moschini, Eleanor Hansford
Skills required:
- Good knowledge of machine learning approaches and libraries
- Good knowledge of Python
- Knowledge of hydrological modelling is not required but would be an advantage

Note: Only nationals or residents from the ECMWF Member States and Co-operating States are eligible to participate (see Terms and Conditions).

Challenge description

LISFLOOD is a spatially distributed (gridded) hydrological rainfall-runoff model that can simulate the main hydrological processes occurring in a catchment. LISFLOOD explicitly considers the spatial distribution of physical properties across the catchments to provide estimates of river discharge and other hydrological variables such as snow accumulation, soil moisture, etc. Driven by meteorological forcing data (precipitation, temperature and evaporation), it calculates a complete water balance for every grid cell of the computational domain.

Running the LISFLOOD hydrological model at high resolution and global (or pan-European) scale, as will be done in the next versions of EFAS and GloFAS, is challenging because the running time becomes too long for an operational context. Instead of optimising the current model, which would only give incremental improvements, emulating the hydrological model using machine learning could give us a speedup of several orders of magnitude, hopefully with limited or no degradation of results.

The emulator would mimic the hydrological model for a given configuration, meaning:

- A frozen version of LISFLOOD
- A fixed domain and resolution
- A fixed set of static maps (gridded) that describe the hydro-morphological characteristics of the river basins, including the parameter maps obtained through the model calibration process
- A single temporal step, removing the temporal complexity of the problem

This would result in a simple workflow for the emulator with the following inputs:

- Initial conditions given through LISFLOOD state maps
- LISFLOOD forcing maps for one step, such as temperature, precipitation, etc.

The emulator, like the hydrological model, would provide the following outputs:

- State maps representing different variables of the hydrological model, which could potentially be used as the initial condition for a next step
- Some additional maps for variables such as discharge, snow melt, etc.
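
The single-step workflow above can be sketched as a function that takes state maps and one step of forcing maps and returns new state maps plus additional output maps. The sketch below is only an illustration under stated assumptions: the names `StepFn`, `rollout` and `toy_step` are hypothetical, maps are stored as plain nested lists for simplicity, and `toy_step` is a trivial per-cell bucket standing in for a trained emulator (it is not LISFLOOD and does no routing).

```python
from typing import Callable, Dict, List, Tuple

# A "map" is a 2D grid stored as a list of rows (real code would likely use arrays).
Grid = List[List[float]]
Maps = Dict[str, Grid]

# Hypothetical interface: (state maps, forcing maps for one step)
# -> (new state maps, additional output maps such as discharge).
StepFn = Callable[[Maps, Maps], Tuple[Maps, Maps]]


def rollout(step: StepFn, initial_state: Maps, forcings: List[Maps]) -> Tuple[Maps, List[Maps]]:
    """Apply a single-step emulator repeatedly, feeding each output state back in."""
    state = initial_state
    outputs = []
    for forcing in forcings:
        state, extra = step(state, forcing)
        outputs.append(extra)
    return state, outputs


def toy_step(state: Maps, forcing: Maps) -> Tuple[Maps, Maps]:
    """Stand-in for a trained emulator: an independent linear bucket per cell."""
    k = 0.1  # hypothetical fraction of storage released per step
    storage, precip = state["storage"], forcing["precipitation"]
    # Add precipitation to storage, then release a fraction k as discharge.
    filled = [[s + p for s, p in zip(srow, prow)] for srow, prow in zip(storage, precip)]
    discharge = [[k * v for v in row] for row in filled]
    new_storage = [[(1 - k) * v for v in row] for row in filled]
    return {"storage": new_storage}, {"discharge": discharge}


# Two steps on a 1x2 grid with made-up values.
state0 = {"storage": [[10.0, 0.0]]}
forcings = [{"precipitation": [[0.0, 5.0]]}, {"precipitation": [[0.0, 0.0]]}]
final_state, extras = rollout(toy_step, state0, forcings)
```

Keeping the emulator to one step and handling time in an outer loop mirrors the "single temporal step" constraint; the trade-off is that errors can accumulate when output states are fed back in over many steps.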

This well-defined problem offers many avenues to explore for training the model: we could build a training dataset by feeding any set of initial conditions and forcings into the hydrological model and using the outputs to train the emulator. For instance, the ML training could be based on one of the following approaches:

- Using existing datasets of forcing and state files from reanalysis, forecasts, reforecasts, etc.
- Creating a stochastic dataset around climatological data obtained through the reanalysis

These two approaches would already give us thousands of data points (i.e. time slices) to train the model, even millions if the stochastic approach is successful.
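
As a rough illustration of the stochastic approach, the sketch below draws synthetic forcing maps as Gaussian noise around a climatological mean. Everything here is an assumption made for the example: the function name `sample_forcing`, the made-up mean and spread values, and above all the choice of independent per-cell Gaussian noise, which ignores spatial and temporal correlation and would likely be too crude for a real weather generator.

```python
import random
from typing import List

Grid = List[List[float]]


def sample_forcing(clim_mean: Grid, clim_std: Grid, rng: random.Random,
                   non_negative: bool = True) -> Grid:
    """Draw one synthetic forcing map as independent Gaussian noise around climatology."""
    sample = []
    for mean_row, std_row in zip(clim_mean, clim_std):
        row = [rng.gauss(m, s) for m, s in zip(mean_row, std_row)]
        if non_negative:
            # Clip at zero for variables such as precipitation that cannot be negative.
            row = [max(0.0, v) for v in row]
        sample.append(row)
    return sample


rng = random.Random(0)  # fixed seed so the synthetic dataset is reproducible
precip_mean = [[2.0, 0.5], [0.0, 4.0]]  # made-up climatological means
precip_std = [[1.0, 0.5], [0.0, 2.0]]   # made-up climatological spreads
samples = [sample_forcing(precip_mean, precip_std, rng) for _ in range(1000)]
```

Each sampled forcing map, paired with a set of state maps, would then be run through the hydrological model to produce one training example.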

As a continental domain is composed of thousands of hydrological catchments, the approach could first be tested on small basins, then scaled up to larger basins and finally to the full EFAS or GloFAS computational domain.

The details of the implementation, such as the data flow or the ML approach and libraries, will be discussed during the project. The candidates will be provided with a utility, interfacing with the LISFLOOD hydrological model, that will generate training datasets for the ML kernels.

Training/evaluation workflow: pic_FloodMule

simonmoulds commented 1 year ago

Hi team – thanks for putting together this challenge. Before I prepare my proposal I had a few questions:

- Do you have a specific ML architecture in mind? There are many examples of emulators using conventional ML (e.g. SVM, RF), but more recently examples of deep learning approaches (e.g. CNN, LSTM) – e.g. to emulate ParFlow.
- (Related to above) Is it your intention to emulate all of LISFLOOD's model states/fluxes, or only a subset?
- Will the emulator also perform river routing, or is the intention to focus only on the grid water balance?
- Will it be important to constrain the emulator with any physical laws (e.g. mass conservation)?
- What is the target spatial resolution? At this resolution will the model still represent subgrid heterogeneity (e.g. through fractional land cover)?
- Reading the challenge, it seems that having a stochastic weather generator could be valuable to increase the size of the training data. Is developing such a tool part of the challenge?

Many thanks for your help!

corentincarton commented 1 year ago

Hi @simonmoulds,

Thanks for your interest! Here are some answers to your questions:

Do you have a specific ML architecture in mind? There are many examples of emulators using conventional ML (e.g. SVM, RF), but more recently examples of deep learning approaches (e.g. CNN, LSTM) – e.g. to emulate ParFlow.

We don’t have a fixed architecture in mind, but we suspect that a deep learning approach will be most suitable to capture the complexity and do so quickly. We encourage proposals to specify which architectures the participants think will be suitable.

(Related to above) Is it your intention to emulate all of LISFLOOD's model states/fluxes, or only a subset?

The intention is to emulate all LISFLOOD states/fluxes that are necessary to restart the model (35 in total), plus river discharge.

Will the emulator also perform river routing, or is the intention to focus only on the grid water balance?

The goal of the emulator is to emulate the LISFLOOD model as a whole, which includes river routing.

Will it be important to constrain the emulator with any physical laws (e.g. mass conservation)?

Mass conservation would be nice to have. Feel free to make suggestions in your proposal.
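
One possible way to approach this, sketched under simplifying assumptions: compute a per-cell water balance residual and use its mean square either as a diagnostic or as an extra loss term. The function names and the four-term balance (storage before and after, total inflow, total outflow) are hypothetical; the real LISFLOOD balance has many more terms, and with river routing the inflow would have to include water arriving from upstream cells.

```python
from typing import List

Grid = List[List[float]]


def water_balance_residual(storage_before: Grid, storage_after: Grid,
                           inflow: Grid, outflow: Grid) -> Grid:
    """Per-cell residual: (storage_after - storage_before) - (inflow - outflow).

    A residual of zero means the change in storage is fully explained
    by the fluxes, i.e. mass is conserved in that cell.
    """
    return [
        [(sa - sb) - (i - o) for sb, sa, i, o in zip(rb, ra, ri, ro)]
        for rb, ra, ri, ro in zip(storage_before, storage_after, inflow, outflow)
    ]


def mass_penalty(residual: Grid) -> float:
    """Mean squared residual over all cells."""
    values = [v for row in residual for v in row]
    return sum(v * v for v in values) / len(values)


# Made-up 1x2 example: the first cell is balanced, the second is not.
residual = water_balance_residual(
    storage_before=[[10.0, 5.0]],
    storage_after=[[12.0, 4.0]],
    inflow=[[3.0, 0.0]],
    outflow=[[1.0, 0.5]],
)
penalty = mass_penalty(residual)
```

A soft penalty like this only encourages conservation; enforcing it exactly would need a different design, for example predicting fluxes and deriving the storage change from them.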

What is the target spatial resolution? At this resolution will the model still represent subgrid heterogeneity (e.g. through fractional land cover)?

The target resolution is 1 arcmin (~1.5 km). Subgrid heterogeneity is represented using fractions.
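
For reference, the physical size of a 1 arcmin cell depends on latitude. The small sketch below estimates it on a spherical Earth; the function name `cell_size_km` is made up for the example, and the 6371 km mean radius with a spherical approximation is an assumption that ignores the ellipsoid.

```python
import math

EARTH_RADIUS_KM = 6371.0  # mean Earth radius, spherical approximation


def cell_size_km(lat_deg: float, res_arcmin: float = 1.0):
    """Return (north-south, east-west) extent in km of a grid cell at a given latitude."""
    res_rad = math.radians(res_arcmin / 60.0)  # arcminutes -> degrees -> radians
    north_south = EARTH_RADIUS_KM * res_rad
    # Meridians converge towards the poles, so the east-west extent shrinks with cos(latitude).
    east_west = north_south * math.cos(math.radians(lat_deg))
    return north_south, east_west


ns_equator, ew_equator = cell_size_km(0.0)
ns_60, ew_60 = cell_size_km(60.0)
```

Under these assumptions the north-south extent is about 1.85 km everywhere, while the east-west extent at 60° latitude is half of that.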

Reading the challenge, it seems that having a stochastic weather generator could be valuable to increase the size of the training data. Is developing such a tool part of the challenge?

ECMWF will provide the training datasets. Augmentations to the dataset could be an interesting tool, but we would prioritise getting a model trained on the existing data to assess whether this is sufficient. If there is time we could explore a stochastic weather generator.

Don't hesitate to ask if you have more questions!

simonmoulds commented 1 year ago

Thanks @corentincarton - this is really helpful. I will start to prepare my submission and get in touch with any other queries as they arise.

simonmoulds commented 1 year ago

Hi @corentincarton - just wanted to apologise for not submitting an application in the end. I was recently offered a lectureship in hydrology at the U of Edinburgh and I decided that, along with my current position at Oxford, I wouldn't have the time to do this project justice.

Did you get any other applicants? If not, and if this is something you would like to continue to pursue, I may be able to devote some time to it (albeit at a slower pace than the Code4Earth timeline). Let me know!

Best wishes, Simon

corentincarton commented 1 year ago

Thanks for this answer @simonmoulds, we would be happy to further discuss this! We'll contact you in private :)

Congrats on your position in Edinburgh! Corentin

ECMWFCode4Earth / challenges_2023