ECMWFCode4Earth / challenges_2019

Have a look at the challenges proposed for the 2019 edition of ECMWF's Summer of Weather Code.

Challenge #7 - Weather and Climate Benchmarks for Machine Learning #7

Closed: jwagemann closed this issue 3 years ago

jwagemann commented 5 years ago

Challenge 7

Weather and Climate benchmarks for machine learning

Development of software and datasets, plus documentation and a small webpage to interface with potential users of machine learning benchmarks.

Goal: To develop and publish benchmark datasets for machine learning applications in weather and climate modelling that can be used as a testbed for new approaches by the community.


Mentors: @dueben

Skills required:
- Some Fortran and Python coding
- A good understanding of how to handle GRIB and NetCDF files (see the short sketch after this list)
- An interest in machine learning AND weather and climate models
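For orientation, a minimal sketch of how GRIB and NetCDF files are typically handled in Python, assuming the xarray and cfgrib packages are available; the file names are placeholders, not files provided by the challenge:

```python
import xarray as xr

# NetCDF files open natively with xarray (file name is a placeholder).
ds_nc = xr.open_dataset("model_output.nc")

# GRIB files can be opened via the cfgrib engine (requires ecCodes installed).
ds_grib = xr.open_dataset("model_output.grib", engine="cfgrib")

# Both return xarray Datasets, so downstream code can treat them uniformly.
print(ds_nc.data_vars)
print(ds_grib.data_vars)
```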

Challenge description

Why do we need weather and climate benchmark datasets for machine learning?

Recent developments in machine learning will likely have a strong impact on the way we perform weather forecasts in the future. Machine learning may be used much more frequently in data pre- and post-processing, data assimilation, model development and modelling in general, visualisation, and many other areas. A whole set of new machine learning tools is being tested for use in various aspects of weather models, the two most prominent families being deep neural networks and decision trees.

However, there is a big obstacle that makes it difficult to apply these techniques in weather and climate applications. Each family comes with many hyperparameters that need to be optimised by the user (for example the choice of activation and loss function, the number and width of layers of neurons, the amount of training data, the choice of dense, convolutional or adversarial networks…), and there is hardly any experience of how to design machine learning solutions that are optimal for our domain. In other domains, the zoo of machine learning methods is tested against benchmark datasets, such as MNIST for the recognition of handwritten digits (https://en.wikipedia.org/wiki/MNIST_database), to compare different approaches and to understand which tools are optimal for specific domains and applications. While there are many experts in machine learning who would like to test new methods on weather- and climate-related data, no such benchmark datasets for weather and climate applications are available yet.

What is a weather and climate benchmark dataset?

A weather and climate benchmark dataset consists of input/output pairs of a physical parametrisation scheme that is used within the OpenIFS model. The datasets are saved during a model integration and brought into a shape that allows machine learning tools, for example random forests or deep neural networks, to be trained to emulate the specific parametrisation scheme with minimal effort. The data can be downloaded for testing by interested users via the internet.
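As an illustration of the intended workflow, here is a minimal, hypothetical sketch in which input/output pairs of a parametrisation scheme are loaded from NetCDF and a random forest is fitted to emulate the scheme. The file names, variable names and array shapes are assumptions for the sake of the example, not part of the challenge specification:

```python
import xarray as xr
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split

# Hypothetical benchmark files: stacked (sample, feature) arrays of
# parametrisation inputs (e.g. profiles of T, q, p) and outputs (tendencies).
X = xr.open_dataset("inputs.nc")["features"].values   # shape: (n_samples, n_inputs)
y = xr.open_dataset("outputs.nc")["targets"].values   # shape: (n_samples, n_outputs)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

# A random forest as one of the candidate emulators mentioned above.
model = RandomForestRegressor(n_estimators=100, n_jobs=-1)
model.fit(X_train, y_train)
print("R^2 on held-out pairs:", model.score(X_test, y_test))
```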

A short documentation is written for each benchmark, providing information on the input and output fields and the governing equations of the underlying parametrisation scheme. A diagnostic/loss function of interest is specified and explained to allow a fair comparison between different methods. Furthermore, a benchmark emulator is presented in the form of a trivial neural network configuration, and a recipe for how to feed such a network back into OpenIFS is provided to allow testing of new machine learning solutions within free-running simulations of the model.
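Such a trivial baseline emulator could, for example, look like the following sketch. The layer sizes, the input/output dimensions and the mean-squared-error loss are placeholder choices; the actual diagnostic/loss function would be fixed per benchmark so that different methods remain comparable:

```python
import tensorflow as tf

# Hypothetical dimensions, e.g. one value per vertical model level.
n_inputs, n_outputs = 137, 137

# A deliberately simple dense network as the baseline emulator.
baseline = tf.keras.Sequential([
    tf.keras.layers.Dense(256, activation="relu", input_shape=(n_inputs,)),
    tf.keras.layers.Dense(256, activation="relu"),
    tf.keras.layers.Dense(n_outputs),  # linear output for regression targets
])

# Mean squared error is used here purely as a placeholder; each benchmark
# would specify its own loss/diagnostic for fair comparison.
baseline.compile(optimizer="adam", loss="mse")
# baseline.fit(X_train, y_train, epochs=10, validation_split=0.1)
```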

If possible, the benchmark documentation and background information will be published as a journal paper (e.g. in Geoscientific Model Development).

Datasets will be developed for several parametrisation schemes, such as radiation, cloud physics, and the land surface scheme.

What infrastructure is required?

hydrogo commented 5 years ago

Dear @dueben and @jwagemann, is this challenge limited to using OpenIFS data, or is it possible to propose the development of a benchmark dataset that would utilise ERA5 data together with some other open data sources?

Thank you.

Best, Georgy

dueben commented 5 years ago

Dear Georgy,

Many thanks for your interest!

I guess it would be possible to also extract the relevant information from other datasets such as ERA5. However, it is important for this challenge that the training set represents a physical process that is relevant within weather and climate models. I guess it would be difficult to assemble all relevant information for input and output pairs from ERA data (in particular, the output fields will typically not be given). As an alternative to free-running OpenIFS simulations, a single-column model could be forced with ERA data to extract those fields.
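For reference, ERA5 fields that could serve as such forcing can be retrieved programmatically from the Copernicus Climate Data Store; a minimal sketch with the cdsapi package, where the variable and date selections are purely illustrative:

```python
import cdsapi

# Requires a free CDS account and credentials in ~/.cdsapirc.
c = cdsapi.Client()

# Placeholder request: single-level ERA5 fields that could help force a
# single-column model; variable/date choices are illustrative only.
c.retrieve(
    "reanalysis-era5-single-levels",
    {
        "product_type": "reanalysis",
        "variable": ["2m_temperature", "surface_pressure"],
        "year": "2018",
        "month": "01",
        "day": "01",
        "time": ["00:00", "12:00"],
        "format": "netcdf",
    },
    "era5_forcing.nc",
)
```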

I hope this clarifies the issue. Please let me know if you have further questions.

All the best, Peter

jwagemann commented 5 years ago

Hi @hydrogo , we have two challenges related to machine learning this year. Challenge #14 focuses on open datasets such as ERA5, and its focus is still very open, depending also on what the participant is interested in. The development of a benchmark dataset could be the focus there as well. HTH, Julia

Chiil commented 5 years ago

Good to hear about this great initiative at UCP2019. I do not fully understand the organisation yet. Will ECMWF provide the required GRIB/NetCDF files with input and output (and maybe intermediate results) to participants as part of the collaboration, and will the participants then work these out into a dataset suited for machine learning benchmarking?

dueben commented 5 years ago

Dear Chiel,

It would be optimal if the input/output pairs were generated in a joint effort between the participant and ECMWF. This should also involve domain experts for the parametrisation scheme in question, to make sure that all relevant fields are present. However, if this is a problem, the input/output pairs could also be generated and shared by ECMWF.

All the best, Peter

hydrogo commented 5 years ago

Dear @dueben and @jwagemann, thank you for the clarification. I am familiar with similar studies (e.g., Gentine et al., 2018), but unfortunately I am not an expert in meteorology and cannot provide a reliable formulation of parametrisation schemes in terms of input-output mappings. I will therefore probably try to participate in Challenge #14.

Gentine, P., Pritchard, M., Rasp, S., Reinaudi, G., & Yacalis, G. (2018). Could machine learning break the convection parameterization deadlock? Geophysical Research Letters, 45(11), 5742-5751.

tommylees112 commented 5 years ago

Dear @dueben and @jwagemann, I am not familiar with Fortran code but would love to contribute to this challenge. My first question is whether this is a big problem? Secondly, can time be spent at ECMWF in Reading, or is this all done remotely? Kind regards, Tommy

dueben commented 5 years ago

Hi Tommy, some experience with Fortran would probably help but is not strictly mandatory for the project. Maybe you can team up with others who know Fortran? Julia can correct me if I am wrong, but I guess there is no real limitation or requirement regarding visits to ECMWF. Since you are based in Oxford, it would be easy and useful to come to ECMWF once in a while to coordinate and give progress updates. However, this is again not really a requirement.

I hope this helps.

All the best, Peter

jwagemann commented 5 years ago

Hi @tommylees112 , @dueben is right. Even though the set-up of ESoWC is virtual/online, it is up to the mentor at ECMWF and the participant how to organise the coding phase. If both parties agree that in-person meetings would be helpful, there is no restriction against them. HTH, Julia

jwagemann commented 5 years ago

REMINDER: The deadline to register and submit your proposal is this upcoming Sunday, 21 April, at 23:59 GMT!

The application process has two steps: (1) register and (2) submit a proposal.

Applications without a submitted proposal will not be taken into consideration! We are looking forward to your proposal!