RubenRT7 commented 9 months ago

Challenge 23 - Using Machine Learning to Emulate the Earth’s Surface

Stream 2 - Machine Learning for Earth Sciences applications

Goal

Evaluating and improving the performance of ECMWF’s current land surface Machine Learning model prototype.

Mentors and skills

Mentors: Ewan Pinnington, Christoph Herbert, Patricia de Rosnay, Peter Weston, Sébastien Garrigues, Souhail Boussetta, David Fairbairn (all ECMWF)
Skills required:
- Experience with Python programming
- Expertise in statistical analysis
- Background in Earth Sciences or related fields
- Some experience with Machine Learning desirable

Challenge description

Machine Learning (ML) is becoming increasingly important for numerical weather prediction (NWP), and ML-based models now reach forecast scores similar to or better than those of state-of-the-art physical models. ECMWF has intensified its activities in the application of ML models for atmospheric forecasting and developed the Artificial Intelligence/Integrated Forecasting System (AIFS). To harness the potential of ML for land modelling and data assimilation activities at ECMWF, a first ML emulator prototype has been developed (Pinnington et al., AMS Annual Meeting 2024). The ML model was trained on the "offline" ECMWF Land Surface Modelling System (ECLand) using a preselected ML training database. The current prototype is based on model increments, without any further temporal constraints, and provides a cheap alternative to physical models. It opens up many applications, such as the optimization of model parameters and the generation of cost-effective ensembles and land surface initial conditions for NWP.
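
One possible reading of an increment-based emulator is that the model predicts the change in the land state over one time step, and the state is advanced by adding that change. The sketch below illustrates only this idea; `rollout` and `toy_increment` are hypothetical names, and `toy_increment` is a stand-in for the trained network, not the actual prototype.

```python
def rollout(initial_state, forcings, predict_increment):
    """Advance a state by repeatedly adding predicted increments.

    `predict_increment(state, forcing)` is assumed to return the change
    in each state variable over one time step.
    """
    states = [list(initial_state)]
    for forcing in forcings:
        current = states[-1]
        increment = predict_increment(current, forcing)
        # Next state = current state + predicted increment.
        states.append([s + d for s, d in zip(current, increment)])
    return states


def toy_increment(state, forcing):
    """Hypothetical stand-in for the ML model: relax 10% towards the forcing."""
    return [0.1 * (f - s) for s, f in zip(state, forcing)]


if __name__ == "__main__":
    trajectory = rollout([0.0], [[10.0], [10.0], [10.0]], toy_increment)
    for step, state in enumerate(trajectory):
        print(step, state)
```

Because each step depends only on the current state and forcing, errors can accumulate over a long rollout, which is one reason a targeted evaluation of temporal behaviour is useful.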

So far, a qualitative comparison between ECLand-based and emulated fields has been performed on a subset of sites, which revealed that the time series of land variables match well in terms of dynamic range and general trend behaviour. However, a more targeted evaluation is required to assess the performance of the land emulator prototype. The aim is to understand how well the model reproduces the spatial and temporal patterns of ECLand and how it performs against in-situ observations.

Scope of the challenge:

The successful team will have the opportunity to contribute to the current efforts of the coupled assimilation and modelling teams in evaluating and improving the ML emulator prototype. The training database and model fields will be available in Zarr format on the European Weather Cloud. More information on the emulator can be found here: ec-land-emulator-git.pptx

What we offer:

• Advanced Python skills: packages Xarray, Zarr, Dask, PyTorch

• Advancing first-of-its-kind land ML prototype

• Tools for land model verification (LANDVER package)

The following steps are proposed to be carried out by the candidate(s) as part of the challenge:

• Comparison between emulated and ECLand variables: evaluation regarding different soil and vegetation types; capability of capturing the diurnal cycle and seasonal variability, revealing patterns of differences and similarities

• Assessment of the performance of the ML emulator: validation with in-situ soil temperature, soil moisture and surface flux observations using the land verification software (LANDVER) or possibly other ground-based observations (e.g. snow) using different verification metrics (correlation, RMSE)

• Testing the benefit of introducing time-varying Leaf Area Index (LAI): Apply the ML emulator using time-varying LAI as an input and assess the performance against ECLand which uses a fixed vegetation climatology

• Extension: selection of input features and target variables for model training; hyperparameter tuning and updating architecture; retraining of the ML model to improve selected variables, e.g. snow cover fraction, against observations and/or reanalysis.
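
The verification metrics named in the steps above (correlation, RMSE) can be sketched in a few lines. This is a minimal illustration in plain Python, not the LANDVER implementation; the function names `rmse` and `pearson_r` and the sample values are assumptions made for the example.

```python
import math


def rmse(pred, ref):
    """Root-mean-square error between two equal-length sequences."""
    if len(pred) != len(ref) or not pred:
        raise ValueError("inputs must be non-empty and of equal length")
    return math.sqrt(sum((p - r) ** 2 for p, r in zip(pred, ref)) / len(pred))


def pearson_r(pred, ref):
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(pred) != len(ref) or len(pred) < 2:
        raise ValueError("inputs must have equal length of at least 2")
    n = len(pred)
    mean_p = sum(pred) / n
    mean_r = sum(ref) / n
    cov = sum((p - mean_p) * (r - mean_r) for p, r in zip(pred, ref))
    var_p = sum((p - mean_p) ** 2 for p in pred)
    var_r = sum((r - mean_r) ** 2 for r in ref)
    if var_p == 0 or var_r == 0:
        raise ValueError("correlation is undefined for a constant series")
    return cov / math.sqrt(var_p * var_r)


if __name__ == "__main__":
    # Made-up soil moisture values, for illustration only.
    emulated = [0.21, 0.25, 0.30, 0.28]
    reference = [0.20, 0.26, 0.31, 0.27]
    print("RMSE:", rmse(emulated, reference))
    print("r:", pearson_r(emulated, reference))
```

RMSE is sensitive to bias and amplitude errors, while correlation measures only how well the temporal pattern is matched, so reporting both gives a fuller picture.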

amozaffari commented 8 months ago

Would it be possible to share the slides from Pinnington et al., AMS Annual Meeting 2024? Thanks! 🙏

yikuizh commented 8 months ago

Hi, we are very interested in the challenge. To prepare the proposal, could you please share more information about the model of Pinnington et al., AMS Annual Meeting 2024, such as the neural network structure, inputs/outputs, and temporal and spatial scales? Thank you very much.

chris-herb commented 8 months ago

Hello,

Yes, of course! Please have a look at a set of relevant slides.

Cheers, Christoph

From: Yikui Zhang @.> Date: Thursday, 7. March 2024 at 5:17 AM To: ECMWFCode4Earth/challenges_2024 @.> Cc: Christoph Herbert @.>, Assign @.> Subject: Re: [ECMWFCode4Earth/challenges_2024] Challenge 23 - Using Machine Learning to Emulate the Earth’s Surface (Issue #12)

da-ewanp commented 8 months ago

Have uploaded the slides here too for convenience 🙂! ec-land-emulator-git.pptx

da-ewanp commented 8 months ago

Please find the slides above with some more of this info 🙂. Thanks!

yikuizh commented 8 months ago

Thanks a lot!

yikuizh commented 8 months ago

Hi, I have another question about the main topic of this challenge. I noticed from the challenge description that three steps concern validation and only one concerns model development. Should validation of the current model be our main focus, or are we free to choose our own emphasis when preparing the proposal, with those steps serving only as a reference? Thank you very much. Yikui Zhang

da-ewanp commented 8 months ago

Hi Yikui! Thanks for your question. The challenge is flexible. We have specified more validation because we thought this would be more achievable within the scope of the Code4Earth challenge, and it will also be very useful for ongoing activities at ECMWF. However, if you are already confident with the technologies used for model training and with the development of MLPs in general, then there is definitely scope to focus more on model development and iteration too. Thanks, Ewan

tfohrmann commented 8 months ago

Hi, we are currently thinking about ways to do the verification, but are wondering what functionality the LANDVER package has? Is it used to bring the in-situ data into a format that can be compared to the model data? Does it already compute some statistics that can be used for verification? Thanks, Till

da-ewanp commented 8 months ago

Hi Till! Yes, the LANDVER package includes in-situ observations of soil moisture, soil temperature and surface fluxes, which are compared to model fields from the closest model grid point. It calculates many statistics, such as RMSE, MAE and correlation, and produces Taylor diagrams and bar charts of the results. We also have some model fields already processed; it would be good to compare the emulator with these in the first instance, to judge how well it mimics the full physical model and which fields it struggles to reproduce. Other novel sources of verification are welcome, and using other observations or packages is fine too! The emulator currently predicts soil moisture, soil temperature, 2m temperature, 2m dewpoint, skin temperature and snow cover fraction (and could quite easily be extended to additional flux variables). Thanks, Ewan
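
To make the "closest model grid point" comparison concrete, here is a minimal sketch in plain Python. It does not use or reproduce LANDVER; the helper names (`haversine_km`, `nearest_grid_index`, `mae`), the coordinates and the values are all assumptions for illustration.

```python
import math

EARTH_RADIUS_KM = 6371.0  # mean Earth radius, spherical approximation


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in km between two points given in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlam = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlam / 2) ** 2
    # Clamp to guard against rounding slightly above 1.
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(min(1.0, a)))


def nearest_grid_index(station, grid_points):
    """Index of the (lat, lon) grid point closest to the (lat, lon) station."""
    return min(
        range(len(grid_points)),
        key=lambda i: haversine_km(station[0], station[1], grid_points[i][0], grid_points[i][1]),
    )


def mae(model, obs):
    """Mean absolute error between two equal-length sequences."""
    if len(model) != len(obs) or not model:
        raise ValueError("inputs must be non-empty and of equal length")
    return sum(abs(m - o) for m, o in zip(model, obs)) / len(model)


if __name__ == "__main__":
    # Hypothetical grid points and station location.
    grid = [(50.0, 8.0), (50.3, 8.3), (51.0, 9.0)]
    station = (50.25, 8.2)
    idx = nearest_grid_index(station, grid)
    print("nearest grid point:", grid[idx])
    # Made-up soil temperature series (K) at that grid point and at the station.
    print("MAE:", mae([281.0, 283.5, 285.0], [280.5, 284.0, 286.0]))
```

A linear scan over grid points is fine for a handful of stations; for many stations on a global grid, a spatial index would be a more efficient choice.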

thisisrohan commented 8 months ago

Hi all, the link to the slides appears to be broken. Could a fresh one please be added? Thanks, Rohan

chris-herb commented 8 months ago

Hi Rohan,

Thanks for letting us know! I checked again and I was able to download the slides: https://github.com/ECMWFCode4Earth/challenges_2024/files/14522324/ec-land-emulator.pptx

Cheers, Christoph

da-ewanp commented 8 months ago

Thanks for spotting this Rohan, I have updated the link to the slides in the challenge description and my previous comment too 🙏

amozaffari commented 8 months ago

Hi, thank you for your quick response. I have a question regarding testing the impact of the time-varying LAI. Will ECMWF provide a time-varying LAI map to be fed into the emulator? Additionally, do we need to adapt the emulator to receive time-varying LAI as input, or is it already capable of accepting it?

chris-herb commented 8 months ago

Hi Amirpasha, maps of time-varying LAI will be provided. The current emulator is trained using fixed LAI, but it would be interesting to see the benefits of applying the current model, or training a new model, using time-varying LAI.

amozaffari commented 8 months ago

Thanks @chris-herb 🙏

SamMajumder commented 8 months ago

Hello mentors and fellow participants,

I am interested in this challenge, and I'd like to participate. I am a first-time participant in the Code4Earth challenge, and I am really interested in this particular project. I have a couple of specific questions regarding the submission process.

Do I independently start developing a proposal for this project and contact any of the mentors along the way if I have questions?

Do I need to run my proposal by the mentors of this project, prior to the final submission?

Any insight is greatly appreciated! I look forward to participating and all the best everyone!! :)

Sambadi

chris-herb commented 8 months ago

Hi Sambadi,

Thank you very much for your interest! Proposals will not be seen by the mentors before the application closing date.

This Thursday the Code4Earth coordination team will hold a Q&A webinar, which will also cover the preparation of proposals. You can register here: https://codeforearth.ecmwf.int/

Cheers, Christoph

yikuizh commented 8 months ago

Hi, we have some more questions about the model and the dataset:

1. Are the ML emulator outputs already available to use, or do we need to run the ML model ourselves before we can do the validation?
2. Is LAI already used in ECLand as well? If so, is it a climatological or a dynamic LAI?
3. What are the spatial and temporal resolutions of the ECLand model output used for the emulator?
4. What is the rationale for evaluating the ML model against observation data? In our opinion, it would make more sense to compare only the ECLand and emulator outputs, as the ML model is trained to emulate the ECLand model rather than real-world observations. We are not sure whether our understanding of this point is correct.

Thank you very much for your help! Kind Regards Yikui

da-ewanp commented 8 months ago

Hi Yikui!

Thanks for the questions 🙂 . In response:

1. ML emulator outputs will be ready to use, but the model will also be set up so that the candidate can perform additional runs as required during the project.
2. Yes, ECLand uses a climatological LAI, and the emulator is trained on the ECLand run with the climatological LAI values. The emulator is trained to account for the effect of LAI varying in time, so we can also run it with LAI that is not climatological.
3. We have trained the initial emulator at Tco399 (~30 km) spatial resolution with a 6-hour time step.
4. You make a very good point here, and the main aim is indeed to compare the emulator to the ECLand model output. Additionally comparing to observations allows us to judge whether the emulator is appropriate even where it doesn't exactly mimic the ECLand model. As we can also very simply re-run the emulator with time-varying LAI (or with other climatological variables tweaked), we can then compare back to observations to see what impact this might have. The current emulator also has scope to be "fine-tuned" towards observations, depending on progress in the project. That said, the main aim is a thorough comparison to the ECLand model output, which will be provided for the project; if the secondary aim of including observations cannot be met, this will still be sufficient.
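
With a 6-hour time step there are four samples per day, so one simple way to compare how the emulator and ECLand represent the diurnal cycle is to average each series by hour of day. The sketch below assumes timestamps are available as `datetime` objects; `mean_diurnal_cycle` and the sample values are hypothetical.

```python
from collections import defaultdict
from datetime import datetime, timedelta


def mean_diurnal_cycle(times, values):
    """Average values by hour of day; returns {hour: mean value}."""
    if len(times) != len(values):
        raise ValueError("times and values must have equal length")
    sums = defaultdict(float)
    counts = defaultdict(int)
    for t, v in zip(times, values):
        sums[t.hour] += v
        counts[t.hour] += 1
    return {hour: sums[hour] / counts[hour] for hour in sorted(sums)}


if __name__ == "__main__":
    # Two days of made-up 6-hourly skin temperature values (K).
    start = datetime(2022, 1, 1)
    times = [start + timedelta(hours=6 * i) for i in range(8)]
    emulated = [270.0, 272.0, 278.0, 274.0, 271.0, 273.0, 279.0, 275.0]
    reference = [270.5, 271.5, 279.0, 274.5, 270.5, 272.5, 280.0, 275.5]
    cycle_emulated = mean_diurnal_cycle(times, emulated)
    cycle_reference = mean_diurnal_cycle(times, reference)
    for hour in cycle_emulated:
        print(hour, cycle_emulated[hour] - cycle_reference[hour])
```

Note that the hours here are in whatever time zone the timestamps use; comparing sites across longitudes would need a conversion to local solar time first.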

I hope this helps and do let us know if you have any more questions! Thanks, Ewan

yikuizh commented 7 months ago

Dear mentors, we have a question about the length of the validation dataset. As shown on page 3 of the slides, it seems that the emulator was trained on data from 2018 to 2021, while only 2022 was used for testing (or validation). Does this mean that only the 2022 ECLand data are available to validate the emulator? Thank you very much for your help! Kind regards, Yikui

da-ewanp commented 7 months ago

Hi Yikui!

Good question! We have a dataset created from 2010-2023, so we can retrain a version of the emulator leaving more years for validation within this period.
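
As a rough illustration of holding out more years for validation within 2010-2023, a year-based split could look like the sketch below. The function name `split_by_year` and the choice of held-out years are assumptions, not the split the mentors will use.

```python
def split_by_year(years, validation_years):
    """Split a list of years into (training, validation) lists."""
    held_out = set(validation_years)
    unknown = held_out - set(years)
    if unknown:
        raise ValueError(f"validation years not in dataset: {sorted(unknown)}")
    training = [y for y in years if y not in held_out]
    validation = [y for y in years if y in held_out]
    return training, validation


if __name__ == "__main__":
    all_years = list(range(2010, 2024))  # 2010-2023 inclusive
    # Hypothetical choice: hold out the last three years.
    train_years, val_years = split_by_year(all_years, [2021, 2022, 2023])
    print("train:", train_years)
    print("validation:", val_years)
```

Splitting by whole years keeps validation data temporally separate from training data, which avoids leakage between adjacent time steps.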

Thanks 🙂 Ewan
