RubenRT7 commented 4 months ago

Challenge 20 - Bridge the Gap: Bridging Gaps in Streamflow Observations with ML-driven Solutions

Stream 2 - Machine Learning for Earth Sciences applications

Goal

Develop machine learning solutions to bridge gaps in streamflow observations, enhancing the accuracy and reliability of hydrological data analysis and forecasting.

Mentors and skills

Mentors: Maliko Tanguy, Gwyneth Matthews, Mariana Clare, Cinzia Mazzetti (all ECMWF)
Skills required:
- Essential:
  - Python (numpy, pandas, xarray, ...)
  - Machine learning (scikit-learn, PyTorch/TensorFlow)
  - Visualisation (mapping, graphs)
- Desirable:
  - Time series analysis
  - Open-source collaboration (Git)
- Advantageous:
  - Ability to create clear documentation / communication
  - Basic hydrology understanding

Challenge description

Introduction

Operational flood forecasting systems like EFAS and GloFAS, part of the Copernicus Emergency Management Service (CEMS), play a pivotal role in providing advance warnings for devastating flood events, which significantly impact societies worldwide. These systems must be reliable and accurate, making the assessment of forecast skill a critical aspect in gauging their trustworthiness and utility. A major limitation in calibrating and evaluating these forecasting systems is the scarcity, quality, and incompleteness of observational data, particularly in areas where flood impacts are most severe. In addition, the calculation of some forecasting skill scores such as the Continuous Ranked Probability Skill Score (CRPSS) necessitates continuous time series, posing a challenge when data is unavailable or incomplete. Extending the time series also allows for the provision of reference or climatology values against which to compare forecasts, enhancing the robustness of the evaluation process. Building upon existing literature (e.g. [1,2,3]), various ML methods, such as Random Forests and LSTM models, have shown promise in gap-filling river flow data. However, a comprehensive understanding of their strengths and limitations is essential for informed implementation.

Project objectives

The primary objective is to explore different approaches to gap-fill observed daily streamflow time series, comparing their performance and determining the maximum length of gap that can be reliably filled. The project aims to implement these methods in an open-source Python package, providing a user-friendly solution for filling gaps in observational datasets.
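As a minimal sketch of how a "maximum reliable gap length" could be enforced in such a package, the function below fills only interior gaps of at most `max_gap` days and leaves longer gaps missing. Linear interpolation is used purely as a stand-in for whichever method is eventually chosen; `fill_short_gaps` and `max_gap` are hypothetical names, not part of the challenge materials.

```python
import numpy as np
import pandas as pd


def fill_short_gaps(flow: pd.Series, max_gap: int) -> pd.Series:
    """Linearly interpolate interior gaps of at most `max_gap` consecutive days.

    Longer gaps, and gaps touching the start or end of the record, stay NaN.
    """
    missing = flow.isna()
    # Label each run of consecutive identical mask values, then measure its length.
    run_id = (missing != missing.shift()).cumsum()
    run_length = missing.groupby(run_id).transform("size")
    fillable = missing & (run_length <= max_gap)

    # limit_area="inside" prevents extrapolation beyond the first/last observation.
    interpolated = flow.interpolate(method="linear", limit_area="inside")
    # Keep the original values everywhere except in the short gaps.
    return flow.where(~fillable, interpolated)


# Synthetic example series (illustrative values only).
days = pd.date_range("2000-01-01", periods=10, freq="D")
flow = pd.Series(
    [1.0, np.nan, 3.0, np.nan, np.nan, np.nan, 7.0, 8.0, np.nan, np.nan],
    index=days,
)
filled = fill_short_gaps(flow, max_gap=2)
filled_longer = fill_short_gaps(flow, max_gap=3)
```

Measuring the run length explicitly, rather than relying on the `limit` argument of `Series.interpolate`, avoids partially filling the first few days of a gap that is longer than the allowed maximum.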

Methodology

Data Collection: Observed river flow data from GRDC and catchment average precipitation data from ERA5 will be provided for a subset of river gauging stations used in GloFAS and EFAS. The inclusion of remote sensing water level data could also be considered, with a focus on addressing associated challenges (e.g. data accuracy, resolution, temporal and spatial coverage).
Selection of methods: Based on a brief review of existing literature on the topic, the team will select a few different statistical and ML methods to be implemented and compared. Proposals should focus on head catchments but ideas of how to manage nested catchments are also encouraged.
Coding and Implementation: Open-source software, predominantly Python, will be used for the implementation of different gap-filling methods. The coding phase will be organised into milestones to ensure a systematic and timely execution of the project.
Evaluation and Comparison: A comprehensive evaluation will be conducted, comparing different methods based on general performance and considering the size of the data gap. The team will develop strategies for assessing performance variations with an increase in gap size, providing valuable insights into the reliability of each method.
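One way the evaluation step could be set up is sketched below: artificial gaps of a given length are removed from a complete stretch of record, a gap-filling function is applied, and the fill is scored only on the values that were hidden. The Nash-Sutcliffe efficiency (NSE) and linear interpolation are used here as assumed examples of a metric and a baseline; the function names, the number of trials, and the synthetic series are all hypothetical.

```python
import numpy as np
import pandas as pd


def nse(observed: np.ndarray, simulated: np.ndarray) -> float:
    """Nash-Sutcliffe efficiency: 1 is perfect, 0 matches the observed mean."""
    residual = np.sum((observed - simulated) ** 2)
    variance = np.sum((observed - observed.mean()) ** 2)
    return 1.0 - residual / variance


def score_by_gap_length(flow, fill, gap_lengths, n_trials=20, seed=0):
    """Hide artificial gaps in a complete series and score the fill on hidden values."""
    rng = np.random.default_rng(seed)
    rows = []
    for gap in gap_lengths:
        hidden_obs, hidden_fill = [], []
        for _ in range(n_trials):
            # Keep at least one observation on each side of the artificial gap.
            start = rng.integers(1, len(flow) - gap)
            masked = flow.copy()
            masked.iloc[start:start + gap] = np.nan
            filled = fill(masked)
            hidden_obs.append(flow.iloc[start:start + gap].to_numpy())
            hidden_fill.append(filled.iloc[start:start + gap].to_numpy())
        score = nse(np.concatenate(hidden_obs), np.concatenate(hidden_fill))
        rows.append({"gap_days": gap, "nse": score})
    return pd.DataFrame(rows)


# Synthetic, noise-free series used only to exercise the code (not real discharge).
days = pd.date_range("2000-01-01", periods=730, freq="D")
t = np.arange(730)
synthetic = pd.Series(
    10 + 5 * np.sin(2 * np.pi * t / 365) + 2 * np.sin(2 * np.pi * t / 30),
    index=days,
)
table = score_by_gap_length(
    synthetic, lambda s: s.interpolate(method="linear"), gap_lengths=[1, 5, 30]
)
```

Running the same harness with each candidate method in place of the interpolation lambda would give comparable scores per gap length; other metrics could be swapped in for `nse` in the same way.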

Expected outcome

The project’s final outcome will be well-documented, user-friendly Python code available on GitHub, featuring one or several gap-filling options. Accompanying this code will be information on method performance, including the maximum reliable gap size and a degradation table detailing performance with increasing gap size, which will help users to select the best method for their data.

Stretch goals (optional)

Ready for an extra challenge? For those eager to push their limits, we offer optional stretch goals:

Extend the application of these methods to temporally disaggregate time series to refine data resolution (e.g. from monthly to daily river flow data).
Evaluate the use of these methods to extend time series, beyond gap-filling.

References

[1] Arriagada et al. (2021)
[2] Dariane & Borhan (2024)
[3] Ren et al. (2022)

RonT23 commented 3 months ago

Hello, we are interested in this challenge and have a few questions:

1. Will the dataset be provided upon (or if) acceptance of the proposal, or should we reference the dataset in the proposal?
2. Our proposal addresses the problem in a more abstract and general manner; do we need to provide more specific details?
3. Could you please provide an example dataset for us to study the features? We found something related to the problem at hand, but we are not sure that it is the right one. Thank you in advance!

ecMaliko commented 3 months ago

Hi @RonT23 , Thank you for your interest and questions about our challenge! Regarding questions 1 and 3: I'm checking with my fellow mentors about this and will let you know ASAP. Regarding question 2: We’re looking for proposals with clear steps and milestones rather than abstract solutions. A more detailed plan will help ensure a tangible outcome by the end of the challenge. We are happy with some flexibility, as there will inevitably be some unexpected issues along the way, and some trial and error with some methodological issues. But within that flexibility, the more specifics you can provide, the better.

ecMaliko commented 3 months ago

Hi @RonT23, I can now confirm that we will be able to provide you with daily river flow observations at GRDC sites, and catchment-averaged ERA5 precipitation for these sites. I haven't checked the data, but I think we have a few thousand sites across the world, with variable lengths of record. We will prepare the data before the start of the challenge. We don't have any remote sensing data, though. Therefore, if you are planning to use such data in your project, you would need to source it yourselves.

I could prepare some sample data for a few of these sites for you to explore, but it would take me a couple of days. Could you please confirm you would like me to prepare the sample data for you?

Best wishes,

Maliko

daniel-obrien commented 3 months ago

Hi Maliko,

It would be very helpful for us if you could prepare that data.

Thank you, Daniel

RonT23 commented 3 months ago

Hi @ecMaliko, We would appreciate it if you could provide us with a sample dataset! Thank you, Ronaldo T.

ecMaliko commented 3 months ago

Hi @RonT23 and @daniel-obrien, I have attached here some sample data (100 stations) for you to explore the format and type of data that will be provided. The full dataset will have a few thousand stations. There is one NetCDF file with observed discharge data, and another one with catchment-averaged precipitation data from ERA5. In addition, I have also included a CSV file with some additional metadata. The 'statid' variable in the NetCDF files corresponds to the 'station_id_num' column in the metadata file. Please note that the reference date in the precipitation file is different from the one in the discharge file! Let me know if you have any questions. Regards, Maliko

sample_data_code4earth.zip
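The note about differing reference dates matters when the two files are aligned: the same integer time offset means a different calendar day in each file. The sketch below illustrates this with small hand-made tables instead of the real files. The reference dates, offsets, values, and the column names `time`, `discharge`, `precip`, and `river` are all assumptions for illustration; only `statid` and `station_id_num` come from the description above. When the files are opened with xarray, time decoding from the `units` attribute normally handles this step, but it is worth checking.

```python
import numpy as np
import pandas as pd

# Hypothetical reference dates and offsets: read the real ones from the time
# variable's "units" attribute in each file rather than assuming these values.
discharge_time = pd.to_datetime(
    [3660, 3661, 3662], unit="D", origin=pd.Timestamp("1960-01-01")
)
precip_time = pd.to_datetime(
    [7, 8, 9, 10], unit="D", origin=pd.Timestamp("1970-01-01")
)

# Illustrative values only, not taken from the sample data.
discharge = pd.DataFrame(
    {"statid": 101, "time": discharge_time, "discharge": [12.0, np.nan, 11.5]}
)
precip = pd.DataFrame(
    {"statid": 101, "time": precip_time, "precip": [0.0, 4.2, 1.1, 0.0]}
)
metadata = pd.DataFrame({"station_id_num": [101], "river": ["example river"]})

# Align on decoded calendar dates, then attach metadata via the station identifier.
merged = (
    discharge.merge(precip, on=["statid", "time"], how="outer")
    .merge(metadata, left_on="statid", right_on="station_id_num", how="left")
    .sort_values("time")
    .reset_index(drop=True)
)
```

The outer join keeps days that exist in only one of the two sources, so missing discharge values remain visible as gaps rather than being silently dropped.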

RonT23 commented 3 months ago

Thank you, that is really helpful! R. T.

KonstantinosPl commented 3 months ago

Dear @ecMaliko,

1) Are there going to be any gauge stations that belong to the same catchment (water basin)?
2) Is the average ERA5 precipitation derived from the same catchment?
3) Can we use the distributed ERA5 precipitation data?
4) Will we have the distinction between rainfall, snow, and hail?
5) Can we add more input data that affect the precipitation-streamflow relationship in our models?

Thanks in advance.

K. P.

ecMaliko commented 3 months ago

Dear @KonstantinosPl ,

Thank you for your interest in this challenge!

These are the answers to your questions:

1. Good point: some catchments will be nested (smaller catchments being sub-catchments of bigger catchments). We will flag which catchments are nested; this will be prepared before the start of the challenge. We will also provide a shapefile of catchments.
2. The average ERA5 precipitation provided is an average over each of the individual catchments.
3. You can use the distributed ERA5 precipitation data, but you will have to source the data yourselves if you decide to use it. This might increase your data preparation time substantially.
4. We don’t have information on the precipitation type (rainfall, snow, hail). This information is available in ERA5, but again, keep in mind the additional data preparation time that this would add.
5. You can add any input that you think is relevant, as long as the data is openly available.

While all the data you mention would surely help improve the final product, don’t forget that the challenge lasts only 4 months. Therefore, make sure your proposed work is realistic within that timeframe.

Let me know if you have further questions!

Maliko

danghieutrung commented 3 months ago

Hi @ecMaliko,

I have some questions following this discussion:

1. I have reviewed the data files you attached and noticed that the sample you sent contained around 4-5 days of data (9 - 13 Jan 1970). Could you give us an approximation of the time range of the actual data for the project? I speculate the whole dataset would cover roughly 40-50 years, from 1970 to around 2010 or 2020.
2. Does one model have to apply to all stations, or could stations in different geographical areas use different models? For example, we may implement an LSTM model for each continent (Europe, Asia, ...), where all LSTM models have the same architecture (same configuration with an equal number of parameters) but different weights.
3. Do we have access to any GPU server during the project?

Thank you! Hieu

ecMaliko commented 3 months ago

Hi @danghieutrung ,

My colleagues are on Easter break, so I will reply to the best of my knowledge, and I will get back to you with updated information as soon as I hear from them.

1. I apologise that the sample data that I had shared only had 5-7 days of observed data. I hadn’t checked the amount of data it included (maybe I should have), as it was mainly to share the format and type of data that would be provided. We have a mixture of sites with quite complete records, and others with very little data. I am afraid I can’t give you an accurate estimate of the exact amount of observed data that we hold at this moment. The paper from Chevuturi et al. (2023) can give you an idea of the amount of data available in the GRDC dataset (a subset of 119 stations), if you look at their figure S1 in supplementary information (the white parts are all missing data on this plot): https://ars.els-cdn.com/content/image/1-s2.0-S0022169423005498-mmc1.pdf
2. It doesn’t necessarily need to be one single model for all stations. If you think different models for different continents (or other subsets of stations) would work better, you are welcome to propose this in your project. It is also possible that some continents won’t have enough data to train the model (we know there is more data in Europe and the US than elsewhere), so you might only be able to build models for the regions of the world with denser data.
3. I think you would have access to GPUs, but I am not 100% sure. Let me come back on this point once I manage to talk to my colleagues.
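The "same architecture, different weights" idea from the question can be sketched without committing to a particular model. The example below deliberately swaps in a simple lagged linear regression fitted with NumPy as a stand-in for an LSTM: every region gets the same model structure (precipitation at the same lags plus an intercept), but its own fitted coefficients. The region names, the number of lags, and the synthetic data are hypothetical.

```python
import numpy as np


def lagged_features(precip: np.ndarray, n_lags: int) -> np.ndarray:
    """Stack precipitation at lags 0..n_lags-1 plus an intercept column."""
    columns = [
        precip[n_lags - 1 - lag: len(precip) - lag] for lag in range(n_lags)
    ]
    return np.column_stack(columns + [np.ones(len(precip) - n_lags + 1)])


def fit_per_region(data_by_region, n_lags=3):
    """Fit one model per region: identical structure, separate coefficients."""
    weights = {}
    for region, (precip, flow) in data_by_region.items():
        X = lagged_features(precip, n_lags)
        # The first n_lags - 1 days have no complete lag history, so drop them.
        y = flow[n_lags - 1:]
        weights[region], *_ = np.linalg.lstsq(X, y, rcond=None)
    return weights


# Synthetic, noise-free data for two hypothetical regions.
rng = np.random.default_rng(0)
true_weights = {
    "region_a": np.array([0.5, 0.3, 0.1, 2.0]),
    "region_b": np.array([0.2, 0.2, 0.2, 5.0]),
}
data = {}
for region, coeffs in true_weights.items():
    precip = rng.gamma(shape=2.0, scale=1.5, size=200)
    flow = np.concatenate(
        [np.full(2, np.nan), lagged_features(precip, 3) @ coeffs]
    )
    data[region] = (precip, flow)

weights = fit_per_region(data)
```

With real records, days with missing discharge would need to be removed from `X` and `y` before fitting, and regions with too few observations might have to be skipped, as noted above.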

Maliko

BargavReddyM commented 3 months ago

Hello, we are interested in this challenge, and I have a question:

1. Are Indian nationals eligible to apply for this? (I am from India.)
2. If so, is it mandatory to work specifically on a study area in Europe, or can any study area in the world be chosen?

ecMaliko commented 3 months ago

Dear @BargavReddyM ,

Thank you for your interest in this challenge. Unfortunately, for this year’s Code4Earth challenges, the call is only open to candidates who are citizens of ECMWF Member States and Co-operating States. You can find the list here: https://www.ecmwf.int/en/about/who-we-are/member-states We wish we could be more open, but this is restricted by the conditions set by our funders. Regarding the second question: we are more interested in the methodology developed than in the specific area used to develop the method. Therefore, it doesn’t necessarily need to be based in Europe. However, Europe is one of the most data-rich areas (in terms of river flow), and therefore it could be a good starting point.

Kind regards,

Maliko

BargavReddyM commented 3 months ago

Thank you for the reply

trakasa commented 3 months ago

Hi @danghieutrung, hi @ecMaliko

> I think you would have access to GPUs, but I am not 100% sure. Let me come back on this point once I manage to talk to my colleagues.

AT: Yes, that is correct. Thanks @ecMaliko for answering! If the selected proposals need access to computing resources you can access the European Weather Cloud or WEkEO.

Bye, Athina

trakasa commented 3 months ago

@BargavReddyM @ecMaliko

> Thank you for the reply

AT: Indeed, as funding comes from different (European) sources, we have to follow certain rules for eligibility. You have to be a citizen or resident of an ECMWF Member State or Co-operating State or EU Member State, or from a country associated with the EU’s Space Programme (currently Iceland, Norway and United Kingdom) or a country associated with the EU’s Digital Europe Programme (currently Albania, Iceland, Liechtenstein, Montenegro, North Macedonia, Norway, Serbia and Türkiye).

For more details please check the Code for Earth Terms & Conditions (mainly Article 3).

Thanks @ecMaliko for getting back to Bargav!

Bye, Athina

wsyip85 commented 2 months ago

Hello. I could not submit my proposal because the link to the submission form returned a "refused to connect" error. May I have some help, please? Here is the link from the website: https://codeforearth.commpla.com/ecmwf-code-for-earth-2024-submission-form

wsyip85 commented 2 months ago

Thank you, the link is now okay.

ecMaliko commented 2 months ago

Hi @wsyip85 I am glad the problem is now solved. Kind regards, Maliko
