ECMWFCode4Earth / challenges_2022

Discover the ECMWF Summer of Weather Code 2022 challenges.

12 stars 3 forks source link

Challenge 31 - Flood forecasting: the power of citizen science #9

Open jwagemann opened 2 years ago

jwagemann commented 2 years ago

Challenge 31 - Flood forecasting: the power of citizen science

Stream 3 - Applied data science for weather, climate and atmosphere

Goal

Develop a Python package to facilitate the use of crowdsourced hydrological measurements for forecast validation

Mentors and skills

Mentors: Marie-Amelie Boucher, Cinzia Mazzetti, Florian Pappenberger, Jan Seibert, Juan Colonese
Skills required:
- Python
- Machine learning
- Image analysis
- Pattern recognition
- Basic geomatics

Challenge description

Why do we need a solution Floods are one of the biggest disasters killing countless numbers of people and destroying properties. Forecasting these killers is important to reduce such impacts. The key to improving these forecasts are observations and in particular new types of observation such as crowdsourced data that offer significant opportunities.

Recently, exciting initiatives such as CrowdWater have turned information from people into incredible rich scientific data. In the case of crowdsourcing, people send geo-referenced pictures of streams or rivers along with the corresponding variations of water level. Thousands of data have been gathered over the world like that, covering areas where no other observation is available. The challenge is to convert this precious data into something that can be used in flood forecasting models so that the information is not lost but used to improve the models to help save a life. This project is about solving two key challenges that stop CrowdWater information to be used in the CEMS GloFAS flood forecasting system:

1) locate the CrowdWater data points onto GloFAS rivers, which are a simplified representation of true rivers 2) Convert CrowdWater information into data consistent with GloFAS.

Data and software We plan to use CrowdWater virtual stations located on larger rivers and drainage networks from CEMS (EFAS and GloFAS). We plan to use OpenStreetMap to identify rivers and derive metadata. There is also a possibility of using synthetic data (i.e. designed to replicate data that could be obtained by CrowdWater in the future in addition to the data series which already exist.

What could be the solution We are looking for a solution that will 1) transform water level variations to a variable that can be used for verification of GloFAS and EFAS forecasts and 2) map CrowdWater virtual stations to GloFAS and EFAS points. This can be achieved through a variety of methods, for instance by mimicking the human mapping procedure, through the use of image analysis and/or pattern recognition techniques to match the real river to the representation of the model and then map the stations to the correct model pixels, also exploiting additional metadata such as the station name or the river name. Another possibility is to compute stations' upstream drainage area by using a digital elevation model (DEM) and geomatics tools in Python. The mapping of each station should ideally include a quality flag showing a confidence level in the mapping result.

Ideas for the implementation We envisage that implementation might include the following steps: Using a selection of CrowdWater stations for which there also exists an official river gauge, train a machine learning (ML) model to learn the relationship between water level variations, other explanatory variables, and streamflow. Then, use this ML model to translate water level variations into streamflow for all candidate CrowdWater stations (whether or not an official river gauge is also available)

Extract the river map for the area surrounding the station and the available metadata, such as rivers names from OpenStreetMap or any other open dataset. Another option is to compute the station’s upstream drainage area using a DEM and geomatics tools.

Map the station using coordinates and metadata (like the name of the river or the name of a nearby location).

sagarikajadon13 commented 2 years ago

Hi, I'm Sagarika, a BTech junior majoring in computer science and engineering. I am skilled in python, machine learning, deep learning and have worked on image classification problems before. I would love to contribute to this project. If I understand correctly, our main task is to predict the water level variations, streamflow and other attributes for all the stations in the CrowdWater data. For that, we plan to use the data (both images and metadata) already available from GloFAS or OpenStreetMap. I would like to know do we have to create our training data and train the model from scratch or build upon an existing model. If yes, could you provide an access to it so that I can better understand the problem. Thanks!

QueenMABhydro commented 2 years ago

Hi Sagarika, I'm very happy that you are interested by the challenge! The main task is not prediction. GloFAS already predicts streamflow. The main task is to see if we can used citizen science data (from the CrowdWater database) to verify GloFAS forecasts at places where there are no data available by traditional means (gauging stations). You can have a look at CrowdWater data here https://crowdwater.ch/en/data/. As for the association with GloFAS, you will at least need the upstream area file, from which the model's river network can be obtained. This upstream area file is available here https://confluence.ecmwf.int/display/COPSRV/GloFAS+mapping+locations+onto+the+river+network

I will check if I can provide any other information/tool.

Best,

Marie-Amélie

sagarikajadon13 commented 2 years ago

Hi Marie-Amélie, Thanks for your reply! If I now understand correctly, we're only supposed to map the CrowdWater data points onto the GloFAS river network (which already predicts the streamflow) using an ML model. I tried to take a look at the upstream are file available at https://cds.climate.copernicus.eu/cdsapp#!/dataset/cems-glofas-historical?tab=overview but was unable to do so since I'm not familiar with the geomatics software used. Would you recommend any related tools or reading material so that I can understand the problem better. Thank you!

QueenMABhydro commented 2 years ago

Hi! Yes, the first part is to map the CrowdWater data points into the GloFAS river network. This involves finding "relevant" CrowdWater points, since, as those are data taken by anyone, most CrowdWater points are for very small streams that don't exist in GloFAS. CrowdWater data points have all sorts of issues, for instance only one measurement (useless) or incorrect measurements. Those would have to be detected first, so there is a cleanup or sorting phase prior to the mapping. Then, once the mapping is done, the team would have to figure out how to use the data from CrowdWater to validate forecasts from GloFAS. GloFAS forecasts streamflow, but CrowdWater data is variations of water level relative to an arbitrary scale that changes between locations. The upstream area file is a netcdf file. You can open it in Python using the netcdf4 package. You can also open it directly in QGIS. In order to see the river network in QGIS, you will have to click on the layer (upstream area) in the bottom left layer panel window of QGIS. Then properties, then symbology, then in the render type (top left cornerish) singleband pseudocolor, then set the min and max values as 100000000 and 10000000000. Then set a colour scheme (choose one colour ramp) and apply.

Marie-Amélie

sagarikajadon13 commented 2 years ago

Hi Marie-Amélie, I was able to view both the CrowdWater data file and the upstream area file in QGIS. Like you mentioned, the CrowdWater data does seem to have a lot of NaN and incorrect values. I'll start with the EDA process now. However, I'm still not sure about how to work with the upstream area file. Would you give any pointers on that part? Thank you!

QueenMABhydro commented 2 years ago

Hi! Well, designing a matching method using the upstream area file is an integral part of the challenge. At this point, we are answering questions from potential participants, but the team is not selected yet. It will start this summer. I think it would be better to wait for the result of the selection before investing too much time in the challenge. Obviously, you are more than welcome to apply!

Marie-Amélie

sagarikajadon13 commented 2 years ago

Hi, Is there any previous project along the lines of this one that I can see? Thank you!

QueenMABhydro commented 2 years ago

Hi!

The codes and project descriptions from the previous years are all available on GitHub, here: https://github.com/esowc/ Unfortunately I'm not familiar with all the projects, so you would have to browse and see if there is something similar.

Marie-Amélie

sagarikajadon13 commented 2 years ago

Thank you so much for your assistance!

itsmohitanand commented 2 years ago

Dear Mentor @colonesej @QueenMABhydro

I just had a few questions?

For crowd water data, where I can find the descriptions of each of the columns?
Probably I am understanding it incorrectly, but need some help regarding how do GLoFAS networks look like? It is said that the major difficulty is matching the point? (What exactly does this mean?)

locate the CrowdWater data points onto GloFAS rivers, which are a simplified representation of true rivers

Convert CrowdWater information into data consistent with GloFAS.

From these sentences, it looks like we have already a river representation like a raster (saying where we have a river and where we don't) Can you share a little more in this regard?

QueenMABhydro commented 2 years ago

Hi, I have prepared a file containing the description of each column of the crowdwater data. It is here https://www.dropbox.com/s/vq1g7l1ao9j9rxd/Metadata.tsv?dl=0

It is a tab separated file (there are comas in the descriptions).

Regarding the matching: GloFAS has a relatively coarse representation of rivers. Only large rivers are represented. On the opposite, most Crowdwater stations (but not all) are for small streams. So first, some crowdwater points will not exist in GloFAS (and in EFAS). Also, forecasts (GloFAS or EFAS) are issued specific points (with lat lon coordinates), but represent a whole drainage area. The upstream area associated to each point is in the UpArea.nc file. For CrowdWater data, you will have the lat/lon coordinates of the point, but no upstream area. The crowdwater data lat/lon coordinates might be more or less accurate. It is usually not sufficient to do the matching only by comparing lat/lon points. You also have to make sure the drainage areas are the same. So... it might not be straightforward to do the spatial matching.

I hope this helps!

Marie-Amélie

itsmohitanand commented 2 years ago

Thanks @QueenMABhydro! This is really helpful, to follow up on this, would it be right to say the idea here is to match the variation of theWATER_LEVEL column in the CrowdWater data, to the preidctions from GLoFAS (or EFAS). We can potentially use other variables like FLOW_TYPE, SNOW_ICE_PRESENT (may be more/less) to do a first filtering of the data points we wish to consider.

Coming to the next part. I would like to summarise the problem here, let us suppose there are two points of forecasts for GLoFAS (or EFAS) - O1 and O2 belonging to two different catchments D1 and D2. We have a measurement at a point P (as shown in the image), such that distance PO2 (d2) is smaller than the distance PO1 (d1). Still P might belong to catchment D1, and hence point matching would not work.

The first part of the challenge would be to find the catchment (drainage area) to which point P belongs. We also have the raster of the UpArea.nc file, and the challenge is reduced to finding a point inside a polygon. This might be computationally expensive (depending upon the number of catchments in the UpArea.nc), but can be perfectly determined in most of the cases.

The next challenge is to relate the CrowdWater WATER_LEVEL data variation with the outlet water level variation. There might be one machine learning model doing this for all the points or one model for each catchment.

This is my understanding considering that the GLOFAS(or EFAS) is a perfect model, which we know is not the case.

GloFAS and EFAS forecasts and 2) map CrowdWater virtual stations to GloFAS and EFAS points. This can be achieved through >a variety of methods, for instance by mimicking the human mapping procedure, through the use of image analysis and/or >pattern recognition techniques to match the real river to the representation of the model and then map the stations to the >correct model pixels, also exploiting additional metadata such as the station name or the river name.

Based on the description of this part, we are trying to correct the model prediction. Suppose that the GLOFAS (or EFAS) prediction is not accurate (in space), then we correct the drainage area (catchment delineation) and find the right location for the prediction from GLoFAS or EFAS, and solve the above mentioned two problems. Is it the correct description of the challenge? Knowing this would really help me propose better solutions.

itsmohitanand commented 2 years ago

Thanks @QueenMABhydro! This is really helpful, to follow up on this, would it be right to say the idea here is to match the variation of theWATER_LEVEL column in the CrowdWater data, to the preidctions from GLoFAS (or EFAS). We can potentially use other variables like FLOW_TYPE, SNOW_ICE_PRESENT (may be more/less) to do a first filtering of the data points we wish to consider.

Coming to the next part. I would like to summarise the problem here, let us suppose there are two points of forecasts for GLoFAS (or EFAS) - O1 and O2 belonging to two different catchments D1 and D2. We have a measurement at a point P (as shown in the image), such that distance PO2 (d2) is smaller than the distance PO1 (d1). Still P might belong to catchment D1, and hence point matching would not work.

The first part of the challenge would be to find the catchment (drainage area) to which point P belongs. We also have the raster of the UpArea.nc file, and the challenge is reduced to finding a point inside a polygon. This might be computationally expensive (depending upon the number of catchments in the UpArea.nc), but can be perfectly determined in most of the cases.

The next challenge is to relate the CrowdWater WATER_LEVEL data variation with the outlet water level variation. There might be one machine learning model doing this for all the points or one model for each catchment.

This is my understanding considering that the GLOFAS(or EFAS) is a perfect model, which we know is not the case.

GloFAS and EFAS forecasts and 2) map CrowdWater virtual stations to GloFAS and EFAS points. This can be achieved through >a variety of methods, for instance by mimicking the human mapping procedure, through the use of image analysis and/or >pattern recognition techniques to match the real river to the representation of the model and then map the stations to the >correct model pixels, also exploiting additional metadata such as the station name or the river name.

Based on the description of this part, we are trying to correct the model prediction. Suppose that the GLOFAS (or EFAS) prediction is not accurate (in space), then we correct the drainage area (catchment delineation) and find the right location for the prediction from GLoFAS or EFAS, and solve the above mentioned two problems. Is it the correct description of the challenge? Knowing this would really help me propose better solutions.

I was just going through the dataset and I realise that the value of the area is given by UpArea.nc and not the spatial information about the catchment. This might make things tricky. Is it so?

QueenMABhydro commented 2 years ago

Hi,

I will try to address all your questions one by one: 1) "would it be right to say the idea here is to match the variation of theWATER_LEVEL column in the CrowdWater data, to the preidctions from GLoFAS (or EFAS). We can potentially use other variables like FLOW_TYPE, SNOW_ICE_PRESENT (may be more/less) to do a first filtering of the data points we wish to consider."

Yes, that is right

2) " (...) The first part of the challenge would be to find the catchment (drainage area) to which point P belongs. We also have the raster of the UpArea.nc file, and the challenge is reduced to finding a point inside a polygon. (...)"

Also correct.

3) "The next challenge is to relate the CrowdWater WATER_LEVEL data variation with the outlet water level variation. There might be one machine learning model doing this for all the points or one model for each catchment."

Yes, but I think it would probably be difficult to fit a ML model for every single point, mainly because there are not that many measurements for each point. It would probably be good to think about some grouping method, potentially by catchment, yes, but there might also be other possibilities.

4) "This is my understanding considering that the GLOFAS(or EFAS) is a perfect model, which we know is not the case."

They are indeed imperfect, but there are ways to potentially take that into account, at least to some extent, given that the forecasts are ensembles.

5) "Based on the description of this part, we are trying to correct the model prediction."

Correcting the forecasts (post-processing) would be a step further than what is proposed in the challenge, but it could be included in your proposal if you want.

6) "I realise that the value of the area is given by UpArea.nc and not the spatial information about the catchment. This might make things tricky. Is it so?"

Well, yes, it is definitely so. One additional information that could be helpful is the information from real official gauging stations, because the metadata for those stations usually includes the area of the catchment at the station. Then, if you can match the GloFAS area with the area for a station, this could be a good starting point to obtain more spatial info about the catchment. But of course, if a catchment is ungauged (appart from CrowdWater points), you won't have that information.

I hope this helps!

Marie-Amélie

itsmohitanand commented 2 years ago

@QueenMABhydro Yes, exactly this is what I needed. I have some interesting ideas to match the point with the catchment. It will be interesting to find the ways in which we can group data points. Need to think about that little more. Just started with the proposal. I will come back if I have more doubts.

itsmohitanand commented 2 years ago

@QueenMABhydro This is about water_level. Now these virtual scales can potentially mean different values in different streams. For example, a difference of 1 unit in the virtual scale, corresponding to the difference in the volume of water, would differ with locations. I think this is what is meant by

We are looking for a solution that will 1) transform water level variations to a variable that can be used > for verification

This is a hard problem, as we don't have the data about what a unit difference means in terms of hydrological variables. Can I have some suggestions on what kind of data we have to resolve this problem?

One of the suggested thing is to train a ml model on water level variation with flow values at gauge station, but then there is no reason for it to generalise, unless we provide the dimension of the stream.

I might be missing something, anything on this would be really helpful.

QueenMABhydro commented 2 years ago

Hi! The training will indeed probably be station-specific. One possibility would maybe be to use the data from the Crowdwater game and implement an image recognition scheme. You can see the game on the CrowdWater website, and there has been a thesis that briefly discusses this :(https://www.zora.uzh.ch/id/eprint/190608/1/2020_PhD_thesis_Barbara_Strobl.pdf)

It is indeed a difficult problem. It wouldn't be a challenge if it wasn't challenging ;) We can perhaps provide the dimension of the stream. For instance, it could probably be estimated from Lidar data.

Marie-Amélie Boucher