Introducing a new input dataset

matteodefelice commented 5 years ago

I would like to use atlite with the EFAS data recently released on the Copernicus Data Store. Do you have a tutorial or some docs on how to "add" another input dataset?

euronion commented 5 years ago

No. As far as I know there is no tutorial for setting up new data sources at the moment.

The ERA-5 data is also downloaded from the CDS, see datasets/era5.py. You should be able to use it as a starting point. I expect the structure for the request and postprocessing to be similar.

Important side note: @coroa Has been working on a new data backend. This will also affect the way the data importers work. You might want to check with him first to see what will change in the upcoming version.

coroa commented 5 years ago

Hi Matteo,

I did not miss the email you sent on Monday. I'm glad you're serious about getting things done.

Don't worry about the refactoring work happening in branch v0.2. I tried to push a bit further on Monday, but I am unsure on when this is going to land and any work happening now can easily be ported so let's stick with what there is now.

Since EFAS is on the CDS just like ERA-5, I think the easiest would be to copy era5.py in datasets to efas.py and add it to the imports in datasets/init.py, which is why I just now pushed a skeleton of a new dataset for EFAS to a separate branch on github.

https://github.com/FRESNA/atlite/tree/efas

But now to the trickier general question you are only asking implicitly:

How do we generate inflow time-series for powerplants based on the EFAS dataset?

Atlite implements currently functions runoff and hydro in convert.py for aggregating surface runoff data for regions (runoff) or for powerplants by using the topology of shapes defined by the hydrobasins dataset (hydro).

EFAS as far as I understand from its documentation does not provide a corrected surface runoff we might want to aggregate. Instead it already includes all the basin drainage, river drainage and river routing internally and provides the resulting river discharge per (5km)^2 grid cell.

So, technically, everything you need to do to use this data is to locate the powerplants of interest on the EFAS grid and then return the time-series for this grid cell. I am unsure of what the benefit of using atlite is in this case, as the whole process boils down to some

ds['river_discharge'].sel(x=, y=, method='nearest')

call for each powerplant, or a joint sel_points call.

The only tricky bit I can see is alluded to in the documentation, albeit incomplete:

In EFAS, river discharge is calculated in each hydrological model grid box. However, not all grid boxes correspond to a river segments, and because of the relatively coarse resolution of the EFAS gridded domain, the modelled river network in EFAS might be associated with slightly different latitude-longitude than in the real world. It is hence essential that when extracting river discharge data, the user also checks the upstream area of the grid box of interest. Accordingly, EFAS gridded upstream area information is provided as (TBA)

The way to do this is explained in the how to read EFAS data pages.

Unfortunately the linked EFAS data pages are still a stub, but it might make sense to enter into discussion with them and include this correction methodology into atlite to enable a call to

cutout.hydro(powerplants)

on a cutout providing river_discharge to return the correct per powerplant inflow.

(Edit: Damn, replies by email don't support Markdown apparantly)

coroa commented 5 years ago

I put a bit more thought into locating the correct grid cells. Let me illustrate the problem and what I assume the solution is (and I think it would be great to integrate that into atlite).

Consider the following plot of the river Saale in Thuringia and its basin, which flows into the Elbe (not shown): GWK56_meanfriv Red lines are the river itself (solid is the river network of Saale up to level 3, dashed its tributaries on level 4, where levels are taken from the German Gewässerkennzahl system). In Orange, I added the delineations of the catchment basins as provided by the Hydrobasins dataset (in level 7, i think, but would have to check). In the background I show the mean river flow from a dataset produced by the HD-model maintained by the Helmholtz Zentrum Geesthacht, which is similar to LISFLOOD, albeit simpler. The colored dots show GRDC stations for which measured daily river flow data can be requested from the BfG (a German institution).

The problem is now that for example for the GRDC station marked in orange the closest river flow grid cell holds the river flow through the main river stem coming from the South, while the station only sits at a tiny tributary entering from the right. So as a correction, in this case one would have to take the river flow immediately to the East.

From the EFAS quote above and a discussion I had with HZG earlier, I gather that the preferred solution to this conundrum is to publish the catchment basin area for each grid cell alongside the grid.

To locate the best grid cell for a particular latitude, longitude position, you then compute the total area of the upstream catchment basins (based on hydrobasins for example) and find the grid cell close to your location that best represents the upstream area.

matteodefelice commented 5 years ago

To locate the best grid cell for a particular latitude, longitude position, you then compute the total area of the upstream catchment basins (based on hydrobasins for example) and find the grid cell close to your location that best represents the upstream area.

I am not sure to have understood correctly (I am not an expert for hydrological data). What does it mean for a grid cell to represent the upstream area?

coroa commented 5 years ago

Neither am I, but what I mean is:

Look through all neighbouring grid cells (probably only direct neighbours) and use the one where the hydrobasins estimated upstream area is similar to the one provided by the EFAS grid.

coroa commented 5 years ago

Estimation of upstream area using hydrobasins would work something like:

gpd.sjoin(powerplants, hydrobasins[['UP_AREA', 'geometry']], how='left', op='within').rename(columns={'UP_AREA': 'estim_area'})

A test with the reported upstream areas for the German stations in the GRDC stations catalogue finds a relatively good agreement for rivers with high flow volumes:

import os
import numpy as np
import pandas as pd
import geopandas as gpd
from shapely.geometry import Point

HYDRO_BASINS_DATA_PATH = os.path.join(
    'hybas_eu_lev01-12_v1c',
    'hybas_eu_lev12_v1c.shp'
)

GRDC_PATH = 'grdc_stations.xlsx'

hydrobasins = gpd.read_file(HYDRO_BASINS_DATA_PATH)

grdc = pd.read_excel(GRDC_PATH, na_values=[-999, 'n.a.'])
grdc_de = grdc.loc[grdc.country == 'DE']

grdc_de = gpd.GeoDataFrame(grdc_de, geometry=grdc_de[['long', 'lat']].apply(Point, axis=1), crs=dict(init='epsg:4326'))
grdc_de =  gpd.sjoin(grdc_de, hydrobasins[['UP_AREA', 'geometry']], how='left', op='within').rename(columns={'UP_AREA': 'estim_area'})

pd.DataFrame(grdc_de).plot.scatter(x='area', y='estim_area', s=grdc_de.r_volume_yr)
plt.xscale('log')
plt.yscale('log')

grdc_estim_upstream_area

coroa commented 5 years ago

Anyway, since the link to upArea.nc the EFAS available data page is still dead, it makes sense to reach out to them, anyway. Would you mind doing that, Matteo?

matteodefelice commented 5 years ago

Sure! I will let you know once I get a reply.

coroa commented 5 years ago

It seems, your email was received, since the wiki page has just now been updated with a note, that the upstream area is provided as an extra variable in the netcdf files:

In [5]: ds.upArea                                                                                                      
Out[5]: 
<xarray.DataArray 'upArea' (y: 950, x: 1000)>
[950000 values with dtype=float32]
Coordinates:
  * y           (y) float64 5.498e+06 5.492e+06 ... 7.575e+05 7.525e+05
  * x           (x) float64 2.502e+06 2.508e+06 ... 7.492e+06 7.498e+06
    time        datetime64[ns] ...
    step        timedelta64[ns] ...
    surface     int64 ...
    latitude    (y, x) float32 ...
    longitude   (y, x) float32 ...
    valid_time  datetime64[ns] ...
Attributes:
    grid_mapping:    lambert_azimuthal_equal_area
    long_name:       upstream_area
    standard_name:   upstream
    units:           M2
    esri_pe_string:  PROJCS["ETRS_1989_LAEA",GEOGCS["GCS_ETRS_1989",DATUM["D_...
    cell_methods:    time: mean

PyPSA / atlite

Introducing a new input dataset #18