ESMValGroup / ESMValCore

ESMValCore: A community tool for pre-processing data from Earth system models in CMIP and running analysis scripts.
https://www.esmvaltool.org
Apache License 2.0

Adding DKRZ drs for `native6` to enable usage of vast local collection of ERA5 data (hourly, GRIB) #1991

Open remi-kazeroni opened 1 year ago

remi-kazeroni commented 1 year ago

From the reviewers of our IS-ENES3 deliverable D9.5:

Angelika has just (almost) completed retrieving the 1940–2022 hourly time series from the ECMWF MARS tape archive to Levante /pool/data/ERA5. This comprises:

• surface level analysis (49 parameters)
• surface level forecasts (55 parameters)
• model level analysis (16 parameters) (retrieval 1940–1958 ongoing)
• pressure level analysis (16 parameters) (retrieval 1940–1950 ongoing)

The data are not the 0.25° regridded ERA5 data that users can download from the Copernicus CDS, but the native-resolution ERA5 data (T639/N320) that can only be retrieved from MARS. The now roughly 1550 TB of data are stored directly on Levante's disk storage, globally accessible via /pool/data/ERA5/E5.

We could consider adding a drs to our config file to enable ESMValTool to access this vast collection of ERA5 data. This would be interesting for DKRZ users who would like to benefit from access to that data pool. It would also be an interesting test of how well GRIB files are handled by ESMValTool.

I'm not sure this could completely replace our own local collection of RAWOBS ERA5 data (downloaded from the CDS with era5cli). The reason is that such a retrieval of huge amounts of data cannot be done by our users on their own; they would need to continue relying on tools like era5cli or cdsapi to create their own local pools of ERA5 data on their own machines/clusters. Thus, it might be good to continue testing ESMValTool, as we do now, with the ERA5 data in our own RAWOBS to better reproduce what is done by the majority of our users.

This idea is similar to that of #1246 for JASMIN. See also the DKRZ docs on ECMWF reanalysis products available locally.

schlunma commented 1 year ago

I started looking into this some days ago. Here are some insights:

Reading the data

Reading the data with the cfgrib engine of xarray works out of the box, and conversion to iris cubes using DataArray.to_iris() also seems to work fine after some minor preprocessing. From the code and some tests, it looks like this neither realizes the data nor saves it to disk in any way (which is good!). However, to implement this in ESMValCore, we need to expand fix_file (https://github.com/ESMValGroup/ESMValCore/issues/2129).
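
For reference, a minimal sketch of that reading path, assuming cfgrib is installed; the file path and variable name below are placeholders, not the actual DKRZ layout, and the minor preprocessing mentioned above is omitted:

```python
import xarray as xr

# Open a GRIB file with xarray's cfgrib engine (requires the cfgrib package).
# The data stay lazy: nothing is read into memory at this point.
ds = xr.open_dataset(
    "/pool/data/ERA5/E5/t2m_example.grb",  # hypothetical path, not the real DRS
    engine="cfgrib",
)

# Convert a single variable to an iris cube; the result is still lazy.
cube = ds["t2m"].to_iris()
print(cube)
```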

Grid

The raw data are stored on a reduced Gaussian grid (N320), which uses a different number of longitudes for different latitudes. Thus, the data are not stored like a regular grid (time, latitude, longitude), but rather like an unstructured grid (time, spatial_dimension). For example, after conversion to netcdf, the files look like this:

```
netcdf tas {
dimensions:
        time = 24 ;
        values = 542080 ;
variables:
        int64 time(time) ;
                time:long_name = "initial time of forecast" ;
                time:standard_name = "forecast_reference_time" ;
                time:units = "seconds since 1970-01-01" ;
                time:calendar = "proleptic_gregorian" ;
        double latitude(values) ;
                latitude:_FillValue = NaN ;
                latitude:units = "degrees_north" ;
                latitude:standard_name = "latitude" ;
                latitude:long_name = "latitude" ;
        double longitude(values) ;
                longitude:_FillValue = NaN ;
                longitude:units = "degrees_east" ;
                longitude:standard_name = "longitude" ;
                longitude:long_name = "longitude" ;
        float t2m(time, values) ;
                t2m:_FillValue = NaNf ;
                t2m:GRIB_paramId = 167LL ;
                ...
        ...
}
```

You can see that the actual variable (t2m) depends on just two dimensions, time and values, where values encodes the spatial grid. The question now is: how do we deal with this? I can think of the following options:

  1. We can pass the data as is to the preprocessing chain and let the user deal with regridding (however, in this case only the unstructured_nearest scheme can be used, which works fine [I tested it] but might be inaccurate); see the sketch after this list.
  2. We can convert the data to UGRID and let the user deal with regridding (in this case, the iris-esmf-regrid library can be used). However, this might be tricky since no bounds for the grid cells (= nodes in UGRID jargon) are given, and these are absolutely necessary for the UGRID conversion.
  3. We perform the regridding (see options 1. or 2.) in the fix and pass the data on a regular grid to the preprocessing chain.
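
As a rough sketch of option 1, this is what a user-side call could look like, assuming ESMValCore's regrid preprocessor with the unstructured_nearest scheme available at the time; the toy cube below just stands in for whatever the fix would return:

```python
import numpy as np
from iris.coords import AuxCoord
from iris.cube import Cube
from esmvalcore.preprocessor import regrid

# Toy unstructured cube: 10 spatial points on one dimension, with latitude
# and longitude attached as auxiliary coordinates (as in the CDL dump above).
npoints = 10
lat = AuxCoord(np.linspace(-80, 80, npoints), standard_name="latitude", units="degrees")
lon = AuxCoord(np.linspace(0, 350, npoints), standard_name="longitude", units="degrees")
cube = Cube(
    np.arange(npoints, dtype=np.float32),
    standard_name="air_temperature",
    units="K",
    aux_coords_and_dims=[(lat, 0), (lon, 0)],
)

# Nearest-neighbour regridding to a regular 1°x1° grid; the only scheme
# that accepts unstructured input.
regridded = regrid(cube, target_grid="1x1", scheme="unstructured_nearest")
```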

@ESMValGroup/esmvaltool-coreteam does anyone have experience with regridding data on a reduced Gaussian grid with Python? Any insights/help is much appreciated. Thank you!

schlunma commented 1 year ago

One point I forgot: contrary to the ERA5 documentation, as far as I can tell, all 3D variables on pressure levels are also saved on the reduced Gaussian grid (N320), not as T639 spherical harmonics. For example, temperature is listed as being on the T639 native grid, but the data on Levante are on the N320 grid. I have no idea if this is a service of DKRZ or an error in the ERA5 documentation.

In contrast, some variables on model levels are in fact reported on the T639 grid.

I think we are mainly interested in data on pressure levels, so we do not have to deal with spherical harmonics (for now).
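
For what it's worth, a quick way to check this per file is to look at the GRIB keys that cfgrib attaches as attributes (the file name below is a placeholder; I believe reduced Gaussian data show up as "reduced_gg" and spectral data as "sh"):

```python
import xarray as xr

# cfgrib exposes GRIB keys as "GRIB_*" attributes on each variable.
ds = xr.open_dataset("E5pl00_1H_2000-01_130.grb", engine="cfgrib")  # hypothetical file
print(ds["t"].attrs.get("GRIB_gridType"))  # "reduced_gg" or "sh"
```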

larsbuntemeyer commented 1 year ago

Just stumbled across this by coincidence! I work a lot with the DKRZ ERA5 data pool, and I follow the ECMWF recommendations, e.g., for regridding.

> does anyone have experience with regridding data on a reduced Gaussian grid with Python?

I would also be interested in that, e.g., in a kind of lazy method to do it. It's probably also worth mentioning ERA5 on Google Cloud; I got valuable insights from their walkthrough (which includes regridding with scipy)...
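
In the spirit of that walkthrough, here is a minimal nearest-neighbour sketch with scipy (synthetic placeholder arrays stand in for the real N320 coordinates and data; the walkthrough itself may use a different scipy method):

```python
import numpy as np
from scipy.spatial import cKDTree

# Placeholder source grid: random points standing in for the ~542080
# (lat, lon) pairs of the N320 reduced Gaussian grid.
rng = np.random.default_rng(0)
lats = rng.uniform(-90.0, 90.0, 542080)
lons = rng.uniform(0.0, 360.0, 542080)
data = rng.standard_normal((24, 542080))  # (time, values), as in the dump above

def to_xyz(lat, lon):
    """Lat/lon in degrees -> 3D unit vectors (avoids dateline/pole issues)."""
    lat, lon = np.deg2rad(lat), np.deg2rad(lon)
    return np.column_stack(
        (np.cos(lat) * np.cos(lon), np.cos(lat) * np.sin(lon), np.sin(lat))
    )

# KD-tree on the source points, queried with a regular 1°x1° target grid.
tree = cKDTree(to_xyz(lats, lons))
tgt_lat, tgt_lon = np.meshgrid(
    np.arange(-89.5, 90.0, 1.0), np.arange(0.5, 360.0, 1.0), indexing="ij"
)
_, idx = tree.query(to_xyz(tgt_lat.ravel(), tgt_lon.ravel()))

# Nearest-neighbour "regridding": pick the closest source value per target cell.
regular = data[:, idx].reshape(data.shape[0], *tgt_lat.shape)
print(regular.shape)  # (24, 180, 360)
```

This is eager rather than lazy, but the index array idx only needs to be computed once and can then be reused for every time step or file on the same grid.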

larsbuntemeyer commented 1 year ago

Probably also worth mentioning:

schlunma commented 1 year ago

Thanks @larsbuntemeyer for these links, they look super interesting! I will look into that!

bouweandela commented 1 year ago

> We can pass the data as is to the preprocessing chain and let the user deal with regridding

This would be my recommendation: some users may not want automatic regridding.