NCAR / pynio

PyNIO is a multi-format data I/O package with a NetCDF-style interface
http://www.pyngl.ucar.edu/Nio.shtml
Apache License 2.0
112 stars 37 forks source link

Variable names inferred from grib2 files through Nio.open_file - or xr.open_dataset with engine pynio #19

Open chiaral opened 6 years ago

chiaral commented 6 years ago

I posted this on SO, but I think it belongs here as well.

Below an edited version:

I am using the data found here (Note, these are rotating files of forecast data, so the actual date will change as time goes by, you might need to update the date in my example in a few days)

!wget ftp://ftpprd.ncep.noaa.gov/pub/data/nccf/com/cfs/prod/cfs/cfs.20180524/00/6hrly_grib_01/pgbf2018052400.01.2018052400.grb2

tmp2 = xr.open_dataset('pgbf2018052400.01.2018052400.grb2', 
                   engine='pynio')

which is read as a dataset with some coordinates and variables

<xarray.Dataset>
Dimensions:                   (lat_0: 181, lon_0: 360, lv_AMSL1: 4, lv_HTGL6: 2, lv_ISBL0: 37, lv_ISBL4: 32, lv_ISBL5: 2, lv_PVL2: 2, lv_SIGL3: 4)
Coordinates:
  * lon_0                     (lon_0) float32 0.0 1.0 2.0 3.0 4.0 5.0 6.0 ...
  * lv_ISBL0                  (lv_ISBL0) float32 100.0 200.0 300.0 500.0 ...
  * lv_ISBL5                  (lv_ISBL5) float32 50000.0 100000.0
  * lv_ISBL4                  (lv_ISBL4) float32 1000.0 2000.0 3000.0 5000.0 ...
  * lat_0                     (lat_0) float32 90.0 89.0 88.0 87.0 86.0 85.0 ...
  * lv_PVL2                   (lv_PVL2) float32 -2e-06 2e-06
  * lv_AMSL1                  (lv_AMSL1) float32 1829.0 2743.0 3658.0 4572.0
Dimensions without coordinates: lv_HTGL6, lv_SIGL3
Data variables:
    TOZNE_P0_L200_GLL0        (lat_0, lon_0) float32 ...
    HGT_P0_L100_GLL0          (lv_ISBL0, lat_0, lon_0) float32 ...
    VVEL_P0_L100_GLL0         (lv_ISBL0, lat_0, lon_0) float32 ...
    UGRD_P0_L6_GLL0           (lat_0, lon_0) float32 ...
    HGT_P0_L7_GLL0            (lat_0, lon_0) float32 ...
    CIN_P0_2L108_GLL0         (lat_0, lon_0) float32 ...
    PRES_P0_L109_GLL0         (lv_PVL2, lat_0, lon_0) float32 ...
    UGRD_P0_L7_GLL0           (lat_0, lon_0) float32 ...
    UGRD_P0_L109_GLL0         (lv_PVL2, lat_0, lon_0) float32 ...
    PRES_P0_L7_GLL0           (lat_0, lon_0) float32 ...
    PLI_P0_2L108_GLL0         (lat_0, lon_0) float32 ...

and many other variables... i am not copying here...

my struggle right now is to understand how these variable names (i.e. PLI_P0_2L108_GLL0) are generated.

If I run directly Nio:

f = Nio.open_file('pgbf2018052400.01.2018052400.grb2')
print(f)

I get:

Nio file:   pgbf2018052400.01.2018052400.grb2
   global attributes:
   dimensions:
      lat_0 = 181
      lon_0 = 360
      lv_ISBL0 = 37
      lv_AMSL1 = 4
      lv_PVL2 = 2
      lv_SIGL3 = 4
      lv_ISBL4 = 32
      lv_ISBL5 = 2
      lv_HTGL6 = 2
   variables:
      float TMP_P0_L6_GLL0 [ lat_0, lon_0 ]
         center :   US National Weather Service - NCEP (WMC)
         production_status :    Operational products
         long_name :    Temperature
         units :    K
         _FillValue :   1e+20
         grid_type :    Latitude/longitude
         parameter_discipline_and_category :    Meteorological products, Temperature
         parameter_template_discipline_category_number :    [0, 0, 0, 0]
         level_type :   Maximum wind level
         level :    0
         forecast_time :    0
         forecast_time_units :  hours
         initial_time : 05/24/2018 (00:00)
      float TMP_P0_L7_GLL0 [ lat_0, lon_0 ]

etc...

but the names of the variables are already there. So it seems like these names are defined at Nio level (I think).

However, when I inspect the data with wgrib2 I have the following:

wgrib2 pgbf2018052400.01.2018052400.grb2

1:0:d=2018052400:PRES:mean sea level:anl:
2:66570:d=2018052400:HGT:1 mb:anl:
3:102113:d=2018052400:TMP:1 mb:anl:
4:118939:d=2018052400:RH:1 mb:anl:
5:122609:d=2018052400:SPFH:1 mb:anl:
6:141106:d=2018052400:VVEL:1 mb:anl:
7.1:219137:d=2018052400:UGRD:1 mb:anl:
7.2:219137:d=2018052400:VGRD:1 mb:anl:
8:283014:d=2018052400:ABSV:1 mb:anl:
9:319471:d=2018052400:O3MR:1 mb:anl:
10:372630:d=2018052400:HGT:2 mb:anl:
11:408159:d=2018052400:TMP:2 mb:anl:
12:425397:d=2018052400:RH:2 mb:anl:
13:427490:d=2018052400:SPFH:2 mb:anl:
14:459259:d=2018052400:VVEL:2 mb:anl:
15.1:517010:d=2018052400:UGRD:2 mb:anl:
15.2:517010:d=2018052400:VGRD:2 mb:anl:
16:580043:d=2018052400:ABSV:2 mb:anl:
17:617506:d=2018052400:O3MR:2 mb:anl:
18:672805:d=2018052400:HGT:3 mb:anl:
19:707799:d=2018052400:TMP:3 mb:anl:
20:725593:d=2018052400:RH:3 mb:anl:
21:729682:d=2018052400:SPFH:3 mb:anl:
22:765030:d=2018052400:VVEL:3 mb:anl:
23.1:828383:d=2018052400:UGRD:3 mb:anl:
23.2:828383:d=2018052400:VGRD:3 mb:anl:
24:892759:d=2018052400:ABSV:3 mb:anl:
25:931422:d=2018052400:O3MR:3 mb:anl:
26:979930:d=2018052400:HGT:5 mb:anl:

The variable names are PRES, HGT, ABSV, VGRD, ACPCP, and so on.

The problem I have is that I am trying to run an open_mfdataset in xarray to concatenate along multiple dimensions (in my case forecast start time and forecast lead time) thousands of these little files, unfortunately the variable names (of what should be the same quantity, i.e. convective precipitation, which in the grib file is simply named ACPC) generated by xarray/PyNIO change throughout the dataset going from ACPCP_P8_L1_GLL0_acc, to ACPCP_P8_L1_GLL0_acc6h, to ACPCP_P0_L1_GLL0, making things complicated.

I tried to overwrite the variable names with some preprocessing function, but I wanted to understand the rationale behind those names, to be sure that I am doing it correctly.

So, where does the part of the variable name, like "_P0_L6_GLL0", attached to the grib variable name come from? is there a way to control that?

chiaral commented 6 years ago

After digging in the PyNIO page I found the GRIB2 options, and a description of how the names are created. Apologies for not having gone through it all before.

GRIB2 data variable name encoding (Note: examples show intermediate steps in the formation of the name)

if production status is TIGGE test or operational and matches entry in TIGGE table:
   <parameter_short_name> (ex: t) 
else if entry matching product discipline, parameter category, and parameter number is found:
   <parameter_short_name> (ex: TMP) 
else:
   VAR_<product_discipline_number>_<parameter_category_number>_<parameter_number> (ex: VAR_3_0_9)

_P<product_definition_template_number> (ex: TMP_P0)

if single level type:
   _L<level_type_number> (ex: TMP_P0_L103)
else if two levels of the same type: 
   _2L<level_type_number> (ex: TMP_P0_2L106)
else if two levels of different types: 
   _2L<_first_level_type_number>_<second_level_type_number> (ex: LCLD_P0_2L212_213)

if grid type is supported (fully or partially):
   _G<grid_abbreviation><grid_number> (ex: UGRD_P0_L108_GLC0)
else:
   _G<grid_number> (ex: UGRD_P0_2L104_G0)

if not statistically processed variable and not duplicate name the name is complete at this point.

if statistically-processed variable and constant statistical processing duration:
   if statistical processing type is defined:
      _<statistical_processing_type_abbreviation><statistical_processing_duration><duration_units> (ex: APCP_P8_L1_GLL0_acc3h) 
   else
      _<statistical_processing_duration><duration_units> (ex: TMAX_P8_L103_GCA0_6h)
else if statistically-processed variable and variable-duration processing always begins at initial time:
   _<statistical_processing_type_abbreviation> (ex: ssr_P11_GCA0_acc)

if variable name is duplicate of existing variable name (this should not normally occur):
   _n (where n begins with 1 for first duplicate) (ex: TMAX_P8_L103_GCA0_6h_1)

You also note on the same page linked above "Generally speaking all of the applicable characteristics must be the same for two records to be considered part of the same variable."

Is there a way/option to limit it to, for example, TMAX?

I am very new to GRIB2, and I usually extracted data using WGRIB2, or pygrib, using shortname to access the variable i am interested in. I am just curious to know why the choice to add these attributes (, for example) to the definition of the name? I know that GRIB2 files format is a bit of a wild west, and that this might make sense for the majority of the times, but some time we might not want that behavior.

Do you think this is a feature you would be interested in developing (meaning, limit the name to the variable name to just, i.e., parameter_short_name)?

That would make the usage of PyNIO within open_mfdataset() flawless, whereas now I need to write ad-hoc preprocessing function to change the variable name constructed by PyNIO.

thanks!

david-ian-brown commented 6 years ago

There is an Nio option that you can set called TimePeriodSuffix. It does not seem to be mentioned in the PyNIO documentation, but here is what the NCL documentation says. I don't think there is any reason it would not be available for PyNIO as well. It may not do everything you want, but it may help.

*TimePeriodSuffix*Default value: True

This option applies to GRIB1- and GRIB2-formatted files. A value of True indicates that statistically-processed variables such as averages and accumulations have a time period and time unit added after the suffix indicating the statistical variable type. For example, the suffix "_avg3h" represents a 3 hour average. These suffixes are required to uniquely characterize otherwise identically-named variables that have different periods and/or units within the same file. However, when concatenating variables from different files using the addfiles http://www.ncl.ucar.edu/Document/Functions/Built-in/addfiles.shtml function differences in these suffixes can prevent individual variables from being concatenated into a single composite variable when it is actually desireable. Setting this option to False removes the time period and units from the variable name leaving only the statistical processing type (e.g. "_avg" for an average or "_acc" for an accumulation).

On Fri, May 25, 2018 at 8:32 AM, chiaral notifications@github.com wrote:

After digging in the PyNIO page I found the GRIB2 options https://www.pyngl.ucar.edu/NioFormats.shtml#GRIB2-support-details, and a description of how the names are created. Apologies for not having gone through it all before.

GRIB2 data variable name encoding (Note: examples show intermediate steps in the formation of the name)

if production status is TIGGE test or operational and matches entry in TIGGE table:

(ex: t) else if entry matching product discipline, parameter category, and parameter number is found: (ex: TMP) else: VAR___ (ex: VAR_3_0_9) _P (ex: TMP_P0) if single level type: _L (ex: TMP_P0_L103) else if two levels of the same type: _2L (ex: TMP_P0_2L106) else if two levels of different types: _2L<_first_level_type_number>_ (ex: LCLD_P0_2L212_213) if grid type is supported (fully or partially): _G (ex: UGRD_P0_L108_GLC0) else: _G (ex: UGRD_P0_2L104_G0) if not statistically processed variable and not duplicate name the name is complete at this point. if statistically-processed variable and constant statistical processing duration: if statistical processing type is defined: _ (ex: APCP_P8_L1_GLL0_acc3h) else _ (ex: TMAX_P8_L103_GCA0_6h) else if statistically-processed variable and variable-duration processing always begins at initial time: _ (ex: ssr_P11_GCA0_acc) if variable name is duplicate of existing variable name (this should not normally occur): _n (where n begins with 1 for first duplicate) (ex: TMAX_P8_L103_GCA0_6h_1) You also note on the same page linked above "Generally speaking all of the applicable characteristics must be the same for two records to be considered part of the same variable." Is there a way/option to limit it to, for example, TMAX? I am very new to GRIB2, and I usually extracted data using WGRIB2, or pygrib, using shortname to access the variable i am interested in. I am just curious to know why the choice to add these attributes (, for example) to the definition of the name? I know that GRIB2 files format is a bit of a wild west, and that this might make sense for the majority of the times, but some time we might not want that behavior. Do you think this is a feature you would be interested in developing (meaning, limit the name to the variable name to just, i.e., parameter_short_name)? That would make the usage of PyNIO within open_mfdataset() flawless, whereas now I need to write ad-hoc preprocessing function to change the variable name constructed by PyNIO. thanks! — You are receiving this because you are subscribed to this thread. Reply to this email directly, view it on GitHub , or mute the thread .
chiaral commented 6 years ago

Thanks! indeed this helped a little:

using the one example on the webpage and scrolling down to Nio Usage:

import Nio
Nio.option_defaults

{'CompressionLevel': -1,
 'DefaultNCEPPtable': 'operational',
 'ExplicitFillValues': None,
 'Format': 'classic',
 'HeaderReserveSpace': 0,
 'InitialTimeCoordinateType': 'numeric',
 'MaskAboveValue': None,
 'MaskBelowValue': None,
 'MaskedArrayMode': 'MaskedIfFillAtt',
 'MissingToFillValue': True,
 'PreFill': True,
 'SafeMode': False,
 'SingleElementDimensions': 'none',
 'ThinnedGridInterpolation': 'cubic',
 'TimePeriodSuffix': True,
 'UseAxisAttribute': False}

opt = Nio.options()
opt.TimePeriodSuffix = False

Then running f = Nio.open_file('gribfile.grb2', options=opt) gets rid of the hour. I still would like to get rid of the whole P0..... but at least it helps reducing some differences.

chiaral commented 6 years ago

For future references, in order to modify any of Nio's options when we use it as an engine in xr.open_mfdataset(), one needs to modify the default values:

Nio.option_defaults['TimePeriodSuffix']=True

NicWayand commented 5 years ago

Hi @chiaral, have you gotten clean short variable names working with the new pynio backend_kwargs? I still get TypeError: __init__() got an unexpected keyword argument 'TimePeriodSuffix' errors. It doesn't seem to like any of my Nio backend keywords. I also can't get Nio.option_defaults['TimePeriodSuffix']=True to effect my xr.open_dataset(...). Although using Nio.open_file(..) with option, does drop the suffix as expect.