ecmwf / cfgrib

A Python interface to map GRIB files to the NetCDF Common Data Model following the CF Convention using ecCodes
Apache License 2.0
409 stars 77 forks source link

open_datasets() merges variables together if data is equal #336

Closed Metamess closed 1 year ago

Metamess commented 1 year ago

When a GRIB file contains multiple messages describing data variables with the same paramId and typeOfLevel, but differing other attributes, these variables will be combined into one DataArray by the cfgrib.open_datasets() function if their data also matches.

An example of this is NCEP GFS Forecasts (downloadable here), which provide variables such as cprat and cfrzr both with stepType = instant and stepType = avg. In the vast majority of such cases, thanks to the join="exact" parameter passed to xarray's merge function, this results in an output of two datasets, each with the same dimensions, and each with a variable of that name, but with different GRIB_stepType attributes and different data on each. This is the desired result.

But if the data (usually a numoy ndarray of floats) is equal, these two variables get merged into one, with the last one being read updating the DataArray's attrs. This means one less variable and potentially even one less dataset in the output than there should be. At first glance, this might indeed seem like a very rare occurrence, but sadly it happens quite frequently. To stick with the NCEP example, variables such as crain, csnow, cicep and cfrzr, which all denote Categorical Precipitation types, are boolean in nature, and thus have either 0 or 1 as value. Depending on the region and time of year, both the average and instant values for Categorical Freezing Rain will quite often be all 0's, causing the disappearance of an expected variable in the output.

I believe the cause to be this: open_datasets calls the function merge_datasets in line 115 as such: for ds in merge_datasets(type_of_level_datasets[type_of_level], join="exact"): This function (link) only checks if the attributes of the dataset match before attempting to merge them, and uses join="exact" to prevent merges such as the one described above. However, this parameter only concerns the data values, and does not compare attributes on the DataArrays that make up the Dataset.

Since version 0.17.0 (released Feb 2021), xarray's merge function now also supports a combine_attrs parameter, which performs similarly to the join parameter, but for attributes. Adding this parameter to the kwargs passed to merge_datasets, set to combine_attrs="identical" (or combine_attrs="no_conflicts" for slightly more lenient matching), resolves this issue and correctly returns two separate variables in two separate datasets.

I would like to note that this issue has some similarities to #268 , and that it is indeed technically possible to still retrieve the data from the GRIB files using filter_by_keys for different stepType's. However, the current behaviour of open_datasets still appears to be a bug, and as such I believe it worth reporting

iainrussell commented 1 year ago

Hi @Metamess, thank you for this comprehensive description - I think we'll shortly have your fix merged.