When a GRIB file contains multiple messages describing data variables with the same paramId and typeOfLevel, but differing other attributes, these variables will be combined into one DataArray by the cfgrib.open_datasets() function if their data also matches.
An example of this is NCEP GFS Forecasts (downloadable here), which provide variables such as cprat and cfrzr both with stepType = instant and stepType = avg. In the vast majority of such cases, thanks to the join="exact" parameter passed to xarray's merge function, this results in an output of two datasets, each with the same dimensions, and each with a variable of that name, but with different GRIB_stepType attributes and different data on each. This is the desired result.
But if the data (usually a numoy ndarray of floats) is equal, these two variables get merged into one, with the last one being read updating the DataArray's attrs. This means one less variable and potentially even one less dataset in the output than there should be. At first glance, this might indeed seem like a very rare occurrence, but sadly it happens quite frequently. To stick with the NCEP example, variables such as crain, csnow, cicep and cfrzr, which all denote Categorical Precipitation types, are boolean in nature, and thus have either 0 or 1 as value. Depending on the region and time of year, both the average and instant values for Categorical Freezing Rain will quite often be all 0's, causing the disappearance of an expected variable in the output.
I believe the cause to be this:
open_datasets calls the function merge_datasetsin line 115 as such:
for ds in merge_datasets(type_of_level_datasets[type_of_level], join="exact"):
This function (link) only checks if the attributes of the dataset match before attempting to merge them, and uses join="exact" to prevent merges such as the one described above. However, this parameter only concerns the data values, and does not compare attributes on the DataArrays that make up the Dataset.
Since version 0.17.0 (released Feb 2021), xarray's merge function now also supports a combine_attrs parameter, which performs similarly to the join parameter, but for attributes. Adding this parameter to the kwargs passed to merge_datasets, set to combine_attrs="identical" (or combine_attrs="no_conflicts" for slightly more lenient matching), resolves this issue and correctly returns two separate variables in two separate datasets.
I would like to note that this issue has some similarities to #268 , and that it is indeed technically possible to still retrieve the data from the GRIB files using filter_by_keys for different stepType's. However, the current behaviour of open_datasets still appears to be a bug, and as such I believe it worth reporting
When a GRIB file contains multiple messages describing data variables with the same
paramId
andtypeOfLevel
, but differing other attributes, these variables will be combined into one DataArray by thecfgrib.open_datasets()
function if their data also matches.An example of this is NCEP GFS Forecasts (downloadable here), which provide variables such as
cprat
andcfrzr
both withstepType = instant
andstepType = avg
. In the vast majority of such cases, thanks to thejoin="exact"
parameter passed to xarray's merge function, this results in an output of two datasets, each with the same dimensions, and each with a variable of that name, but with differentGRIB_stepType
attributes and different data on each. This is the desired result.But if the data (usually a numoy ndarray of floats) is equal, these two variables get merged into one, with the last one being read updating the DataArray's
attrs
. This means one less variable and potentially even one less dataset in the output than there should be. At first glance, this might indeed seem like a very rare occurrence, but sadly it happens quite frequently. To stick with the NCEP example, variables such ascrain
,csnow
,cicep
andcfrzr
, which all denote Categorical Precipitation types, are boolean in nature, and thus have either 0 or 1 as value. Depending on the region and time of year, both the average and instant values for Categorical Freezing Rain will quite often be all 0's, causing the disappearance of an expected variable in the output.I believe the cause to be this:
open_datasets
calls the functionmerge_datasets
in line 115 as such:for ds in merge_datasets(type_of_level_datasets[type_of_level], join="exact"):
This function (link) only checks if the attributes of the dataset match before attempting to merge them, and usesjoin="exact"
to prevent merges such as the one described above. However, this parameter only concerns the data values, and does not compare attributes on the DataArrays that make up the Dataset.Since version 0.17.0 (released Feb 2021), xarray's merge function now also supports a
combine_attrs
parameter, which performs similarly to thejoin
parameter, but for attributes. Adding this parameter to the kwargs passed tomerge_datasets
, set tocombine_attrs="identical"
(orcombine_attrs="no_conflicts"
for slightly more lenient matching), resolves this issue and correctly returns two separate variables in two separate datasets.I would like to note that this issue has some similarities to #268 , and that it is indeed technically possible to still retrieve the data from the GRIB files using
filter_by_keys
for different stepType's. However, the current behaviour ofopen_datasets
still appears to be a bug, and as such I believe it worth reporting