ecmwf / cfgrib

A Python interface to map GRIB files to the NetCDF Common Data Model following the CF Convention using ecCodes
Apache License 2.0

Add support to open multiple GRIB files as a single Stream / Dataset #15

Open · alexamici opened this issue 6 years ago

alexamici commented 6 years ago

At the low level we use an explicit file path and file offset in several places.

Note that xr.open_mfdataset handles opening and merging of multiple files without any additional support from the low-level driver, so this feature is low priority.
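
For illustration, a manual equivalent built on a single-file opener looks roughly like this (file names and the concat dimension are hypothetical):

import xarray as xr
import cfgrib

# Open each GRIB file on its own, then let xarray concatenate along a
# shared dimension -- roughly what xr.open_mfdataset does on top of a
# driver that only knows how to open one file at a time.
paths = ['run1.grib', 'run2.grib']
datasets = [cfgrib.open_dataset(p) for p in paths]
combined = xr.concat(datasets, dim='step')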

aolt commented 6 years ago

I was thinking of using cfgrib to convert a lot of GRIB files into one big xarray dataset and save it all to Zarr. I would really benefit from having this feature, because it would save me the intermediate step of converting the GRIB files into NetCDF for later processing by xarray. Any info on when, approximately, this will be available?
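
Roughly the pipeline I have in mind (paths are hypothetical), assuming multi-file GRIB support lands in xarray:

import xarray as xr

# Open many GRIB files lazily as one dataset, then write the combined
# result straight to a Zarr store.
ds = xr.open_mfdataset('data/*.grib', engine='cfgrib')
ds.to_zarr('output.zarr')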

alexamici commented 6 years ago

@aolt we intend to prepare a Pull Request to add GRIB support via cfgrib to xarray. If and when it is accepted, you will be able to use the xarray.open_mfdataset API directly.

I have no ETA yet, but becoming a first-class driver in xarray is one of the main targets of the project.

alexamici commented 5 years ago

A cfgrib backend has just been included in xarray:

https://github.com/pydata/xarray/pull/2476

With the upcoming v0.11 you will be able to:

>>> ds = xr.open_mfdataset(['file1.grib', 'file2.grib'], engine='cfgrib', concat_dim='step')

aolt commented 5 years ago

Great! It works fine with small files, but I get a MemoryError on many big files. Is it possible to make it work the same "lazy" way the NetCDF backend does?

>>> xr.__version__
'0.11.0'

pip list | grep cfgrib
cfgrib           0.9.3.1   

python -m cfgrib selfcheck
Found: ecCodes v2.6.0.
Your system is ready.

python -V
Python 3.7.0

alexamici commented 5 years ago

@aolt the theory was that everything was lazy already... but in practice I noticed yesterday a really dumb bug that was unconditionally loading the whole dataset into memory at open time 🤦‍♂️

The bug is fixed in version 0.9.4, please upgrade and try again.

I'm currently running a mean over 320GB of GRIB files on 10 dask.distributed nodes, so I'm confident it's working now :)
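
For reference, the rough shape of that computation (the scheduler address, paths, chunking, and dimension names are all hypothetical):

import xarray as xr
from dask.distributed import Client

# Connect to a running dask.distributed scheduler, open the GRIB
# files lazily, and reduce out of core across the workers.
client = Client('tcp://scheduler:8786')
ds = xr.open_mfdataset('archive/*.grib', engine='cfgrib',
                       chunks={'step': 1})
result = ds.mean(dim='step').compute()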

alexamici commented 5 years ago

Even if there is some merit in opening several GRIB files as a single cfgrib.Dataset, I'm changing this to wontfix, as xarray.open_mfdataset is what almost everybody really wants.

ShaneMill1 commented 4 years ago

Hello, I have a quick question regarding this topic. I notice that cfgrib has the following ability:

cfgrib also provides a function, cfgrib.open_datasets(), that automates the selection of appropriate filter_by_keys and returns a list of all valid xarray.Datasets in the GRIB file.

I wanted to ask if this works only for a single GRIB file, or if it is possible to supply a path that will create the datasets, similar to xarray.open_mfdataset(). I'm looking for the ability to automate the selection of filter_by_keys while also opening multiple files to create the datasets.

Thanks!

EDIT: it appears that cfgrib.open_datasets() only handles one GRIB file at a time. However, if you have a directory of GRIB files, you can "cat" them into a single GRIB file and then read that with cfgrib.open_datasets(), as sketched below.
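
A rough sketch of that workaround (paths are hypothetical); it relies on a GRIB file being a plain sequence of self-contained messages, so byte-level concatenation yields a valid file:

import glob
import shutil

import cfgrib

# Equivalent of `cat data/*.grib > combined.grib`: GRIB messages are
# self-contained, so simple byte concatenation produces a valid file.
with open('combined.grib', 'wb') as out:
    for path in sorted(glob.glob('data/*.grib')):
        with open(path, 'rb') as src:
            shutil.copyfileobj(src, out)

# Let cfgrib split the combined file into homogeneous datasets.
datasets = cfgrib.open_datasets('combined.grib')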