[!WARNING] This code is at a very early stage! It won't do anything useful for a while!
This project is inspired by kerchunk, VirtualiZarr, dynamical.org, gribberish, xarray, NODD, Camus Energy's work with GRIB files, and many other great projects!
There are tens of petabytes of GRIB datasets in public cloud object stores. Wouldn't it be nice to be able to lazily open these datasets as easily as possible?!
For example, the NOAA Open Data Dissemination (NODD) programme has shared 59 petabytes so far (and growing rapidly), and ECMWF are also busily sharing the bulk of their forecasts on cloud object storage.
One ultimate dream is to be able to train large machine learning models directly from GRIBs on cloud object storage.
For more info on the background and motivation for hypergrib, please see this blog post.
xr.open_dataset
.hypergrib
For the planned design, please see design.md.
Why does hypergrib exist?
At least to start with, hypergrib is an experiment (which stands on the shoulders of giants like gribberish, kerchunk, Zarr, xarray, VirtualiZarr, etc.). The question we're asking with this experiment is: How fast can we go if we "cheat" by building a special-purpose tool focused on reading multi-file GRIB datasets from cloud object storage? Let's throw in all the performance tricks we can think of. And let's also bake in a bunch of domain knowledge about GRIBs. We're explicitly not trying to build a general-purpose tool like the awesome kerchunk. If hypergrib is faster than existing approaches, then maybe ideas from hypergrib could be merged into existing tools, and hypergrib will remain a testing ground rather than a production tool. Or maybe hypergrib will mature into a tool that can be used in production.
Reading directly from GRIBs will probably be sufficient for a lot of use-cases.
There are read-patterns which will never be well-served by reading from GRIBs (because of the way the data is structured on disk). For example, reading a long timeseries for a single geographical point will involve reading about one million times more data from disk than you need (assuming each 2D GRIB message is 1,000 x 1,000 pixels). So, even if you sustain 20 gigabytes per second from GRIBs in object storage, you'll only get 20 kilobytes per second of useful data! For these use-cases, the data will almost certainly have to be converted to something like Zarr. (And, hopefully, hypergrib will help make the conversion from GRIB to Zarr as efficient as possible.)
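The arithmetic behind that estimate can be sketched in a few lines of Python. This is only a back-of-the-envelope illustration: the function name `useful_throughput` is hypothetical, and it assumes that every pixel occupies the same number of bytes and that a whole GRIB message must be read to extract one pixel.

```python
def useful_throughput(raw_bytes_per_sec: float, pixels_per_message: int) -> float:
    """Bytes per second of wanted data when only one pixel per message is needed.

    Assumption: the whole message must be read to extract a single pixel,
    and all pixels take up the same number of bytes.
    """
    return raw_bytes_per_sec / pixels_per_message


# Figures taken from the paragraph above.
pixels_per_message = 1_000 * 1_000  # one 2D GRIB message of 1,000 x 1,000 pixels
raw_bytes_per_sec = 20e9            # 20 gigabytes per second sustained from object storage

useful = useful_throughput(raw_bytes_per_sec, pixels_per_message)

print(f"Read amplification: {pixels_per_message:,}x")
print(f"Useful throughput: {useful / 1e3:.0f} kB/s")
```

Changing `pixels_per_message` to match a real grid shows how quickly the useful throughput falls as the grid gets larger.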
(That said, we're keen to explore ways to slice into each GRIB message. For example, some GRIBs are compressed with JPEG2000, and JPEG2000 allows parts of the image to be decompressed. And maybe, whilst making the manifest, we could decompress each GRIB file and save the state of the decompressor every, say, 4 kB. Then, at query time, if we want a single pixel, we'd have to stream at most 4 kB of data from disk. Although that has its own issues.)
hypergrib uses "hyper" in its mathematical sense, like hypercube (an n-dimensional cube). Oh, and it's reminiscent of a very cool record label, too :)
GDAL's CSV representation of the GRIB tables. See the README for that directory.