ProjectPythia / kerchunk-cookbook

Project Pythia cookbook for Kerchunk
https://projectpythia.org/kerchunk-cookbook/
Apache License 2.0

Added a notebook for faster GRIB aggregations. #64

Open Anu-Ra-g opened 2 weeks ago

Anu-Ra-g commented 2 weeks ago

This notebook was developed as part of Google Summer of Code 2024. It describes how we can build large aggregations of the GRIB files hosted in the NODD program in a short amount of time. The functions and operations used in this notebook will be part of the new version of kerchunk. The notebook is still in a draft phase because it depends on PR #63 being merged first: the old pre-commit configuration is failing on the commits, and some other updates are needed.

github-actions[bot] commented 2 weeks ago

👋 Thanks for opening this PR! The Cookbook will be automatically built with GitHub Actions. To see the status of your deployment, click below. 🔍 Git commit SHA: adad038a0f36dcfd3fc2f034d8d78e0c4d506e33 ✅ Deployment Preview URL: In Progress

norlandrhagen commented 2 weeks ago

Happy to merge and/or review this whenever it's ready!

Anu-Ra-g commented 2 weeks ago

@norlandrhagen The notebook is ready but it needs review. Could you please review it and suggest changes?

norlandrhagen commented 2 weeks ago

Looks great @Anu-Ra-g!

A few small comments and copy edits. Thanks for contributing. It is probably worth asking @martindurant for a review of the implementation since he is the GRIB + Kerchunk guru.

Overview:

In this tutorial we are going to demonstrate building kerchunk aggregations of NODD grib2 weather forecasts fast. -> In this tutorial we are going to demonstrate quickly building kerchunk aggregations of NODD grib2 weather forecasts.


This workflow primarily involves xarray-datatree, pandas and grib_tree function released in kerchunkv0.2.3 for the operation. -> This workflow primarily involves xarray-datatree, pandas and the new grib_tree function released in kerchunk v0.2.3.


For this operation we will be looking at GRIB2 files generated by NOAA Global Ensemble Forecast System (GEFS), is a weather forecast model made up of 21 separate forecasts, or ensemble members. With global coverage, GEFS is produced four times a day with weather forecasts going out to 16 days, with an update frequency of 4 times a day, every 6 hours starting at midnight. -> In this notebook we will be looking at GRIB2 files generated by the NOAA Global Ensemble Forecast System (GEFS). This dataset is a weather forecast model made up of 21 separate forecasts, or ensemble members. GEFS has global coverage and is produced four times a day, every 6 hours starting at midnight, with forecasts going out to 16 days.


For building the aggregation, we're going to build a hierarchical data model, to view the whole dataset ,from a set of scanned grib messages with the help of grib_tree function. This data model can be opened directly using either zarr or xarray datatree. This way of building the aggregation is very slow. Here we're going to use xarray-datatree to open and view it. -> We are using the newly implemented Kerchunk grib_tree function to build a hierarchical data model from a set of scanned grib messages. This data model can be opened directly using either zarr or xarray-datatree. This way of building the aggregation, by scanning every message, is very slow. Here we're going to use xarray-datatree to open and view it:
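
For reference, a minimal sketch of what this step might look like, assuming the public scan_grib/grib_tree API and an illustrative GEFS path (not taken from the notebook):

```python
import fsspec
import datatree
from kerchunk.grib2 import scan_grib, grib_tree

# Scan every message in one GEFS GRIB2 file -- this is the slow, exhaustive step.
gefs_url = "s3://noaa-gefs-pds/gefs.20170101/00/gec00.t00z.pgrb2af006"  # illustrative path
scanned_messages = scan_grib(gefs_url, storage_options={"anon": True})

# Merge the per-message references into one hierarchical (tree) reference set.
tree_refs = grib_tree(scanned_messages)

# Open the tree through a kerchunk "reference" filesystem and view it with datatree.
fs = fsspec.filesystem(
    "reference", fo=tree_refs, remote_protocol="s3", remote_options={"anon": True}
)
dt = datatree.open_datatree(fs.get_mapper(""), engine="zarr", consolidated=False)
```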


Every NODD cloud platform stores the grib file along with its .idx(index) file, in text format. The purpose of using the idx files in the aggregation is that the k(erchunk) index data looks a lot like the idx files that are present for every grib file in NODD's GCS and AWS archive though. This way of building of aggregation only works for a particular horizon file irrespective of the run time of the model.

->

Accompanying each NODD GRIB file is its .idx (index) file, stored as plain text. Kerchunk can use it as a shortcut to build references without scanning the entire GRIB message.

Note: This way of building the aggregation only works for a particular horizon file, irrespective of the run time of the model.
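
As a rough illustration of what a .idx file contains, it can be read with plain pandas; the path and column names below are illustrative rather than the notebook's own parsing:

```python
import pandas as pd

# Each line of a .idx file describes one GRIB message: message number, byte offset,
# reference date, variable, level, and forecast step, separated by colons.
idx_url = "s3://noaa-gefs-pds/gefs.20170101/00/gec00.t00z.pgrb2af006.idx"  # illustrative
idx = pd.read_csv(
    idx_url,
    sep=":",
    header=None,
    names=["message", "offset", "date", "variable", "level", "forecast", "member"],
    storage_options={"anon": True},
)

# The byte range of each message falls out of consecutive offsets -- exactly the
# information a kerchunk reference needs, without opening the GRIB file itself.
idx["length"] = idx["offset"].shift(-1) - idx["offset"]
```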


Now we're going to need a mapping from our grib/zarr metadata(stored in the grib_tree output) to the attributes in the idx files. They are unique for each time horizon e.g. we need to build a unique mapping for the 1 hour forecast, the 2 hour forecast and so on. So in this step we're going to create a mapping for a single grib file and its corresponding idx files in order, which will be used in later steps for building the aggregation.

Before that let's see what grib data we're extracting from the datatree. The metadata that we'll be extracting will be static in nature. We're going to use a single node by accessing it.

->

Now we're going to need a mapping from our GRIB/Zarr metadata (stored in the grib_tree output) to the attributes in the .idx files. They are unique for each time horizon, so we need to build a unique mapping for the 1-hour forecast, the 2-hour forecast and so on. In this step we are going to create a mapping for a single GRIB file and its corresponding .idx file, which will be used in later steps to build the aggregation.

We'll start by examining the GRIB data. The metadata we'll extract is static in nature; we'll access a single node of the datatree to examine it.
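
A minimal sketch of the mapping step, assuming a build_idx_grib_mapping helper in kerchunk.grib2 (the name and signature are assumptions based on the new functionality this notebook depends on):

```python
import fsspec
from kerchunk.grib2 import build_idx_grib_mapping  # assumed name from the new kerchunk release

fs = fsspec.filesystem("s3", anon=True)

# Build a one-time mapping between the static GRIB/Zarr metadata (from the grib_tree
# output) and the attributes found in the matching .idx file. The mapping is specific
# to one forecast horizon, e.g. the 6-hour file used here.
mapping = build_idx_grib_mapping(
    fs,
    "s3://noaa-gefs-pds/gefs.20170101/00/gec00.t00z.pgrb2af006",  # illustrative path
)
```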


Now if we parse the runtime from the idx file , we can build a fully compatible k_index(kerchunk index) for that particular file. Before creating the index, we need to clean some of the data in the mapping and index dataframe for the some variables as they tend to contain duplicate values, as demonstrated below.

->

Now that we have parsed the runtime from the .idx file, we can build a fully compatible Kerchunk index (k_index) for each file. Before creating the index, we need to clean some of the data in the mapping and index dataframes, as some variables tend to contain duplicate values.
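
Roughly, building the per-file k_index might look like the sketch below; parse_grib_idx, map_from_index, and the attrs column are assumptions based on the new kerchunk helpers, and mapping is the DataFrame from the previous step:

```python
import pandas as pd
import fsspec
from kerchunk.grib2 import parse_grib_idx, map_from_index  # assumed helper names

fs = fsspec.filesystem("s3", anon=True)

# Parse the .idx that accompanies one GRIB file; the run time is recovered from it.
idxdf = parse_grib_idx(
    fs, "s3://noaa-gefs-pds/gefs.20170101/00/gec00.t00z.pgrb2af006"  # illustrative path
)

# Some variables repeat within a file, so drop duplicates from both the mapping and
# the parsed index before combining them ("attrs" is an assumed column name).
deduped_mapping = mapping.loc[~mapping["attrs"].duplicated(keep="first"), :]
deduped_idx = idxdf.loc[~idxdf["attrs"].duplicated(keep="first"), :]

# Combine the run time, the mapping and the parsed index into a kerchunk index
# (k_index) for this single file.
k_index = map_from_index(pd.Timestamp("2017-01-01T00"), deduped_mapping, deduped_idx)
```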


For the final step of the aggregation, we will create an index for each GRIB file to cover a two-month period starting from the specified date and convert it into one combined index and we can store this index for later use. We will be using the 6-hour horizon file for building the aggregation, this will be from 2017-01-01 to 2017-02-28. This is because as we already know this way of aggregation only works for a particular horizon file. With this way of building the aggregation we can index a whole of forecasts.

->

For the final step of the aggregation, we will create an index for each GRIB file covering a two-month period and combine them into one index, which we can store for later use. We will be using the 6-hour horizon file for building the aggregation; the period will be from 2017-01-01 to 2017-02-28.
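
A sketch of the final aggregation loop, reusing the assumed helpers from the previous steps; the NODD path template is illustrative:

```python
import pandas as pd
import fsspec
from kerchunk.grib2 import parse_grib_idx, map_from_index  # assumed helper names

fs = fsspec.filesystem("s3", anon=True)

def k_index_for_run(run_time: pd.Timestamp, mapping: pd.DataFrame) -> pd.DataFrame:
    """Build the k_index for the 6-hour horizon file of a single model run."""
    path = (
        f"s3://noaa-gefs-pds/gefs.{run_time:%Y%m%d}/{run_time:%H}/"
        f"gec00.t{run_time:%H}z.pgrb2af006"  # illustrative NODD path template
    )
    idxdf = parse_grib_idx(fs, path)
    return map_from_index(run_time, mapping, idxdf)

# One index per run over the two-month window, then a single combined index.
runs = pd.date_range("2017-01-01", "2017-02-28", freq="6h")
combined = pd.concat(
    [k_index_for_run(t, deduped_mapping) for t in runs],  # deduped_mapping from above
    ignore_index=True,
)
combined.to_parquet("gefs_gec00_f006_k_index.parquet")  # store the aggregation for later use
```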


The difference between idx and k_index(kerchunk index) that we built in the above in the above step, is that the former indexes the grib messages and the latter indexes the variables in those messages. Now we'll need a tree model from grib_tree function to reinflate the part or the whole of the index i.e. variables in the messages as per our needs. The important point to note here is that the tree model should be made from the grib file(s) of the repository that we are indexing.

->

The difference between the .idx file and the Kerchunk index that we built is that the former indexes the GRIB messages while the latter indexes the variables in those messages. Now we'll need a tree model from the grib_tree function to reinflate part or all of the index (i.e. the variables in the messages) as needed. The important point to note here is that the tree model should be made from the GRIB file(s) of the repository that we are indexing.
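
Purely as a hypothetical illustration of the re-inflation step (reinflate_grib_store and AggregationType are assumed names based on the new kerchunk functionality referenced in the build failure below; the real API may differ):

```python
import pandas as pd
from kerchunk.grib2 import AggregationType, reinflate_grib_store  # assumed names

# Re-inflate only a subset of the combined index (e.g. one variable) back into a
# reference store, using the tree model built from this repository's GRIB files.
axes = [pd.Index(combined["valid_time"].unique(), name="valid_time")]  # assumed column
subset = combined.loc[combined["varname"] == "t2m"]                    # assumed column/name
refs = reinflate_grib_store(
    axes=axes,
    aggregation_type=AggregationType.HORIZON,  # aggregate along a fixed forecast horizon
    chunk_index=subset,
    zarr_ref_store=tree_refs,  # the grib_tree output from the same repository's files
)
```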


Anu-Ra-g commented 2 weeks ago

@norlandrhagen I've made the suggested changes and tweaked some of the suggestions slightly.

norlandrhagen commented 2 weeks ago

Nice. It looks like the book build is failing with:

ImportError: cannot import name 'AggregationType' from 'kerchunk.grib2' (/home/runner/miniconda3/envs/kerchunk-cookbook/lib/python3.10/site-packages/kerchunk/grib2.py)

Maybe the kerchunk version needs to be bumped?

martindurant commented 2 weeks ago

@Anu-Ra-g , do you need a release of kerchunk? I'll merge your waiting PRs now, and we can handle any cleanup later.

Anu-Ra-g commented 2 weeks ago

@martindurant I've opened PRs #497, #498, and #499 to support this notebook.