blaylockbk / Herbie

Download numerical weather prediction datasets (HRRR, RAP, GFS, IFS, etc.) from NOMADS, NODD partners (Amazon, Google, Microsoft), ECMWF open data, and the University of Utah Pando Archive System.
https://herbie.readthedocs.io/
MIT License

Investigate: Learn about Kerchunk and figure out if Herbie could benefit #147

Open blaylockbk opened 1 year ago

blaylockbk commented 1 year ago

I keep hearing about kerchunk. I'm certain people smarter than me are working on it, and it may provide more efficient data access. Perhaps Herbie (and GOES-2-go) could benefit from using it.

https://github.com/fsspec/kerchunk

blaylockbk commented 1 year ago

Kerchunk can "scan" a file and determine the byte range of each message without downloading the file.

I was able to get the byte ranges for a GRIB2 file this way:

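import fsspec  # kerchunk opens remote files through fsspec
import pandas as pd
from kerchunk.grib2 import scan_grib

# Optional filter: keep only messages at 2 m and 10 m above ground
afilter = {"typeOfLevel": "heightAboveGround", "level": [2, 10]}
so = {"anon": True}  # anonymous access to the public bucket

# Scan the remote file; returns one set of references per GRIB message
idx = scan_grib(
    "s3://noaa-hrrr-bdp-pds/hrrr.20230630/conus/hrrr.t00z.wrfsfcf01.grib2",
    storage_options=so,
    # filter=afilter,
)

# Each reference is [url, offset, length]; drop the url
df = pd.DataFrame(
    [i["refs"]["latitude/0.0"][1:] for i in idx], columns=["startByte", "bytes"]
)
# Exclusive end of each message. Unlike a cumulative sum of the lengths,
# this stays correct when a filter skips messages.
df["endByte"] = df["startByte"] + df["bytes"]
# Assumes the variable's key is the fourth entry in each message's refs
df["varName"] = [list(i["refs"].keys())[3].split("/")[0] for i in idx]
df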
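
Since each reference carries an offset and a length, turning it into something a subset download could use is simple arithmetic. Below is a minimal sketch, not Herbie or kerchunk code: `to_http_range` is a hypothetical helper, and the URL and numbers in `example_refs` are made up. It builds an HTTP `Range` header value, whose end byte is inclusive (hence the `- 1`).

```python
def to_http_range(offset: int, length: int) -> str:
    """Build an HTTP Range header value for `length` bytes starting at `offset`.

    Hypothetical helper for illustration; not part of Herbie or kerchunk.
    """
    if offset < 0 or length <= 0:
        raise ValueError("offset must be >= 0 and length must be > 0")
    # HTTP byte ranges are inclusive on both ends, hence the -1.
    return f"bytes={offset}-{offset + length - 1}"


# Made-up kerchunk-style references: [url, offset, length].
example_refs = [
    ["s3://example-bucket/example.grib2", 0, 1000],
    ["s3://example-bucket/example.grib2", 1000, 250],
]

ranges = [to_http_range(offset, length) for _url, offset, length in example_refs]
print(ranges)  # ['bytes=0-999', 'bytes=1000-1249']
```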

However, scanning the full file took ~4 minutes! If you use the filters you can cut this down to about 20 seconds.
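
To make the filter's meaning concrete, here is a rough sketch of how a dictionary like `afilter` could be matched against a message's keys. This is an assumption about the semantics (a scalar must match exactly, a list means "any of these"), not kerchunk's actual implementation. `matches_filter` and the sample `messages` are hypothetical.

```python
def matches_filter(message: dict, afilter: dict) -> bool:
    """Return True if `message` satisfies every key in `afilter`.

    Hypothetical helper: a scalar filter value must match exactly,
    while a list/tuple/set matches if the message value is one of its members.
    """
    for key, wanted in afilter.items():
        if key not in message:
            return False
        if isinstance(wanted, (list, tuple, set)):
            if message[key] not in wanted:
                return False
        elif message[key] != wanted:
            return False
    return True


afilter = {"typeOfLevel": "heightAboveGround", "level": [2, 10]}

# Made-up message metadata, for illustration only.
messages = [
    {"shortName": "2t", "typeOfLevel": "heightAboveGround", "level": 2},
    {"shortName": "10u", "typeOfLevel": "heightAboveGround", "level": 10},
    {"shortName": "t", "typeOfLevel": "isobaricInhPa", "level": 500},
    {"shortName": "sp", "typeOfLevel": "surface", "level": 0},
]

kept = [m["shortName"] for m in messages if matches_filter(m, afilter)]
print(kept)  # ['2t', '10u']
```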

As far as I can tell, it uses eccodes, so the variable names don't follow the wgrib2 convention that Herbie users are familiar with.
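
One way to bridge the two conventions would be a small lookup table. The sketch below is hand-written for illustration and is not a Herbie or kerchunk feature. `ECCODES_TO_WGRIB2` and `wgrib2_label` are hypothetical names, and the three entries are the pairings I would expect for 2 m temperature and 10 m wind, which should be checked against real output from both tools.

```python
from typing import Optional

# Hypothetical, hand-written mapping: (eccodes shortName, typeOfLevel, level)
# to a wgrib2-style "VAR:level" label. Verify entries against real files.
ECCODES_TO_WGRIB2 = {
    ("2t", "heightAboveGround", 2): "TMP:2 m above ground",
    ("10u", "heightAboveGround", 10): "UGRD:10 m above ground",
    ("10v", "heightAboveGround", 10): "VGRD:10 m above ground",
}


def wgrib2_label(short_name: str, type_of_level: str, level: int) -> Optional[str]:
    """Look up a wgrib2-style label, or return None if the key is not in the table."""
    return ECCODES_TO_WGRIB2.get((short_name, type_of_level, level))


print(wgrib2_label("2t", "heightAboveGround", 2))  # TMP:2 m above ground
```

Returning `None` for unknown keys keeps the gap visible instead of guessing a name.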