Open blaylockbk opened 1 year ago
Kerchunk has the ability to "scan" a file and determine the byte ranges for each message without downloading the file.
I was able to get the byte ranges for a GRIB2 file without downloading it:
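import fsspec
from kerchunk.grib2 import scan_grib
import pandas as pd

# Optional filter: restrict the scan to the 2 m and 10 m height-above-ground fields
afilter = {"typeOfLevel": "heightAboveGround", "level": [2, 10]}
so = {"anon": True}  # anonymous access to the public bucket

idx = scan_grib(
    "s3://noaa-hrrr-bdp-pds/hrrr.20230630/conus/hrrr.t00z.wrfsfcf01.grib2",
    storage_options=so,
    # filter=afilter,
)

# Each reference has the form [url, offset, length]; keep the offset and length
df = pd.DataFrame(
    [i["refs"]["latitude/0.0"][1:] for i in idx], columns=["startByte", "bytes"]
)
# End offset (exclusive) of each message; also correct when a filter skips messages
df["endByte"] = df["startByte"] + df["bytes"]
df["varName"] = [list(i["refs"].keys())[3].split("/")[0] for i in idx]
df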
However, scanning the full file took ~4 minutes! If you use the filters you can cut this down to about 20 seconds.
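Once the `startByte` and `bytes` columns are known, a single message can in principle be fetched with a ranged read instead of a full download. The sketch below is only an illustration: it uses an in-memory buffer as a stand-in for the remote file, and the helper names `http_range_header` and `read_message` are hypothetical, not part of kerchunk or Herbie.

```python
import io


def http_range_header(start: int, length: int) -> str:
    """Build an HTTP Range header value for `length` bytes beginning at `start`."""
    # HTTP byte ranges are inclusive on both ends, hence the -1.
    return f"bytes={start}-{start + length - 1}"


def read_message(fileobj, start: int, length: int) -> bytes:
    """Read one message from a seekable file-like object."""
    fileobj.seek(start)
    return fileobj.read(length)


# Stand-in for a remote GRIB2 file: three fake "messages" stored back to back.
fake_file = io.BytesIO(b"AAAA" + b"BBBBBB" + b"CC")

# (startByte, bytes) pairs, as they would appear in the DataFrame columns.
table = [(0, 4), (4, 6), (10, 2)]

# Fetch only the second message.
second = read_message(fake_file, *table[1])
header = http_range_header(*table[1])
```

Against a real remote file, the same offsets could be passed to a ranged request (for example through the `start` and `end` arguments of an fsspec filesystem's `cat_file`), assuming the server supports range reads.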
As far as I can tell, it uses eccodes, so the naming convention doesn't follow wgrib2, which Herbie users are familiar with.
I keep hearing about kerchunk. I'm sure people smarter than me are working on it, and it could provide efficient data access. Perhaps Herbie (and GOES-2-go) could benefit from using it.
https://github.com/fsspec/kerchunk