JackKelly / hypergrib

Lazily open petabyte-scale GRIB datasets consisting of trillions of GRIB messages from xarray
MIT License

FEATURE: Parse GRIB `.idx` files #11

Closed · JackKelly closed this 3 weeks ago

JackKelly commented 2 months ago

.idx files for GEFS have this form:

<message number>:<byte_offset>:d=<init date in YYYYMMDDHH>:<variable>:<vertical level>:<forecast step>:<ensemble member>

For example:

jack@jack-NUC:~/dev/rust/hypergrib$ head gec00.t00z.pgrb2af000.idx 
1:0:d=2017010100:HGT:10 mb:anl:ENS=low-res ctl
2:50487:d=2017010100:TMP:10 mb:anl:ENS=low-res ctl
3:70653:d=2017010100:RH:10 mb:anl:ENS=low-res ctl
4:81565:d=2017010100:UGRD:10 mb:anl:ENS=low-res ctl
5:104906:d=2017010100:VGRD:10 mb:anl:ENS=low-res ctl
6:125690:d=2017010100:HGT:50 mb:anl:ENS=low-res ctl
7:184420:d=2017010100:TMP:50 mb:anl:ENS=low-res ctl
8:208654:d=2017010100:RH:50 mb:anl:ENS=low-res ctl
9:232073:d=2017010100:UGRD:50 mb:anl:ENS=low-res ctl
10:281494:d=2017010100:VGRD:50 mb:anl:ENS=low-res ctl
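A minimal Rust sketch of parsing one of these lines (the struct and field names here are illustrative, not hypergrib's actual types):

```rust
/// One record of a GEFS `.idx` file (field names are illustrative).
struct IdxRecord {
    message_number: u32,
    byte_offset: u64,
    init_time: String,       // e.g. "2017010100" (YYYYMMDDHH)
    variable: String,        // e.g. "TMP"
    vertical_level: String,  // e.g. "10 mb"
    step: String,            // e.g. "anl"
    ensemble_member: String, // e.g. "ENS=low-res ctl"
}

fn parse_idx_line(line: &str) -> Option<IdxRecord> {
    // Split into at most 7 fields so any ':' inside the final field survives.
    let mut parts = line.splitn(7, ':');
    Some(IdxRecord {
        message_number: parts.next()?.parse().ok()?,
        byte_offset: parts.next()?.parse().ok()?,
        init_time: parts.next()?.strip_prefix("d=")?.to_string(),
        variable: parts.next()?.to_string(),
        vertical_level: parts.next()?.to_string(),
        step: parts.next()?.to_string(),
        ensemble_member: parts.next()?.to_string(),
    })
}
```

Note that the `.idx` file only records the start offset of each message; the byte range of message N ends at the offset of message N+1 (or at the end of the GRIB file for the last message).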
JackKelly commented 2 months ago

941672474806491952414ac1468afdd391970182

JackKelly commented 2 months ago

Next steps:

JackKelly commented 2 months ago

Actually, I might build my own to start with, because I'm not planning to use chunks to start with.

Can I use BTreeSets? One set per dimension, where each set contains refs to the message structs, but each set is sorted by a different key function?

Or maybe BTreeMaps, one for each dim?

JackKelly commented 2 months ago

Some thoughts on storing the manifest in memory:

Option 1: Multiple `BTreeMap<dimension_coord_type, BTreeSet<&GribMessage>>`s, one per dimension.

For example, the BTreeMap for the init_time dimension might look like this:

(2020-01-01T00, {refs to all grib messages with this init time})
(2020-01-01T06, {refs to all grib messages with this init time})

To find the GRIB messages matching a query, we'd look up the requested coordinates in each dimension's BTreeMap to get one set of refs per dimension, then take the intersection of those sets.

But this requires an enormous amount of duplication: every GRIB message gets a ref in every dimension's map, and the sets are huge. For example, the key "step 0" maps to refs to every GRIB message at step 0 (across all init times, variables, levels and ensemble members); likewise "step 1", and so on.
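A rough sketch of Option 1 (using message indices rather than `&GribMessage` refs, to sidestep lifetimes and the `Ord` bound on set elements; the types are illustrative):

```rust
use std::collections::{BTreeMap, BTreeSet};

// Illustrative only: messages live in a Vec, and each per-dimension map holds
// indices into that Vec instead of `&GribMessage` refs.
struct Manifest {
    messages: Vec<String>, // stand-in for a real GribMessage struct
    by_init_time: BTreeMap<String, BTreeSet<usize>>,
    by_step: BTreeMap<String, BTreeSet<usize>>,
    // ...one BTreeMap per remaining dimension (variable, level, member)...
}

impl Manifest {
    /// Find the messages matching one init_time and one step by intersecting
    /// the index sets looked up in each per-dimension map.
    fn query(&self, init_time: &str, step: &str) -> BTreeSet<usize> {
        let a = self.by_init_time.get(init_time).cloned().unwrap_or_default();
        let b = self.by_step.get(step).cloned().unwrap_or_default();
        a.intersection(&b).copied().collect()
    }
}
```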

Option 2: Hierarchy

Basically mimic the directory hierarchy that's usually used to store NWPs. Something like:

init_time / step / variable / vertical_level / ensemble_member

But this will require lots of loops, I think?

Option 3: Use DuckDB :slightly_smiling_face:

This probably requires the least code for me to write. So perhaps this is the most appropriate for the MVP?

Maybe write an extension for DuckDB so DuckDB can directly ingest .idx files?! Although the primary language for DuckDB extensions is C++ (see the extension-template), there is ongoing work on writing extensions in Rust, and it sounds just about possible today, but it might be best to wait. TL;DR: I probably shouldn't write a DuckDB extension for my first pass!

JackKelly commented 1 month ago

I'm using DuckDB! Very impressed so far!

Next step: Work through the TODOs in crate/hypergrib_manifest/src/lib.rs
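For illustration, a rough sketch (not hypergrib's actual code) of loading one `.idx` file into an in-memory DuckDB table via the `duckdb` crate; the column names are mine, and DuckDB's `read_csv` is told to split each line on `:`:

```rust
use duckdb::Connection;

// Sketch: ingest a GEFS `.idx` file into DuckDB. Column names are illustrative.
fn load_idx(conn: &Connection, path: &str) -> duckdb::Result<()> {
    conn.execute_batch(&format!(
        "CREATE TABLE idx AS
         SELECT * FROM read_csv(
             '{path}',
             delim = ':',
             header = false,
             columns = {{
                 'msg_num': 'INTEGER',
                 'byte_offset': 'BIGINT',
                 'init_time': 'VARCHAR',
                 'variable': 'VARCHAR',
                 'vertical_level': 'VARCHAR',
                 'step': 'VARCHAR',
                 'ensemble_member': 'VARCHAR'
             }}
         );"
    ))
}

fn main() -> duckdb::Result<()> {
    let conn = Connection::open_in_memory()?;
    load_idx(&conn, "gec00.t00z.pgrb2af000.idx")
}
```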

JackKelly commented 1 month ago

On reflection, I think I might go back to my original idea of manually writing functions to map from requested index ranges to GRIB messages.

Which probably requires a tree of BTreeMaps, similar to a directory hierarchy.

And some good error reporting for when the mapping fails
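Something like this, perhaps (a sketch: the dimension order and leaf type are assumptions, not the actual design):

```rust
use std::collections::BTreeMap;

// Illustrative "tree of BTreeMaps" mirroring the usual
// init_time / step / variable / vertical_level / ensemble_member hierarchy.
struct MessageLocation {
    byte_offset: u64, // length can be derived from the next message's offset
}

type ByMember = BTreeMap<String, MessageLocation>;
type ByLevel = BTreeMap<String, ByMember>;
type ByVariable = BTreeMap<String, ByLevel>;
type ByStep = BTreeMap<String, ByVariable>;
type Manifest = BTreeMap<String, ByStep>; // keyed by init_time

/// Walk the tree one level per dimension. Returning `None` as soon as a
/// coordinate is missing gives the caller a hook for error reporting.
fn lookup<'a>(
    manifest: &'a Manifest,
    init_time: &str,
    step: &str,
    variable: &str,
    level: &str,
    member: &str,
) -> Option<&'a MessageLocation> {
    manifest
        .get(init_time)?
        .get(step)?
        .get(variable)?
        .get(level)?
        .get(member)
}
```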

JackKelly commented 1 month ago

Next tasks:

JackKelly commented 1 month ago

Rust's HashMap is now based on hashbrown, which is very fast. So maybe I should use a HashMap instead of a BTreeMap? It might be faster, and I don't actually need the ordering that a BTreeMap maintains.

JackKelly commented 1 month ago

New plan: no HashMap at all. Instead, we'll algorithmically compute the paths of the .idx files and load them on demand. See https://github.com/JackKelly/hypergrib/discussions/14#discussioncomment-10774330 and also the new design.md in commit a89d30c79989c206b63313f1e0bff270f6ec17e2
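For illustration, a sketch of computing the filename seen earlier in this thread (`gec00.t00z.pgrb2af000.idx`) from coordinates. The naming scheme varies between datasets and archive eras, so treat this as an assumption rather than hypergrib's actual path logic:

```rust
// Sketch: build an old-style GEFS `.idx` filename from coordinates.
// Member 0 is the control run ("gec00"); members 1..=20 are perturbed ("gepNN").
fn gefs_idx_filename(member: u8, init_hour: u8, step_hours: u16) -> String {
    let member_str = if member == 0 {
        "gec00".to_string()
    } else {
        format!("gep{member:02}")
    };
    format!("{member_str}.t{init_hour:02}z.pgrb2af{step_hours:03}.idx")
}

fn main() {
    assert_eq!(gefs_idx_filename(0, 0, 0), "gec00.t00z.pgrb2af000.idx");
    assert_eq!(gefs_idx_filename(1, 6, 240), "gep01.t06z.pgrb2af240.idx");
}
```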

JackKelly commented 3 weeks ago

I was planning to parse the parameter abbreviation strings (e.g. "TMP") into gribberish enums (see https://github.com/mpiannucci/gribberish/pull/62). But implementing a clean way to map from the abbrev string to any parameter type was proving slightly tricky.

So, for the hypergrib MVP, I won't bother decoding the abbrev strings; I'll just use the abbrev strings to refer to the parameter. Specifically: the coordinate labels passed to xarray will just be the abbrev strings, with no additional metadata about the parameter.

Further down the line, we should definitely give the user more information about each parameter. We could use the GRIB2 tables recorded as .csv files in gdal. Perhaps this could be implemented in Python.

There is the issue that some .idx files (like HRRR's) use parameter "abbreviations" like `var discipline=0 center=7 local_table=1 parmcat=16 parm=201`. That's OK for now, because it will just be another string, but we should definitely decode it for the user.

For the MVP, I'll also not decode the vertical level or the ensemble member. In a future version we'll decode these.

For the MVP we will decode the step.
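A rough sketch of decoding the step strings (real `.idx` files contain more forms, e.g. accumulations and averages over ranges like "0-6 hour ave fcst", which this ignores):

```rust
use std::time::Duration;

// Sketch: decode simple forecast-step strings from `.idx` files.
fn parse_step(step: &str) -> Option<Duration> {
    if step == "anl" {
        // "anl" = analysis, i.e. step 0.
        return Some(Duration::from_secs(0));
    }
    // e.g. "6 hour fcst" -> 6 hours.
    let hours: u64 = step.strip_suffix(" hour fcst")?.trim().parse().ok()?;
    Some(Duration::from_secs(hours * 3600))
}

fn main() {
    assert_eq!(parse_step("anl"), Some(Duration::from_secs(0)));
    assert_eq!(parse_step("6 hour fcst"), Some(Duration::from_secs(6 * 3600)));
    assert_eq!(parse_step("0-6 hour ave fcst"), None); // not handled by this sketch
}
```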

JackKelly commented 3 weeks ago

I'm gonna close this issue and start more focused issues.