Closed JackKelly closed 3 weeks ago
941672474806491952414ac1468afdd391970182
Next steps:
keys
).
Actually, I might build my own to start with, because I'm not planning to use chunks initially.
Can I use BTreeSets? One set per dimension. Each set would contain a ref to the message struct, but each set would use a different function to sort its elements?
Or maybe BTreeMaps, one for each dim?
Some thoughts on storing the manifest in memory:
Several BTreeMap<dimension_coord_type, BTreeSet<&GribMessage>> maps, one per dimension. For example, the BTreeMap for the init_time dimension might look like this:
(2020-01-01T00, {refs to all grib messages with this init time})
(2020-01-01T06, {refs to all grib messages with this init time})
To find the appropriate set of grib messages for a query, we'd loop through the BTreeMap for each dim, to get a set of refs to all grib messages, and find the intersection between the sets.
But this requires an enormous amount of duplication. For example, every grib message has every "step" (so "step 0" will map to a set of all grib messages; as will "step 1", etc.).
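The per-dimension idea above could be sketched roughly as follows. This is only an illustration: the `GribMessage` fields, the `Manifest` struct, and the `select` method are hypothetical names invented for the sketch, and only two dimensions are indexed.

```rust
use std::collections::{BTreeMap, BTreeSet};

/// Hypothetical stand-in for a GRIB message: just the coords we index on.
#[derive(Debug, PartialEq, Eq, PartialOrd, Ord)]
pub struct GribMessage {
    pub init_time: &'static str, // e.g. "2020-01-01T00"
    pub step_hours: u32,
    pub byte_offset: u64,
}

/// One index per dimension: coordinate label -> set of refs to messages.
pub struct Manifest<'a> {
    pub by_init_time: BTreeMap<&'a str, BTreeSet<&'a GribMessage>>,
    pub by_step: BTreeMap<u32, BTreeSet<&'a GribMessage>>,
}

impl<'a> Manifest<'a> {
    /// Build both indexes from a slice of messages owned elsewhere.
    pub fn new(messages: &'a [GribMessage]) -> Self {
        let mut by_init_time: BTreeMap<&str, BTreeSet<&GribMessage>> = BTreeMap::new();
        let mut by_step: BTreeMap<u32, BTreeSet<&GribMessage>> = BTreeMap::new();
        for msg in messages {
            // Every message is inserted once into *each* dimension's map.
            by_init_time.entry(msg.init_time).or_default().insert(msg);
            by_step.entry(msg.step_hours).or_default().insert(msg);
        }
        Self { by_init_time, by_step }
    }

    /// Messages matching both coords: intersect the per-dimension sets.
    pub fn select(&self, init_time: &str, step_hours: u32) -> Vec<&'a GribMessage> {
        match (self.by_init_time.get(init_time), self.by_step.get(&step_hours)) {
            (Some(a), Some(b)) => a.intersection(b).copied().collect(),
            _ => Vec::new(), // at least one coord label is unknown
        }
    }
}
```

Note that a ref to every message is stored once per dimension, so with five dimensions the manifest would hold five refs per message. That is the duplication described above.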
Basically mimic the directory hierarchy that's usually used to store NWPs. Something like:
init_time / step / variable / vertical_level / ensemble_member
But this will require lots of loops, I think?
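To get a feel for how many loops the hierarchy needs, here is a minimal sketch of a tree of nested BTreeMaps. It assumes init times are stored as YYYYMMDDHH integers, and only nests three of the five levels to stay short. The type aliases, `MessageLocation`, `insert`, and `select` are all hypothetical names for illustration.

```rust
use std::collections::BTreeMap;
use std::ops::RangeInclusive;

/// Hypothetical leaf payload: where a GRIB message lives within a file.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub struct MessageLocation {
    pub byte_offset: u64,
}

// One level of nesting per dimension, in the same order as the directory
// hierarchy: init_time / step / variable.
pub type ByVariable = BTreeMap<String, MessageLocation>;
pub type ByStep = BTreeMap<u32, ByVariable>;
pub type ByInitTime = BTreeMap<u32, ByStep>; // key assumed to be YYYYMMDDHH

/// Insert one message, creating intermediate levels as needed.
pub fn insert(tree: &mut ByInitTime, init_time: u32, step: u32, variable: &str, loc: MessageLocation) {
    tree.entry(init_time)
        .or_default()
        .entry(step)
        .or_default()
        .insert(variable.to_string(), loc);
}

/// Collect every message for `variable` whose init time and step fall in
/// the given inclusive ranges. One loop per ranged dimension.
pub fn select(
    tree: &ByInitTime,
    init_times: RangeInclusive<u32>,
    steps: RangeInclusive<u32>,
    variable: &str,
) -> Vec<MessageLocation> {
    let mut found = Vec::new();
    for (_init_time, by_step) in tree.range(init_times) {
        for (_step, by_variable) in by_step.range(steps.clone()) {
            if let Some(loc) = by_variable.get(variable) {
                found.push(*loc);
            }
        }
    }
    found
}
```

In this sketch each ranged dimension costs one nested loop, but no message is stored more than once. `BTreeMap::range` panics if the range start is greater than its end, so real code would want to validate the query first.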
This probably requires the least code for me to write. So perhaps this is the most appropriate for the MVP?
Maybe write an extension for DuckDB so DuckDB can directly ingest .idx files?! Although the primary language for DuckDB extensions is C++ (see the extension-template). But there is work ongoing to write extensions in Rust. It sounds like it is just about possible to write extensions today in Rust. But it might be best to wait. TL;DR: I probably shouldn't write a DuckDB extension for my first pass!
I'm using DuckDB! Very impressed so far!
Next step: Work through the TODOs in crate/hypergrib_manifest/src/lib.rs
On reflection, I think I might go back to my original idea of manually writing functions to map from requested index ranges to GRIB messages.
Which probably requires a tree of BTreeMaps, similar to a directory hierarchy.
It will also need some good error reporting for when the mapping fails.
Next tasks:
GefsKey::try_from<Path>
GefsKey::to_path
struct GefsCoordLabels
Dataset<K, C>
Rust's HashMap is now based on hashbrown, which is very fast. So maybe I should use a HashMap instead of a BTreeMap? Might be faster and I don't have to worry about ordering.
New plan: No HashMap. Instead, we'll algorithmically compute the path of the .idx files and load the .idx files on demand. See https://github.com/JackKelly/hypergrib/discussions/14#discussioncomment-10774330 and also see the new design.md in commit a89d30c79989c206b63313f1e0bff270f6ec17e2
I was planning to parse the parameter abbreviation strings (e.g. "TMP") into gribberish enums (see https://github.com/mpiannucci/gribberish/pull/62). But implementing a clean way to map from the abbrev string to any parameter type was proving slightly tricky.
So, for the hypergrib MVP, I won't bother decoding the abbrev strings. Instead I'll just use the abbrev strings to refer to the parameter. Specifically: the coordinate labels passed to xarray will just be the abbrev strings, with no additional metadata about the parameter.
Further down the line, we should definitely give the user more information about each parameter. We could use the GRIB2 tables recorded as .csv files in gdal. Perhaps this could be implemented in Python.
There is the issue that some .idx files (like HRRR) use parameter "abbreviations" like var discipline=0 center=7 local_table=1 parmcat=16 parm=201. That's OK for now because that will just be another string. But we definitely should decode that for the user.
For the MVP, I'll also not decode the vertical level or the ensemble member. In a future version we'll decode these.
For the MVP we will decode the step.
I'm gonna close this issue and start more focused issues
.idx files for GEFS have this form: <message number>:<byte_offset>:d=<init date in YYYYMMDDHH>:<variable>:<vertical level>:<forecast step>:<ensemble member>
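A line in that layout could be split along these lines. This is a rough sketch, not hypergrib's actual parser: `IdxLine`, `next_field`, and `parse_idx_line` are hypothetical names, and it assumes exactly the seven colon-separated fields listed above, with anything after the sixth colon treated as the ensemble member.

```rust
/// Fields of one .idx line. Only the two numeric fields are parsed;
/// everything else is kept as an undecoded string slice.
#[derive(Debug, PartialEq, Eq)]
pub struct IdxLine<'a> {
    pub message_number: u32,
    pub byte_offset: u64,
    pub init_date: &'a str, // YYYYMMDDHH, with the "d=" prefix removed
    pub variable: &'a str,
    pub vertical_level: &'a str,
    pub forecast_step: &'a str,
    pub ensemble_member: &'a str,
}

/// Pull the next field off the iterator, naming it in the error if absent.
fn next_field<'a>(fields: &mut impl Iterator<Item = &'a str>, name: &str) -> Result<&'a str, String> {
    fields.next().ok_or_else(|| format!("missing field: {}", name))
}

/// Parse a single .idx line, reporting which field was missing or malformed.
pub fn parse_idx_line(line: &str) -> Result<IdxLine<'_>, String> {
    // At most 7 pieces: the final piece keeps any remaining colons.
    let mut fields = line.trim_end().splitn(7, ':');
    let message_number = next_field(&mut fields, "message number")?
        .parse::<u32>()
        .map_err(|e| format!("bad message number: {}", e))?;
    let byte_offset = next_field(&mut fields, "byte offset")?
        .parse::<u64>()
        .map_err(|e| format!("bad byte offset: {}", e))?;
    let init_date = next_field(&mut fields, "init date")?
        .strip_prefix("d=")
        .ok_or_else(|| "init date does not start with \"d=\"".to_string())?;
    Ok(IdxLine {
        message_number,
        byte_offset,
        init_date,
        variable: next_field(&mut fields, "variable")?,
        vertical_level: next_field(&mut fields, "vertical level")?,
        forecast_step: next_field(&mut fields, "forecast step")?,
        ensemble_member: next_field(&mut fields, "ensemble member")?,
    })
}
```

Keeping the variable, vertical level, and ensemble member as borrowed strings matches the MVP plan of not decoding them. The forecast step is also left as a string here; decoding it would be a separate step.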
For example: