blaylockbk / Herbie

Download numerical weather prediction datasets (HRRR, RAP, GFS, IFS, etc.) from NOMADS, NODD partners (Amazon, Google, Microsoft), ECMWF open data, and the University of Utah Pando Archive System.
https://herbie.readthedocs.io/
MIT License
466 stars 74 forks source link

Proposal to add indexing subsystem #143

Open amotl opened 1 year ago

amotl commented 1 year ago

About

After a few other attempts elsewhere, we started working on an indexer for NWP data once again at ^1. We think it will help for different slicing and querying tasks requested by the community, some of them referenced below. The indexer is based on the excellent Caterva and ironArray libraries, which enable efficient slicing on all dimensions, along with several other features that are ideal in this scenario. More details are described in the next section.

graph TD

    H[Herbie loader]
    Z[Zarr loader]
    H -->IDX
    Z -->IDX

    IDX[NWP indexer]
    Q(Query by all coordinates/dimensions: time, lat, lon)
    R[Xarray result]

    IDX -->Q
    Q -->R

Technologies

There is a newish file format called Zarr that solves the problem of selecting smaller regions of interest without downloading the full file, by chunking the data.

Caterva is a container for multidimensional data that is specially designed to read datasets slices very efficiently. Like other libraries like Zarr, HDF5, or TileDB, Caterva stores the data into multidimensional chunks. Sitting on top of Caterva, ironArray is a C library oriented to compute and handle large multidimensional arrays efficiently.

Synopsis

Setup

git switch indexing
pip install --editable=.[indexing]

Alternative.

pip install --editable='git+https://github.com/earthobservations/Herbie@indexing[indexing]'

Usage

Create index

TIMERANGE = np.arange(
    start=np.datetime64("1987-10-01 08:00"),
    stop=np.datetime64("1987-10-01 10:59"),
    step=datetime.timedelta(hours=1),
)

era5_temp2m = NwpIndex(name="air_temperature_at_2_metres")
era5_temp2m.save(dataset=open_era5_zarr(TEMP2M, 1987, 10, TIMERANGE[0], TIMERANGE[-1]))

Query index

era5_temp2m = NwpIndex(name="air_temperature_at_2_metres")
era5_temp2m.load()

# Single forecasted temperature value at given time and geopoint, in Monterey, in Fahrenheit.
record = (
    nwp.query(time="1987-10-01 08:00", lat=36.6083, lon=-121.8674)
    .kelvin_to_fahrenheit()
    .data
)
# Verify value.
assert record.values == np.array(73.805008, dtype=np.float32)

# Temperatures in Berlin area, within a given time range and bounding box, in Celsius.
ds = (
    era5_temp2m.query(
        time=pd.date_range(
            start="1987-10-01 08:00", end="1987-10-01 09:00", freq="H"
        ),
        location=BBox(lon1=13.000, lat1=52.700, lon2=13.600, lat2=52.300),
    )
    .kelvin_to_celsius()
    .ds
)
# Verify shape.
assert ds["air_temperature_at_2_metres"].shape == (2, 3, 3)

Thoughts

We would like to gather early feedback from you and the community if this feature would be welcome. In the spirit of Herbie's history and disclaimer, we would be honored to contribute this as an indexing subsystem. Please let us know what you think about the patch, and if you also believe it would fit well into the code base, to enable a whole stack of new possibilities for Herbie.

References

Implementation

The query() method of herbie.index.core fame, filtering on all dimensions (time, lat, lon). https://github.com/blaylockbk/Herbie/blob/4fd07a52f40c8c217d9e6de897aa6848d4e33b88/herbie/index/core.py#L101-L125

blaylockbk commented 1 year ago

I appreciate the contribution. Zarr access is something I've wanted to add to Herbie (especially abstracting access to the HRRR Zarr dataset^1). This is a bit to digest at the moment. When the docs are migrated to RTD and restored, I'll start to look at this in more depth. I certainly think Herbie would be a good home to add access to more types of data that are available.