Proposal to add indexing subsystem

About

After a few other attempts elsewhere, we started working on an indexer for NWP data once again at ^1. We think it will help for different slicing and querying tasks requested by the community, some of them referenced below. The indexer is based on the excellent Caterva and ironArray libraries, which enable efficient slicing on all dimensions, along with several other features that are ideal in this scenario. More details are described in the next section.

graph TD

    H[Herbie loader]
    Z[Zarr loader]
    H -->IDX
    Z -->IDX

    IDX[NWP indexer]
    Q(Query by all coordinates/dimensions: time, lat, lon)
    R[Xarray result]

    IDX -->Q
    Q -->R

Technologies

There is a newish file format called Zarr that solves the problem of selecting smaller regions of interest without downloading the full file, by chunking the data.

Caterva is a container for multidimensional data that is specially designed to read datasets slices very efficiently. Like other libraries like Zarr, HDF5, or TileDB, Caterva stores the data into multidimensional chunks. Sitting on top of Caterva, ironArray is a C library oriented to compute and handle large multidimensional arrays efficiently.

Synopsis

Setup

git switch indexing
pip install --editable=.[indexing]

Alternative.

pip install --editable='git+https://github.com/earthobservations/Herbie@indexing[indexing]'

Usage

Create index

TIMERANGE = np.arange(
    start=np.datetime64("1987-10-01 08:00"),
    stop=np.datetime64("1987-10-01 10:59"),
    step=datetime.timedelta(hours=1),
)

era5_temp2m = NwpIndex(name="air_temperature_at_2_metres")
era5_temp2m.save(dataset=open_era5_zarr(TEMP2M, 1987, 10, TIMERANGE[0], TIMERANGE[-1]))

Query index

era5_temp2m = NwpIndex(name="air_temperature_at_2_metres")
era5_temp2m.load()

# Single forecasted temperature value at given time and geopoint, in Monterey, in Fahrenheit.
record = (
    nwp.query(time="1987-10-01 08:00", lat=36.6083, lon=-121.8674)
    .kelvin_to_fahrenheit()
    .data
)
# Verify value.
assert record.values == np.array(73.805008, dtype=np.float32)

# Temperatures in Berlin area, within a given time range and bounding box, in Celsius.
ds = (
    era5_temp2m.query(
        time=pd.date_range(
            start="1987-10-01 08:00", end="1987-10-01 09:00", freq="H"
        ),
        location=BBox(lon1=13.000, lat1=52.700, lon2=13.600, lat2=52.300),
    )
    .kelvin_to_celsius()
    .ds
)
# Verify shape.
assert ds["air_temperature_at_2_metres"].shape == (2, 3, 3)

Thoughts

We would like to gather early feedback from you and the community if this feature would be welcome. In the spirit of Herbie's history and disclaimer, we would be honored to contribute this as an indexing subsystem. Please let us know what you think about the patch, and if you also believe it would fit well into the code base, to enable a whole stack of new possibilities for Herbie.

References

GH-6
GH-12
GH-85
GH-92
GH-123
GH-126

Implementation

The query() method of herbie.index.core fame, filtering on all dimensions (time, lat, lon). https://github.com/blaylockbk/Herbie/blob/4fd07a52f40c8c217d9e6de897aa6848d4e33b88/herbie/index/core.py#L101-L125

blaylockbk / Herbie