NickGeneva commented 9 months ago

Earth-2 MIP Pull Request

Description

Data sources will pipe from time, channel -> xarray DataArray, which will then be converted to a tensor and metadata for pipelines.

Big refactor of initial conditions / data sources.

Creating a lightweight protocol for data sources to improve integration of custom data sources
Improving GFS to updated API w/ caching + Pangu support!
Adding local xarray data source for netcdf files
Adding first pass of lexicon for e2mip channel ids
Adding updated CDS data source with cleaned up implementation
Adding updated IFS using ECMWF's package
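
As a rough illustration of what a lightweight data-source protocol could look like, here is a minimal sketch. The names `DataSource` and `ConstantSource` are hypothetical and not taken from this PR, and the toy source returns a plain dict instead of an xarray DataArray to keep the example dependency-free.

```python
from datetime import datetime
from typing import Any, List, Protocol, runtime_checkable


@runtime_checkable
class DataSource(Protocol):
    """Hypothetical protocol: any callable taking (time, channel_names)."""

    def __call__(self, time: datetime, channel_names: List[str]) -> Any:
        ...


class ConstantSource:
    """Toy custom source: one 2x3 field per channel, filled with a constant."""

    def __init__(self, value: float) -> None:
        self.value = value

    def __call__(self, time: datetime, channel_names: List[str]) -> dict:
        # Build an independent 2x3 field for every requested channel.
        data = [[[self.value] * 3 for _ in range(2)] for _ in channel_names]
        return {"time": time, "channels": list(channel_names), "data": data}


# A custom source satisfies the protocol without inheriting from it.
source = ConstantSource(1.0)
out = source(datetime(2023, 1, 1), ["z500", "t850"])
```

Because the protocol is structural, a custom source only needs a matching `__call__`; no base class or registration step is assumed.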

What I won't do:

Multi-processing. I'm setting things up to hopefully allow MP downloads for remote stores, but that's for another PR.
Distributed safeguards. These are out of scope for data sources; that logic should be handled in a util function or wrapper.

Closes: https://github.com/NVIDIA/earth2mip/issues/127
Closes: https://github.com/NVIDIA/earth2mip/issues/131

Checklist

[x] I am familiar with the Contributing Guidelines.
[x] New or existing tests cover these changes.
[ ] The documentation is up to date with these changes.
[x] The CHANGELOG.md is up to date with these changes.
[x] An issue is linked to this pull request.

Dependencies

NickGeneva commented 9 months ago

/blossom-ci

nbren12 commented 9 months ago

I would be interested to understand some more of the motivation for this. In particular because it rolls back a conscious decision to avoid cfgrib and xarray across the data sources more broadly. See https://github.com/NVIDIA/earth2mip/pull/64.

Also, the main "lexicon" used in the code base was the ECMWF grib code table: https://codes.ecmwf.int/grib/param-db/. See https://github.com/NVIDIA/earth2mip/blob/a17fd31ae15b83a052c57c88eb30a153d2995415/earth2mip/initial_conditions/cds.py#L43. We didn't use this in the other data sources yet, but the numeric code is far less ambiguous than the short names. The conversion to/from our channel names is handled like this:

from earth2mip.initial_conditions import cds

code = cds.parse_channel('z500')
assert code.id == 129
assert code.level == 500
assert str(code) == 'z500'

This played into my choice to use the low level grib api in the cds.DataSource since it makes it trivial to extract the raw parameter ids directly from the grib data. The behavior of cfgrib in mapping param ID to name was less predictable, which is why I opted for eccodes.
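
To make the idea concrete, here is a minimal standalone sketch of how a channel name could be mapped to an ECMWF parameter ID plus a level. It is not the actual `cds.parse_channel` implementation; the `PressureLevelCode` class and the `PARAM_IDS` table are assumptions for illustration, with the IDs copied from the ECMWF parameter database.

```python
import re
from dataclasses import dataclass

# Small excerpt of the ECMWF parameter database (short name -> param ID).
PARAM_IDS = {"z": 129, "t": 130, "u": 131, "v": 132, "q": 133}
SHORT_NAMES = {param_id: name for name, param_id in PARAM_IDS.items()}


@dataclass(frozen=True)
class PressureLevelCode:
    """Hypothetical container for a parameter ID and a pressure level."""

    id: int
    level: int

    def __str__(self) -> str:
        # Round-trip back to the channel name, e.g. "z500".
        return f"{SHORT_NAMES[self.id]}{self.level}"


def parse_channel(channel: str) -> PressureLevelCode:
    # Split "z500" into the short name "z" and the level 500.
    match = re.fullmatch(r"([a-z]+)(\d+)", channel)
    if match is None or match.group(1) not in PARAM_IDS:
        raise ValueError(f"unrecognized channel: {channel!r}")
    return PressureLevelCode(PARAM_IDS[match.group(1)], int(match.group(2)))


code = parse_channel("z500")
```

Since the level is parsed from the name instead of being looked up, any numeric level works without adding an entry to a dictionary of strings.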

nbren12 commented 9 months ago

Another disadvantage is that the new lexicon approach doesn't support arbitrary levels, only the ones in the defined "lexicon". Also, I'm concerned the new CDS data source is much slower since it doesn't combine pressure levels. That was my main motivation for rewriting the cds.DataSource.

I do like the __call__ API which includes channel_names.

In summary, I would like to see the following changes before replacing the existing initial conditions:

No outputting xarrays.
Mostly revert the CDS implementation to initial_conditions.cds.DataSource. Sorry, but a lot of these changes are things I specifically undid in https://github.com/nbren12/earth2mip/commit/4b33f64cba2c4edf5ed67fe1dea69acfff4c84e8.
For lexicons, use cds.parse_channel and ECMWF parameter IDs instead of dictionaries of strings.

nbren12 commented 9 months ago

Also, this uses NumPy-style docstrings. I thought we decided to use Google style.

nbren12 commented 9 months ago

Assuming xarray is important (maybe someone asked for this), we could make a helper function or method like this:

import xarray

def get_dataarray_from_data_source(datasource, time, channel_names) -> xarray.DataArray:
    return xarray.DataArray(datasource(time, channel_names), dims=["channel", "lat", "lon"], coords={"lat": datasource.grid.lat, ...})
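
A runnable variant of the helper above might look like the following sketch. The `Grid` and `ZerosSource` classes are hypothetical stand-ins, and the `channel` and `lon` coordinates (including a `grid.lon` attribute) are assumed by analogy with `grid.lat`; the real data sources may expose something different.

```python
from datetime import datetime

import numpy as np
import xarray as xr


class Grid:
    """Hypothetical stand-in for a data source's grid attribute."""

    lat = np.linspace(90.0, -90.0, 5)
    lon = np.linspace(0.0, 360.0, 8, endpoint=False)


class ZerosSource:
    """Toy data source returning a (channel, lat, lon) array of zeros."""

    grid = Grid()

    def __call__(self, time, channel_names):
        shape = (len(channel_names), self.grid.lat.size, self.grid.lon.size)
        return np.zeros(shape)


def get_dataarray_from_data_source(datasource, time, channel_names) -> xr.DataArray:
    # Wrap the raw array with named dimensions and coordinates.
    return xr.DataArray(
        datasource(time, channel_names),
        dims=["channel", "lat", "lon"],
        coords={
            "channel": list(channel_names),
            "lat": datasource.grid.lat,
            "lon": datasource.grid.lon,  # assumed attribute
        },
    )


da = get_dataarray_from_data_source(
    ZerosSource(), datetime(2023, 1, 1), ["z500", "t850"]
)
```

Keeping the conversion in a helper like this would let the data sources keep returning plain arrays, with xarray staying an optional layer on top.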

NVIDIA / earth2mip

Initial Condition Refactor #132
