NVIDIA / earth2mip

Earth-2 Model Intercomparison Project (MIP) is a python framework that enables climate researchers and scientists to inter-compare AI models for weather and climate.
https://nvidia.github.io/earth2mip/
Apache License 2.0

Initial Condition Refactor #132

Closed NickGeneva closed 9 months ago

NickGeneva commented 9 months ago

Earth-2 MIP Pull Request

Description

Data sources will pipe from (time, channel) -> xarray DataArray, which will then be converted to a tensor plus metadata for pipelines.

Big refactor of initial conditions / data sources.

What I won't do:

Closes: https://github.com/NVIDIA/earth2mip/issues/127
Closes: https://github.com/NVIDIA/earth2mip/issues/131

Checklist

Dependencies

NickGeneva commented 9 months ago

/blossom-ci

nbren12 commented 9 months ago

I would be interested to understand some more of the motivation for this. In particular because it rolls back a conscious decision to avoid cfgrib and xarray across the data sources more broadly. See https://github.com/NVIDIA/earth2mip/pull/64.

Also, the main "lexicon" used in the code base was the ECMWF grib code table: https://codes.ecmwf.int/grib/param-db/. See https://github.com/NVIDIA/earth2mip/blob/a17fd31ae15b83a052c57c88eb30a153d2995415/earth2mip/initial_conditions/cds.py#L43. We didn't use this in the other data sources yet, but the numeric code is far less ambiguous than the short names. The conversion to/from our channel names is handled like this:

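from earth2mip.initial_conditions import cds

# Parse a channel name into its GRIB parameter ID and pressure level
code = cds.parse_channel('z500')
assert code.id == 129
assert code.level == 500
assert str(code) == 'z500'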

This played into my choice to use the low-level GRIB API in the cds.DataSource, since it makes it trivial to extract the raw parameter IDs directly from the GRIB data. The behavior of cfgrib in mapping param ID to name was less predictable, which is why I opted for eccodes.
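To make the round trip above concrete, here is a minimal sketch of how a channel name could be mapped to a numeric parameter ID and level. It is not the code in cds.py: the names `ParamCode`, `parse_channel_sketch`, and the two lookup tables are hypothetical, and the tables hold only a small subset of the ECMWF parameter database.

```python
import re
from dataclasses import dataclass
from typing import Optional

# Short name -> ECMWF parameter ID (small illustrative subset of the param-db).
PRESSURE_LEVEL_IDS = {"z": 129, "t": 130, "u": 131, "v": 132, "q": 133}
SINGLE_LEVEL_IDS = {"sp": 134, "tcwv": 137, "msl": 151}
ID_TO_NAME = {v: k for k, v in {**PRESSURE_LEVEL_IDS, **SINGLE_LEVEL_IDS}.items()}


@dataclass(frozen=True)
class ParamCode:
    """Hypothetical container for a parameter ID and an optional pressure level."""

    id: int
    level: Optional[int] = None

    def __str__(self) -> str:
        name = ID_TO_NAME[self.id]
        return name if self.level is None else f"{name}{self.level}"


def parse_channel_sketch(channel: str) -> ParamCode:
    """Parse names like 'z500' (variable + level in hPa) or 'msl' (single level)."""
    if channel in SINGLE_LEVEL_IDS:
        return ParamCode(SINGLE_LEVEL_IDS[channel])
    match = re.fullmatch(r"([a-z]+)(\d+)", channel)
    if match is None or match.group(1) not in PRESSURE_LEVEL_IDS:
        raise ValueError(f"Unrecognized channel name: {channel!r}")
    return ParamCode(PRESSURE_LEVEL_IDS[match.group(1)], int(match.group(2)))


code = parse_channel_sketch("z500")
```

Because the level is parsed from the name rather than looked up, any integer level is accepted in this sketch, not only levels listed in a table.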

nbren12 commented 9 months ago

Another disadvantage is that the new lexicon approach doesn't support arbitrary levels, only the ones in the defined "lexicon". Also, I'm concerned the new CDS data source is much slower, since it doesn't combine pressure levels. That was my main motivation for rewriting the cds.DataSource.
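For context on what "combining pressure levels" means, a rough sketch: group the requested channels by variable so that every level of a variable can go into one request instead of one request per channel. The function name `group_pressure_levels` is hypothetical and this is not the existing cds.DataSource logic; it assumes pressure-level channels are written as a short name followed by the level.

```python
import re
from collections import defaultdict


def group_pressure_levels(channel_names):
    """Split channels into {variable: [levels]} plus a list of single-level channels."""
    levels_by_variable = defaultdict(list)
    single_level = []
    for name in channel_names:
        # 'z500' -> ('z', '500'); names such as 't2m' or 'msl' do not match.
        match = re.fullmatch(r"([a-z]+)(\d+)", name)
        if match:
            levels_by_variable[match.group(1)].append(int(match.group(2)))
        else:
            single_level.append(name)
    return dict(levels_by_variable), single_level


by_variable, surface = group_pressure_levels(["z500", "z850", "t500", "t2m", "msl"])
```

Here the five channels reduce to two pressure-level groups (`z` and `t`) plus the single-level channels, which is the batching that keeps the number of requests down.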

I do like the __call__ API which includes channel_names.

In summary, I would like to see the following changes before replacing the existing initial conditions:

nbren12 commented 9 months ago

Also, this uses NumPy docstrings... I thought we decided to use Google style.

nbren12 commented 9 months ago

Assuming xarray is important (maybe someone asked for this), we could make a helper function or method like this:

import xarray
def get_dataarray_from_data_source(datasource, time, channel_names) -> xarray.DataArray:
    # Wrap the raw (channel, lat, lon) array from the data source in a labeled DataArray
    return xarray.DataArray(datasource(time, channel_names), dims=["channel", "lat", "lon"], coords={"lat": datasource.grid.lat, ...})