Refactor stacked climate dataset

prakhar6sharma commented 1 year ago

Issues fixed by this PR:

StackedClimateDataset is hard coded to work for Downscaling only because it has a different return signature than the ERA5 and ClimateDataset. Specifically, Downscaling assumes that the raw_data it would receive would be a list with 2 items, one for input and other for output. Thus limiting us with a single dataset for input and single dataset for output.
Imports under the data/ folder were all absolute imports.
Forecasting's create_copy() allows illegal values for some of it's attributes

Solution implemented:

All ClimateDataset has a new attribute called name. Thus, all variables are indexed by f"{dataset_name}:{variable}".
StackedClimateDataset now allows recursively stacking arbitrary ClimateDataset's while keeping the same return signature. Specifically, if it has two child datasets named "child1" and "child2" and it's own name is "parent", now the name of the children are changed to "parent:child1" and "parent:child2".
Because of the above change, the in_vars, out_vars, and constants of the Task should have the dataset name followed by a colon followed by the variable name. Example: ["2m_temperature", "geopotential"] to ["era5:2m_temperature", "era5:geopotential"].
Imports are changed under the data/ folder to follow relative paths.
Alphabetize imports.
Implemented create_copy() for Forecasting.

prakhar6sharma commented 1 year ago

To Do:

[x] Update the notebooks.

prakhar6sharma commented 1 year ago

I'm 50-50 on this, so you could convince me either way, but why does name need to be a parameter for ClimateDatasetArgs. For example, if I use ERA5Args, it just makes sense that the name is "era5". A compromise solution could be that the default name is "era5", but we still allow the user to specify a different name.

Why name need to be a parameter for ClimateDatasetArgs: Suppose we are doing downscaling. For the typical setup that we support, both high_res and low_res data are from ERA5. Thus keeping the name as "era5" for both of them would lead to ambiguity. Right now, there is a default name for every different ClimateDatasetArgs class, for ERA5Args that happen to be "era5".

aditya-grover / climate-learn

Refactor stacked climate dataset #103