Timh37 / CMIP6cex

Repository for the cloud-based analysis of changes in compound extremes in CMIP6 simulations.

Combining experiments/variables with different simulation lengths #5

Closed. Timh37 closed this issue 1 year ago.

Timh37 commented 1 year ago

Functionality is needed to combine simulations of models with unusual lengths, e.g., SSP experiments running past 2100, or historical experiments that are only provided from some year later than 1850 (as happens for, e.g., EC-Earth3).

ddict_merged = merge_variables(ddict)

in https://github.com/Timh37/CMIP6cf/blob/main/notebooks/get_CMIP6_gridded_around_tgs_xmip.ipynb currently drops these datasets with the following warning (example):

/srv/conda/envs/notebook/lib/python3.10/site-packages/xmip/postprocessing.py:157: UserWarning: ScenarioMIP.EC-Earth-Consortium.EC-Earth3.ssp585.r117i1p1f1.day.gr.none.psl failed to combine with :cannot align objects with join='exact' where index/labels/sizes are not equal along these coordinates (dimensions): 'time' ('time',)
  warnings.warn(f"{cmip6_dataset_id(ds)} failed to combine with :{e}")

# etc.
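For context, the underlying xarray behaviour can be reproduced with a minimal sketch (made-up data, not the actual CMIP6 datasets):

import numpy as np
import pandas as pd
import xarray as xr

# two runs of the same variable whose time axes differ in length,
# e.g. an SSP run extending past 2100 next to one stopping at 2100
time_long = pd.date_range('2015-01-01', periods=5, freq='YS')
ds_long = xr.Dataset({'psl': ('time', np.zeros(5))}, coords={'time': time_long})
ds_short = ds_long.isel(time=slice(0, 3))

# join='exact' refuses to align indexes of unequal size and raises the
# "cannot align objects with join='exact' ..." error quoted above
xr.align(ds_long, ds_short, join='exact')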
jbusecke commented 1 year ago

I have faced this before. I usually preselect the time with .sel(time=slice(None, '2100')) to get rid of the long-running members.

To accommodate 'late start' runs, you could try:

ddict_merged = merge_variables(ddict, merge_kwargs={'join':'outer'})

I think this should pad missing values with nan. What year do they usually start?
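What that padding would look like in plain xarray, as a minimal sketch with made-up data (assuming merge_variables forwards merge_kwargs to the underlying merge):

import numpy as np
import pandas as pd
import xarray as xr

time = pd.date_range('1850-01-01', periods=4, freq='YS')
ds_a = xr.Dataset({'psl': ('time', np.arange(4.0))}, coords={'time': time})
ds_b = xr.Dataset({'tas': ('time', np.arange(2.0))}, coords={'time': time[2:]})  # 'late start'

merged = xr.merge([ds_a, ds_b], join='outer')
print(merged['tas'].values)  # [nan nan 0. 1.] -- missing years padded with NaN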

Timh37 commented 1 year ago

I have implemented the preselection with:

def shorten_ssp_runs(ddict, end_year):
    """Truncate SSP experiments at end_year; leave other experiments unchanged."""
    ddict_shortened = {}  # build a new dict rather than mutating the input
    for k, v in ddict.items():
        if 'ssp' in k:
            ddict_shortened[k] = v.sel(time=slice(None, str(end_year)))
        else:
            ddict_shortened[k] = v
    return ddict_shortened
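For reference, the call then looks like this (with the 2100 cut-off suggested above):

ddict = shorten_ssp_runs(ddict, 2100)  # truncate SSP runs at the end of 2100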

I then found a remaining issue: for some variants, timesteps are missing that are present for other variants. While it is possible to pad these missing values with NaN, as you suggest,

ddict_merged = merge_variables(ddict, merge_kwargs={'join':'outer'})

doing so results in large chunks for the variants with many missing timesteps. For example, variant r111i1p1f1 of EC-Earth3:

(image: chunk layout of the dataset before merging)

becomes

(image: the same dataset with much larger time chunks)

after merging different members. Note that for EC-Earth3, each year is stored in a separate file on ESGF; some of these files are probably missing on Google Cloud. I have added these instances to #2.

Timh37 commented 1 year ago

As an additional issue,

merge_kwargs={'join':'outer'}

also pads missing values with NaN where the latitude/longitude coordinates of different member_ids do not agree exactly. For example, I tried concatenating members r1i1p1f1 and r2i1p1f1 of MPI-ESM1-2-HR for ssp585. The coordinates of these members are nominally the same, but the indexes of their latitudes differ very slightly (on the order of 1e-14), so that xarray.DataArray.equals returns False and {'join':'exact'} fails on the latitude coordinate. With {'join':'outer'}, the result is zonal bands of NaNs, which is problematic for the subsetting.
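The effect can be reproduced with a minimal sketch (made-up coordinates, not the actual model grids):

import numpy as np
import xarray as xr

lat = np.array([10.0, 20.0, 30.0])
ds_a = xr.Dataset({'psl': ('lat', np.ones(3))}, coords={'lat': lat})
ds_b = xr.Dataset({'psl': ('lat', np.ones(3))}, coords={'lat': lat + 1e-14})  # tiny float offset

print(ds_a['lat'].equals(ds_b['lat']))  # False
merged = xr.concat([ds_a, ds_b], dim='member_id', join='outer')
print(merged.sizes['lat'])  # 6: the union of nearly-identical labels, with NaN bands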

For now, my workaround is to copy the coordinates of the first matching dataset to the other matching datasets in the custom function that concatenates member_ids.

If I preprocessed with xmip at the start, this issue might not arise, but at the moment that preprocessing fails when renaming the variables I'm querying.

jbusecke commented 1 year ago

> If I preprocessed with xmip at the start, this issue might not arise, but at the moment that preprocessing fails when renaming the variables I'm querying.

  1. I'm not sure that xmip would catch this; most of that logic works on a per-dataset basis.
  2. Is this related to #7? Are you sure these warnings result in errors? See my answer there; in my experience this is often a meaningless warning. If not, I would love to see an example.

Generally, to fix the time alignment and to override the lon/lat coordinates, you could use several calls to xr.align with the exclude argument, so that only a subset of dimensions is aligned each time. This is a bit hacky, but it should work at least for the lon/lat alignment issues.

As for the large time chunks, your intuition about missing files seems plausible. Are you able to move ahead without the dataset in question?

Hopefully I will be able to make some progress on #2 soon and then this might go away.

Timh37 commented 1 year ago

Regarding aligning lon/lat separately, this seems to work:

import xarray as xr

def align_lonlat(ds_list):
    """Override each dataset's lon/lat coordinates with those of the first dataset."""
    aligned_ds_list = []
    for ds in ds_list:
        # join='override' copies the indexes of the first argument onto the second;
        # 'time' and 'member_id' are excluded from the alignment
        _, ds_aligned = xr.align(ds_list[0], ds, join='override', exclude=['time', 'member_id'])
        aligned_ds_list.append(ds_aligned)
    return aligned_ds_list

but it doesn't feel optimal; I can't figure out how to pass a list of datasets to xr.align.
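(For what it's worth, xr.align accepts a variable number of positional arguments, so unpacking the list might do it; a sketch:)

aligned_ds_list = list(xr.align(*ds_list, join='override', exclude=['time', 'member_id']))
# join='override' copies the indexes of the first dataset onto all the others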

With regard to #2, I think this issue will partially persist because some ESGF runs don't start in the same year, even if we had them complete on the cloud. I'll test whether dask.config.set(**{'array.slicing.split_large_chunks': True}) helps.
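A sketch of how I'd apply that option, as a context manager around the merge (assuming it takes effect during the alignment):

import dask

with dask.config.set(**{'array.slicing.split_large_chunks': True}):
    ddict_merged = merge_variables(ddict, merge_kwargs={'join': 'outer'})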

Timh37 commented 1 year ago

After preselecting periods, filtering out incomplete datasets, and regridding before combining members, this is no longer an issue, so I'm closing it.