Closed: Timh37 closed this issue 1 year ago
I have faced this before. I usually preselect the time with `.sel(time=slice(None, '2100'))` to get rid of the long-running members.
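As a minimal sketch of that preselection (synthetic data; the variable name `tas` and the yearly time axis are placeholders, not the actual CMIP6 datasets), string-based slicing on a datetime index keeps everything up to and including 2100:

```python
import numpy as np
import pandas as pd
import xarray as xr

# synthetic member running well past 2100 (kept within pandas' timestamp range)
time = pd.date_range("2015", "2250", freq="YS")
ds = xr.Dataset({"tas": ("time", np.ones(time.size))}, coords={"time": time})

# string slicing on a datetime index is inclusive of the whole year 2100
ds_short = ds.sel(time=slice(None, "2100"))
```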
To accommodate 'late start' runs, you could try:

```python
ddict_merged = merge_variables(ddict, merge_kwargs={'join': 'outer'})
```

I think this should pad missing values with NaN. What year do they usually start?
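To illustrate what an outer join does to runs of different lengths, here is a minimal sketch with plain `xr.merge` (synthetic data; the variable names `tas`/`psl` are placeholders for two variables with different time extents):

```python
import numpy as np
import pandas as pd
import xarray as xr

t_short = pd.date_range("2015", "2100", freq="YS")  # member ending in 2100
t_long = pd.date_range("2015", "2250", freq="YS")   # long-running member

a = xr.Dataset({"tas": ("time", np.ones(t_short.size))}, coords={"time": t_short})
b = xr.Dataset({"psl": ("time", np.ones(t_long.size))}, coords={"time": t_long})

# join='outer' takes the union of the time axes and pads 'tas' with NaN
merged = xr.merge([a, b], join="outer")
```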
I have implemented the preselection with:
```python
def shorten_ssp_runs(ddict, end_year):
    # build a new dict rather than aliasing the input
    # (ddict_shortened = ddict would mutate the original in place)
    ddict_shortened = {}
    for k, v in ddict.items():
        if 'ssp' in k:
            ddict_shortened[k] = v.sel(time=slice(None, str(end_year)))
        else:
            ddict_shortened[k] = v
    return ddict_shortened
```
then found out a remaining issue: for some variants, timesteps are missing that are not missing for others. While it is possible to pad these missing values with NaN, as you suggest,

```python
ddict_merged = merge_variables(ddict, merge_kwargs={'join': 'outer'})
```

this results in large chunks for those variants for which many timesteps are missing. For example, the chunking of variant `r111i1p1f1` of EC-Earth3 becomes much larger after merging different members [chunk-size screenshots from the original post not reproduced here]. Note that for EC-Earth3, each year is stored in a separate file on ESGF. Probably, some of these files are missing on Google Cloud. I have added these instances to #2.
As an additional issue, `merge_kwargs={'join':'outer'}` also pads missing values with NaN where the latitude/longitude coordinates of different `member_id`'s do not exactly agree. For example, I've tried concatenating members `r1i1p1f1` and `r2i1p1f1` of MPI-ESM1-2-HR for ssp585. The coordinates of these members are essentially the same, but the indexes of their latitudes differ very slightly (order 10^-14), so that `xarray.DataArray.equals` returns `False` and `{'join':'exact'}` fails on the latitude coordinate. The result is that `{'join':'outer'}` produces zonal bands of NaNs, which is problematic for the subsetting.
For now, my workaround is to copy the coordinates of the first matching dataset onto the other matching datasets in the custom function that concatenates `member_id`'s.
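That workaround can be sketched like this (synthetic data; the 1e-14 perturbation mimics the MPI-ESM1-2-HR latitude mismatch described above):

```python
import numpy as np
import xarray as xr

lat = np.linspace(-90, 90, 5)
ds1 = xr.Dataset({"tas": ("lat", np.ones(5))}, coords={"lat": lat})
# second member whose latitudes differ at machine precision
ds2 = xr.Dataset({"tas": ("lat", np.full(5, 2.0))}, coords={"lat": lat + 1e-14})

# exact comparison fails on the tiny offset ...
lat_equal = ds1.lat.equals(ds2.lat)

# ... so copy the first member's coordinates before concatenating
ds2_fixed = ds2.assign_coords(lat=ds1.lat)
combined = xr.concat([ds1, ds2_fixed], dim="member_id", join="exact")
```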
If I were to preprocess with xmip at the start, this issue might not arise, but at the moment that preprocessing fails on renaming the variables I'm querying.
Generally, to fix the time alignment and to override the lon/lat coordinates, you could use several calls to `xr.align`, using the `exclude` argument to align only a subset of dimensions. This is a bit hacky but should work, at least for the lon/lat alignment issues.
I think for the large time chunks your intuition about missing files seems plausible. Are you able to move ahead on this without the dataset in question?
Hopefully I will be able to make some progress on #2 soon and then this might go away.
Regarding aligning lon/lat separately, this seems to work:
```python
def align_lonlat(ds_list):
    aligned_ds_list = []
    for ds in ds_list:
        # override each dataset's lon/lat indexes with those of the first dataset
        a, b = xr.align(ds_list[0], ds, join='override', exclude=['time', 'member_id'])
        aligned_ds_list.append(b)
    return aligned_ds_list
```
but doesn't feel very optimal. I can't figure out how to pass a list of datasets to `xr.align`.
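For what it's worth, `xr.align` accepts any number of objects as positional arguments, so a list can be unpacked with `*` (a sketch on synthetic data; with `join='override'` the indexes of the first object are copied onto the rest):

```python
import numpy as np
import pandas as pd
import xarray as xr

time = pd.date_range("2015", periods=2, freq="YS")
lat = np.linspace(-90, 90, 4)

# synthetic members whose latitude indexes differ at machine precision
ds_list = [
    xr.Dataset(
        {"tas": (("time", "lat"), np.full((2, 4), float(i)))},
        coords={"time": time, "lat": lat + i * 1e-14},
    )
    for i in range(3)
]

# unpack the list; the first dataset's lat index overrides the others
aligned = xr.align(*ds_list, join="override", exclude=["time"])
```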
With regards to #2, I think this issue will partially persist because some ESGF runs don't start in the same year, even if we had them complete on the cloud. I'll test whether `dask.config.set(**{'array.slicing.split_large_chunks': True})` helps.
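For reference, that setting can be applied as a context manager around the merge (a sketch only; whether it actually helps with the chunking here is what remains to be tested):

```python
import dask

# temporarily allow dask to split the large chunks produced by outer joins
with dask.config.set(**{"array.slicing.split_large_chunks": True}):
    flag = dask.config.get("array.slicing.split_large_chunks")
    # e.g. ddict_merged = merge_variables(ddict, merge_kwargs={'join': 'outer'})
```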
When preselecting periods, filtering out incomplete datasets, and regridding before combining members, this is no longer an issue, so I'm closing it.
Functionality is needed to combine simulations of models with unusual lengths, e.g., SSP experiments running past 2100 or historical experiments provided only from a year later than 1850 (which happens for, e.g., EC-Earth3). The notebook in https://github.com/Timh37/CMIP6cf/blob/main/notebooks/get_CMIP6_gridded_around_tgs_xmip.ipynb currently drops these datasets with the following warning (example):