ESMValGroup / ESMValCore

ESMValCore: A community tool for pre-processing data from Earth system models in CMIP and running analysis scripts.
https://www.esmvaltool.org
Apache License 2.0
42 stars 38 forks source link

concatenate experiments handles time ordering incorrectly #2342

Closed dhohn closed 1 month ago

dhohn commented 8 months ago

I am analysing the 1pctCO2 and esm-1pct-brch-1000PgC experiments, where the latter branches from the former after about 70 years. The simulation time for 1pctCO2 is longer than 70 years, so there will be several years that are present in both experiments. When concatenating both experiments I need to take the first 70 years from 1pctCO2 and the remainder from esm-1pct-brch-1000PgC, essentially discarding the years in 1pctCO2 after year 70.

Thats the default behaviour of ESMValTool and iris, except the above logic operates on a per-file basis rather than on per-experiment. So if the experiment datasets consist of multiple files (which they frequently do), we get nonsensical output that depends on how the files were chunked and saved.

Below is an MRE:

import esmvalcore.dataset
import iris
import iris.quickplot as qplt

ds = esmvalcore.dataset.Dataset( **{
    "dataset": "CESM2",
    "project": "CMIP6", 
    "exp": ["1pctCO2", "esm-1pct-brch-1000PgC"], 
    "ensemble": "r1i1p1f1", 
    "grid": "gn", 
    "timerange": "0001/0217",
    "short_name":"tas",
    "mip":"Amon",
})

qplt.plot(ds.load().collapsed(['latitude', 'longitude'], iris.analysis.MEAN))

which gives the following plot:

image

One can see the discontinuity in the temperature value that is caused by taking some timeranges from the wrong experiment.

There is a single unambiguous way to concatenate 2 experiments with overlapping timestamps in the CMIP context and that is at the branching time. I have a local bugfix that first concatenates the experiments-wise and then concatenates the result of that, which I have pushed to https://github.com/ESMValGroup/ESMValCore/tree/experiments_first (will open a PR now)

There are several issues pertaining to concatenation, e.g. this one #1423 which considerable activity over the years. However the underlying issue persists.

bouweandela commented 8 months ago

Thanks for raising the issue and coming up with a solution!

The problem with the approach in the linked pull request is that many cubes will not have an experiment_id attribute, e.g. because the attribute is named differently (it is called experiment for CMIP3, CMIP5, and CORDEX) or experiments are not meaningful in the context of that project (obs4MIPs, OBS, native6). But maybe it could be made to work if you supported those cases?

I would recommend implementing this without adding an extra argument to the concatenate preprocessor function because users may forget to specify that in the recipe.

dhohn commented 8 months ago

Hi @bouweandela,

Thanks for starting the discussion. Unfortunately I currently and for the foreseeable future only use CMIP6. So suggestions and contributions for making this work with other projects are required. Is there a project property of the datasets thats always present so we could only support CMIP6?

Is the concatenate function exposed to/configurable from the recipe? I didn't think the extra argument would affect anything.

bouweandela commented 8 months ago

Unfortunately I currently and for the foreseeable future only use CMIP6. So suggestions and contributions for making this work with other projects are required. Is there a project property of the datasets thats always present so we could only support CMIP6?

If we implement this kind of feature, I think it should work for all projects that define an experiment or users will complain. The attribute defining the project also differs per project, so it would probably make more sense to check for the two variations of exp that we know, i.e. experiment and experiment_id and gracefully accept cases where there is no such attribute.

Is the concatenate function exposed to/configurable from the recipe? I didn't think the extra argument would affect anything.

Indeed it is not exposed from the recipe, only as part of the Python API.

One further question: how does first concatenating all files per experiment solve the issue of selecting the right data? For your use case, it seems you want the data from the first experiment until the second experiment starts and then the full second experiment, but there may be other cases where the opposite is required?

bouweandela commented 8 months ago

Maybe it would be simpler to just make two datasets and then concatenate them?

import esmvalcore.dataset
import esmvalcore.preprocessor
import iris
import iris.quickplot as qplt

tas_parent_exp = esmvalcore.dataset.Dataset(
    dataset="CESM2",
    project="CMIP6",
    exp="1pctCO2",
    ensemble="r1i1p1f1",
    grid="gn",
    timerange="0001/0070",
    short_name="tas",
    mip="Amon",
)

tas_child_exp = tas_parent_exp.copy(
    exp="esm-1pct-brch-1000PgC",
    timerange="0070/0217",
)

tas = esmvalcore.preprocessor.concatenate([tas_parent_exp.load(), tas_child_exp.load()])
qplt.plot(tas.collapsed(['latitude', 'longitude'], iris.analysis.MEAN))

or, if you're using the recipe interface, define two variables for the respective datasets and then do the concatenate call in the diagnostic script? Or do you need other preprocessor functions that require data from both timeranges?

dhohn commented 8 months ago

I just pushed a change to guard for the existence of any available experiment facet name and check for the one that is available, so that should cover all cases. Please have a look. I would appreciate if someone could take care of the tests, or hold my hand for it.

The concatenate() signature is kept the same.

One further question: how does first concatenating all files per experiment solve the issue of selecting the right data? For your use case, it seems you want the data from the first experiment until the second experiment starts and then the full second experiment, but there may be other cases where the opposite is required?

It works because of _check_time_overlaps() which implements the logic you describe. In the context of time dimension (the only dim supported here) and the linear nature of time, that logic is the only one that makes sense.

Maybe it would be simpler to just make two datasets and then concatenate them?

Sure thats possible, but kinda defeats the purpose of supporting that syntax for auto-concat'ing.