Inconsistency in `dataset` naming for CORDEX: download vs fixes

ljoakim commented 1 year ago

There seems to be an inconsistency regarding what the dataset key should contain for a CORDEX project dataset.

For download to work, the dataset value must contain only the name of the RCM (e.g. RCA4), which is mapped to rcm_name in the facets used for download.
For the correct dataset fixes to be applied (i.e. for the correct file in _fixes/cordex to be imported), dataset must contain the institute ID and the RCM name joined by a dash (e.g. SMHI-RCA4).
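
To illustrate the mismatch, here is a minimal sketch of how a fix module name could be derived from the dataset value. The helper fix_module_name and the set of available modules are hypothetical, and the lower-casing and dash-to-underscore convention is an assumption based on how the existing fix files appear to be named. It is not the actual lookup code in ESMValCore.

```python
def fix_module_name(dataset: str) -> str:
    """Hypothetical helper: turn a dataset value into a module-style name."""
    return dataset.replace("-", "_").lower()


# Hypothetical set of fix modules, assumed to be named {institute}_{rcm_name}.
available_fix_modules = {"smhi_rca4", "cnrm_aladin63"}

for dataset in ("RCA4", "SMHI-RCA4"):
    name = fix_module_name(dataset)
    found = name in available_fix_modules
    print(f"{dataset!r} -> {name!r}: fix module found = {found}")
```

With dataset set to RCA4 (as download requires), the lookup in this sketch misses the module, which matches the symptom described: the run succeeds but the dataset fixes are never applied.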

Compare the dataset value in the tests for the two cases, which shows the difference:

Download test (look for the CORDEX test): tests/integration/esgf/test_search_download.py

and any of the CORDEX _fixes tests, e.g.: tests/integration/cmor/_fixes/cordex/test_ichec_ec_earth.py

Running the following recipe completes successfully, but the program never enters TimeLongName.fix_metadata(...) in esmvalcore/cmor/_fixes/cordex/cordex_fixes.py, which should be used by this dataset:

documentation:
  description: CORDEX download example
  title: CORDEX download example
  authors:
    - righi_mattia

datasets:
- {dataset: RCA4, project: CORDEX, domain: EUR-11,
     exp: historical, ensemble: r12i1p1,
     mip: mon, institute: SMHI, rcm_version: v1,
     driver: ICHEC-EC-EARTH, version: v20131026}

diagnostics:
  map:
    description: EUR-11 map of temperature in January 2000.
    variables:
      tas:
        timerange: 2000/P1M
    scripts:
      script1:
        script: examples/diagnostic.py
        quickplot:
          plot_type: pcolormesh
          cmap: Reds

ljoakim commented 1 year ago

I can think of two possible approaches to get around this problem:

Set the dataset key to institute-rcm_name (e.g. SMHI-RCA4) and add a new key rcm_name, which can be mapped to the ESGF facet. The downside is redundancy, since both the institute and the RCM name then occur twice.
Set the dataset key to rcm_name only (e.g. RCA4), then add the institute value to the module string where the dataset fixes are loaded (esmvalcore/cmor/_fixes/fix.py:get_fixes(...)). However, this needs some extra thought when institute contains a list, as in the following dataset:

- {dataset: CCLM4-8-17, project: CORDEX, domain: EUR-11,
     exp: [historical, rcp26], ensemble: r1i1p1,
     mip: day, institute: [CLMcom, CLMcom-BTU], rcm_version: v1,
     driver: MPI-M-MPI-ESM-LR, version: [v20140515, v20171121]}

(This is data from CORDEX and CORDEX-reklies that is stored locally, so it's not downloaded. I don't know whether downloading CORDEX-reklies data is supported at the moment.)
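
A rough sketch of what the second approach might involve when institute is a list: one candidate module name per institute. The function candidate_fix_modules is hypothetical and the naming convention (lower case, dashes replaced by underscores) is an assumption. It only shows that a list-valued institute yields several candidates that would each need to be checked.

```python
from typing import List, Union


def candidate_fix_modules(dataset: str,
                          institute: Union[str, List[str]]) -> List[str]:
    """Hypothetical helper: build one candidate module name per institute."""
    institutes = [institute] if isinstance(institute, str) else list(institute)
    return [
        f"{inst}-{dataset}".replace("-", "_").lower() for inst in institutes
    ]


# Single institute, as in the first recipe.
print(candidate_fix_modules("RCA4", "SMHI"))
# List of institutes, as in the CCLM4-8-17 example.
print(candidate_fix_modules("CCLM4-8-17", ["CLMcom", "CLMcom-BTU"]))
```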

Any input or suggestions on how to proceed? Thanks

bouweandela commented 1 year ago

My first thought would be to just rename the currently implemented fixes so they do not contain the institute anymore, e.g. rename esmvalcore/cmor/_fixes/cordex/cnrm_cerfacs_cnrm_cm5/cnrm_aladin63.py to esmvalcore/cmor/_fixes/cordex/cnrm_cerfacs_cnrm_cm5/aladin63.py, but maybe I'm missing something.

@sloosvel Is there a reason why the institute has been prefixed to the dataset name in the filenames of the currently implemented CORDEX fixes?

sloosvel commented 1 year ago

Yes, we wrote those based on the data available on DKRZ-Levante, which is following the BADC DRS that is specified in the configs:

drs:
  CMIP6: DKRZ
  CMIP5: DKRZ
  CMIP3: DKRZ
  CORDEX: BADC
  obs4MIPs: default
  ana4mips: default
  OBS: default
  OBS6: default
  native6: default
...
CORDEX:
  input_dir:
    default: '/'
    spec: '{domain}/{institute}/{driver}/{exp}/{ensemble}/{dataset}/{rcm_version}/{mip}/{short_name}'
    BADC: '{domain}/{institute}/{driver}/{exp}/{ensemble}/{dataset}/{rcm_version}/{mip}/{short_name}/{version}'
    ESGF: '{project.lower}/output/{domain}/{institute}/{driver}/{exp}/{ensemble}/{dataset}/{rcm_version}/{frequency}/{short_name}/{version}'
  input_file: '{short_name}_{domain}_{driver}_{exp}_{ensemble}_{dataset}_{rcm_version}_{mip}*.nc'
  output_file: '{project}_{dataset}_{rcm_version}_{driver}_{domain}_{mip}_{exp}_{ensemble}_{short_name}'
  cmor_type: 'CMIP5'
  cmor_path: 'cordex'

In DKRZ, an input dir path looks like this:

EUR-11/SMHI/ICHEC-EC-EARTH/historical/r12i1p1/SMHI-RCA4/v1/mon/tas/

and an input filename like this:

tas_EUR-11_ICHEC-EC-EARTH_historical_r12i1p1_SMHI-RCA4_v1_mon_200101-200512.nc

So the dataset tag corresponds to the full SMHI-RCA4 entry. I know the institute tag seems to be repeated, but it's how the data was stored and how the config developer file was set. I don't have access to BADC any more, so I don't know if what's in DKRZ matches what's in there.

If the fixes are renamed, an entry for DKRZ that uses {institute}-{dataset} should be added to the configs as well, because otherwise the data will no longer be found. The alternative would be to ask for the DKRZ data to be moved, but I am not sure that is possible.

bouweandela commented 1 year ago

It looks like there is some mismatch between the facets described in the CORDEX Archive specification document, where they use RCMModelName in the directory:

<activity>/<product>/<Domain>/<Institution>/
<GCMModelName>/<CMIP5ExperimentName>/<CMIP5EnsembleMember>/
<RCMModelName>/<RCMVersionID>/<Frequency>/<VariableName>

and filenames:

VariableName_Domain_GCMModelName_CMIP5ExperimentName_CMIP5EnsembleMember_
RCMModelName_RCMVersionID_Frequency[_StartTime-EndTime].nc

with

RCMModelName (CV to register; model_id) is an identifier of the CORDEX RCM. It consists of the Institution identifier (see above) and a model acronym, connected by a dash (e.g. DMI-HIRHAM5 or SMHI-RCA4).
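
As a small illustration of that definition, the sketch below composes an RCMModelName from an institute and a model acronym, and recovers the acronym again. Both helper names are hypothetical. Because institute identifiers and model acronyms can themselves contain dashes (e.g. CLMcom-BTU and CCLM4-8-17), the sketch assumes the institute is known when splitting; the composed name CLMcom-BTU-CCLM4-8-17 is only an example built from the rule, not a verified registry entry.

```python
def rcm_model_name(institute: str, rcm_name: str) -> str:
    """Compose an RCMModelName as institute and acronym joined by a dash."""
    return f"{institute}-{rcm_name}"


def strip_institute(model_name: str, institute: str) -> str:
    """Recover the model acronym, given the institute it was prefixed with."""
    prefix = f"{institute}-"
    if not model_name.startswith(prefix):
        raise ValueError(f"{model_name!r} does not start with {prefix!r}")
    return model_name[len(prefix):]


print(rcm_model_name("SMHI", "RCA4"))
print(strip_institute("CLMcom-BTU-CCLM4-8-17", "CLMcom-BTU"))
# Splitting on the first dash alone would be ambiguous here:
print("CLMcom-BTU-CCLM4-8-17".split("-", 1))
```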

and how things are organized on ESGF (see available facets here, relevant quotes below), where they call the RCMModelName rcm_model, judging by the directory_format_template_:

      "directory_format_template_":[
        "%(root)s/%(project)s/%(product)s/%(domain)s/%(institute)s/%(driving_model)s/%(experiment)s/%(ensemble)s/%(rcm_model)s/%(rcm_version)s/%(time_frequency)s/%(variable)s/%(version)s",154246],

but it is not populated:

      "rcm_model":[],

and instead only the facet rcm_name has values:

      "rcm_name":[
        "ALADIN52",784,
        "ALADIN53",422,
        "ALADIN63",2893,
        "ALADIN64",6,
        "ALARO-0",122,
        "BOM-SDM",199,
        "CCAM",1512,
        "CCAM-1704",1976,
        "CCAM-2008",3580,
        "CCLM-0-9",6,
        "CCLM4-21-2",51,
        "CCLM4-8-17",4428,
        "CCLM4-8-17-CLM3-5",966,
        "CCLM5-0-15",3734,
        "CCLM5-0-2",1794,
        "CCLM5-0-6",1984,
        "CCLM5-0-9",52,
        "CCLM5-0-9-NEMOMED12-3-6",530,
        "COSMO-crCLIM-v1-1",9642,
        "CRCM5",2520,
        "CRCM5-SN",129,
        "CanRCM4",752,
        "DeepESD-EE",32,
        "Eta",40,
        "HIRHAM5",7869,
        "HadGEM3-RA",342,
        "HadREM3-GA7-05",3269,
        "HadRM3P",2269,
        "MAR311",140,
        "MAR36",254,
        "RA",111,
        "RACMO21P",1880,
        "RACMO22E",6261,
        "RACMO22T",1892,
        "RCA4",59366,
        "RCA4-SN",1150,
        "REMO2009",6418,
        "REMO2015",13871,
        "RRCM",645,
        "RegCM4",398,
        "RegCM4-0",22,
        "RegCM4-2",45,
        "RegCM4-3",1436,
        "RegCM4-4",5562,
        "RegCM4-6",3541,
        "RegCM4-7",14780,
        "SNURCM",25,
        "VRF370",85,
        "WRF",364,
        "WRF331",28,
        "WRF331F",315,
        "WRF331G",469,
        "WRF341E",752,
        "WRF341I",785,
        "WRF351",78,
        "WRF360J",5642,
        "WRF360K",5745,
        "WRF360L",1222,
        "WRF361H",83,
        "WRF381P",626],

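The facet listing above is a flat list that alternates between a value and its count. A minimal sketch of turning such a list into a mapping, using a few of the entries quoted above (the variable names are only for illustration):

```python
# A few entries copied from the rcm_name facet listing: value, count, value, count, ...
rcm_name_facet = ["ALADIN63", 2893, "HIRHAM5", 7869, "RCA4", 59366]

# Pair every even-indexed value with the odd-indexed count that follows it.
counts = dict(zip(rcm_name_facet[0::2], rcm_name_facet[1::2]))

print(counts["RCA4"])
print(max(counts, key=counts.get))
```
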
ljoakim commented 1 year ago

After an offline discussion, we agreed on the following approach. I'll look into fixing this:

The dataset key will only contain the rcm_name (e.g. RCA4 or ALADIN63)
In config-developer.yml, all instances of {dataset} in the CORDEX section will be replaced by {institute}-{dataset}.
All files in esmvalcore/cmor/_fixes/cordex/ will be renamed to only {rcm_name}.py (instead of {institute}-{rcm_name}.py)
Update tests to reflect changes.

If different fixes for datasets with same driver/rcm are required based on what institute produced it, this should be handled in the rcm fix file, and institute could be passed using extra facets. However, as discussed, this should be a rare event and should be dealt with if/when it occurs.
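
To make the effect of the config-developer.yml change concrete, here is a sketch that expands the BADC input_dir template before and after replacing {dataset} with {institute}-{dataset}. It uses plain str.format as a stand-in for however ESMValCore actually substitutes facets, so treat it as an illustration only. The facet values are taken from the recipe at the top of this issue.

```python
facets = {
    "domain": "EUR-11",
    "institute": "SMHI",
    "driver": "ICHEC-EC-EARTH",
    "exp": "historical",
    "ensemble": "r12i1p1",
    "dataset": "RCA4",  # rcm_name only, as agreed
    "rcm_version": "v1",
    "mip": "mon",
    "short_name": "tas",
    "version": "v20131026",
}

old_template = ("{domain}/{institute}/{driver}/{exp}/{ensemble}/"
                "{dataset}/{rcm_version}/{mip}/{short_name}/{version}")
# Proposed change: every {dataset} becomes {institute}-{dataset}.
new_template = old_template.replace("{dataset}", "{institute}-{dataset}")

old_path = old_template.format(**facets)
new_path = new_template.format(**facets)
print(old_path)
print(new_path)
```

With dataset: RCA4, only the new template produces the SMHI-RCA4 directory component that the DKRZ layout shown earlier uses.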

bouweandela commented 1 year ago

Sounds good to me. Note that this change will not be backward compatible for those users who have a custom config-developer.yml file: they will need to update it themselves.

ljoakim commented 1 year ago

A note regarding config-developer.yml: for the CORDEX ESGF entry I will not use {institute}-{dataset} but only {dataset} (i.e. the rcm_name), so that the default download behaviour works. ESMValCore builds the path for a downloaded dataset from the dataset_id in the ESGF search result, which has the following template, containing only rcm_name:

dataset_id_template_ = cordex.%(product)s.%(domain)s.%(institute)s.%(driving_model)s.%(experiment)s.%(ensemble)s.%(rcm_name)s.%(rcm_version)s.%(time_frequency)s.%(variable)s

So, even if the path on the ESGF node file system uses {institute}-{rcm_name}, data downloaded by ESMValCore will be put in a path that uses only {rcm_name}. A bit confusing, but I hope it makes sense.
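
For illustration, filling the dataset_id template with the facets of the recipe at the top of this issue gives an ID that contains RCA4 but not SMHI-RCA4. The template uses Python-style %(name)s placeholders, so the sketch uses the % operator. The product value "output" is an assumption taken from the ESGF input_dir entry in the config, and the time_frequency and variable values are assumed to match mip and short_name from the recipe.

```python
dataset_id_template = (
    "cordex.%(product)s.%(domain)s.%(institute)s.%(driving_model)s."
    "%(experiment)s.%(ensemble)s.%(rcm_name)s.%(rcm_version)s."
    "%(time_frequency)s.%(variable)s"
)

facets = {
    "product": "output",  # assumption, see lead-in
    "domain": "EUR-11",
    "institute": "SMHI",
    "driving_model": "ICHEC-EC-EARTH",
    "experiment": "historical",
    "ensemble": "r12i1p1",
    "rcm_name": "RCA4",
    "rcm_version": "v1",
    "time_frequency": "mon",
    "variable": "tas",
}

dataset_id = dataset_id_template % facets
print(dataset_id)
print("SMHI-RCA4" in dataset_id)
```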

A question (including @zklaus): Can anyone verify whether the BADC and spec entries will require {institute}-{dataset}?

ESMValGroup / ESMValCore

Inconsistency in `dataset` naming for CORDEX: download vs fixes #2032