Closed ljoakim closed 1 year ago
I can think of two possible approaches to get around this problem:
1. Set the `dataset` key to institute-rcm_name (e.g. `SMHI-RCA4`), and add a new key `rcm_name`, which can be mapped to the ESGF facet. This has the downside of redundancy, since both the institute and the RCM name then occur twice.
2. Set the `dataset` key to `rcm_name` only (e.g. `RCA4`), then add the `institute` value to the module string where the dataset fixes are loaded (`esmvalcore/cmor/_fixes/fix.py:get_fixes(...)`). However, this might need some extra thought for a dataset like the following, where `institute` contains a list:
- {dataset: CCLM4-8-17, project: CORDEX, domain: EUR-11,
exp: [historical, rcp26], ensemble: r1i1p1,
mip: day, institute: [CLMcom, CLMcom-BTU], rcm_version: v1,
driver: MPI-M-MPI-ESM-LR, version: [v20140515, v20171121]}
(This is data from CORDEX and CORDEX-reklies that is stored locally, so it's not downloaded. Don't know if downloading CORDEX-reklies data is supported at the moment.)
Any input or suggestions on how to proceed? Thanks
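To make the second approach concrete, here is a minimal sketch of how candidate fix-module names could be built when `institute` is a list. The helper `candidate_fix_modules` is hypothetical and not part of ESMValCore. The normalisation (lower case, dashes to underscores) and the `{driver}/{institute}_{rcm_name}` layout are assumptions inferred from existing fix file paths such as `esmvalcore/cmor/_fixes/cordex/cnrm_cerfacs_cnrm_cm5/cnrm_aladin63.py`.

```python
def candidate_fix_modules(dataset, institute, driver):
    """Return module paths to try for a CORDEX dataset (hypothetical helper).

    `institute` may be a single string or a list of strings, as in the
    CCLM4-8-17 recipe entry above.
    """
    def normalize(name):
        # Assumed convention: lower case, dashes replaced by underscores.
        return name.replace("-", "_").lower()

    institutes = [institute] if isinstance(institute, str) else list(institute)
    return [
        "esmvalcore.cmor._fixes.cordex."
        f"{normalize(driver)}.{normalize(inst)}_{normalize(dataset)}"
        for inst in institutes
    ]


# One candidate per institute in the list.
modules = candidate_fix_modules(
    dataset="CCLM4-8-17",
    institute=["CLMcom", "CLMcom-BTU"],
    driver="MPI-M-MPI-ESM-LR",
)
for module in modules:
    print(module)
```

With a list of institutes there is more than one candidate module, so the loader would need a rule for which one wins (or whether all of them are applied). That is the part that needs the extra thought.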
My first thought would be to just rename the currently implemented fixes so they no longer contain the institute, e.g. rename `esmvalcore/cmor/_fixes/cordex/cnrm_cerfacs_cnrm_cm5/cnrm_aladin63.py` to `esmvalcore/cmor/_fixes/cordex/cnrm_cerfacs_cnrm_cm5/aladin63.py`, but maybe I'm missing something.
@sloosvel Is there a reason why the institute has been prefixed to the dataset name in the filenames of the currently implemented CORDEX fixes?
Yes, we wrote those based on the data available on DKRZ-Levante, which is following the BADC DRS that is specified in the configs:
drs:
  CMIP6: DKRZ
  CMIP5: DKRZ
  CMIP3: DKRZ
  CORDEX: BADC
  obs4MIPs: default
  ana4mips: default
  OBS: default
  OBS6: default
  native6: default
...
CORDEX:
  input_dir:
    default: '/'
    spec: '{domain}/{institute}/{driver}/{exp}/{ensemble}/{dataset}/{rcm_version}/{mip}/{short_name}'
    BADC: '{domain}/{institute}/{driver}/{exp}/{ensemble}/{dataset}/{rcm_version}/{mip}/{short_name}/{version}'
    ESGF: '{project.lower}/output/{domain}/{institute}/{driver}/{exp}/{ensemble}/{dataset}/{rcm_version}/{frequency}/{short_name}/{version}'
  input_file: '{short_name}_{domain}_{driver}_{exp}_{ensemble}_{dataset}_{rcm_version}_{mip}*.nc'
  output_file: '{project}_{dataset}_{rcm_version}_{driver}_{domain}_{mip}_{exp}_{ensemble}_{short_name}'
  cmor_type: 'CMIP5'
  cmor_path: 'cordex'
In DKRZ, an input dir path looks like this:
EUR-11/SMHI/ICHEC-EC-EARTH/historical/r12i1p1/SMHI-RCA4/v1/mon/tas/
and an input filename like this:
tas_EUR-11_ICHEC-EC-EARTH_historical_r12i1p1_SMHI-RCA4_v1_mon_200101-200512.nc
So the `dataset` tag corresponds to the full `SMHI-RCA4` entry. I know the institute tag seems to be repeated, but that is how the data was stored and how the config-developer file was set up. I don't have access to BADC any more, so I don't know if what's on DKRZ matches what's there.
If the fixes are renamed, an entry for DKRZ that includes the `{institute}-{dataset}` tags should be added to the configs as well, because otherwise the data will not be found any more. Either that, or ask for the DKRZ data to be moved, though I am not sure that is possible.
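The dependency on the full `SMHI-RCA4` value can be reproduced with plain string formatting. The sketch below fills the `spec` directory template and the `input_file` pattern from the config above with the facets of the DKRZ example. Python's `str.format` and `fnmatch` are used here only as a stand-in for ESMValCore's own template handling, which is an assumption on my side.

```python
from fnmatch import fnmatch

# Templates copied from the CORDEX section of the config shown above.
SPEC_DIR = ("{domain}/{institute}/{driver}/{exp}/{ensemble}/"
            "{dataset}/{rcm_version}/{mip}/{short_name}")
INPUT_FILE = ("{short_name}_{domain}_{driver}_{exp}_{ensemble}_"
              "{dataset}_{rcm_version}_{mip}*.nc")

# Directory and file name as reported for DKRZ.
ON_DISK_DIR = "EUR-11/SMHI/ICHEC-EC-EARTH/historical/r12i1p1/SMHI-RCA4/v1/mon/tas"
ON_DISK_FILE = ("tas_EUR-11_ICHEC-EC-EARTH_historical_r12i1p1_"
                "SMHI-RCA4_v1_mon_200101-200512.nc")

facets = {
    "domain": "EUR-11", "institute": "SMHI", "driver": "ICHEC-EC-EARTH",
    "exp": "historical", "ensemble": "r12i1p1", "dataset": "SMHI-RCA4",
    "rcm_version": "v1", "mip": "mon", "short_name": "tas",
}
# Same facets, but with the bare RCM name as the dataset value.
facets_rcm_only = {**facets, "dataset": "RCA4"}

for label, current in (("SMHI-RCA4", facets), ("RCA4", facets_rcm_only)):
    dir_ok = SPEC_DIR.format(**current) == ON_DISK_DIR
    file_ok = fnmatch(ON_DISK_FILE, INPUT_FILE.format(**current))
    print(f"dataset={label}: directory found={dir_ok}, file found={file_ok}")
```

With the templates unchanged, only the institute-prefixed value resolves to the data on disk, which is why a rename of the fixes alone would break data finding.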
It looks like there is some mismatch between the facets described in the CORDEX Archive specification document, where they use `RCMModelName` in the directory:
<activity>/<product>/<Domain>/<Institution>/
<GCMModelName>/<CMIP5ExperimentName>/<CMIP5EnsembleMember>/
<RCMModelName>/<RCMVersionID>/<Frequency>/<VariableName>
and filenames:
VariableName_Domain_GCMModelName_CMIP5ExperimentName_CMIP5EnsembleMember_
RCMModelName_RCMVersionID_Frequency[_StartTime-EndTime].nc
with
RCMModelName (CV to register; model_id) is an identifier of the CORDEX RCM. It consists of the Institution identifier (see above) and a model acronym, connected by a dash (e.g. DMI-HIRHAM5 or SMHI-RCA4).
and how things are organized on ESGF (see available facets here, relevant quotes below), where they call the `RCMModelName` `rcm_model`, judging by the
"directory_format_template_":[
"%(root)s/%(project)s/%(product)s/%(domain)s/%(institute)s/%(driving_model)s/%(experiment)s/%(ensemble)s/%(rcm_model)s/%(rcm_version)s/%(time_frequency)s/%(variable)s/%(version)s",154246],
but it is not populated:
"rcm_model":[],
and instead only the facet `rcm_name` has values:
"rcm_name":[
"ALADIN52",784,
"ALADIN53",422,
"ALADIN63",2893,
"ALADIN64",6,
"ALARO-0",122,
"BOM-SDM",199,
"CCAM",1512,
"CCAM-1704",1976,
"CCAM-2008",3580,
"CCLM-0-9",6,
"CCLM4-21-2",51,
"CCLM4-8-17",4428,
"CCLM4-8-17-CLM3-5",966,
"CCLM5-0-15",3734,
"CCLM5-0-2",1794,
"CCLM5-0-6",1984,
"CCLM5-0-9",52,
"CCLM5-0-9-NEMOMED12-3-6",530,
"COSMO-crCLIM-v1-1",9642,
"CRCM5",2520,
"CRCM5-SN",129,
"CanRCM4",752,
"DeepESD-EE",32,
"Eta",40,
"HIRHAM5",7869,
"HadGEM3-RA",342,
"HadREM3-GA7-05",3269,
"HadRM3P",2269,
"MAR311",140,
"MAR36",254,
"RA",111,
"RACMO21P",1880,
"RACMO22E",6261,
"RACMO22T",1892,
"RCA4",59366,
"RCA4-SN",1150,
"REMO2009",6418,
"REMO2015",13871,
"RRCM",645,
"RegCM4",398,
"RegCM4-0",22,
"RegCM4-2",45,
"RegCM4-3",1436,
"RegCM4-4",5562,
"RegCM4-6",3541,
"RegCM4-7",14780,
"SNURCM",25,
"VRF370",85,
"WRF",364,
"WRF331",28,
"WRF331F",315,
"WRF331G",469,
"WRF341E",752,
"WRF341I",785,
"WRF351",78,
"WRF360J",5642,
"WRF360K",5745,
"WRF360L",1222,
"WRF361H",83,
"WRF381P",626],
After an offline discussion, we agreed on the following approach. I'll look into fixing this:
- The `dataset` key will only contain the `rcm_name` (e.g. `RCA4` or `ALADIN63`).
- In `config-developer.yml`, all instances of `{dataset}` in the `CORDEX` section will be replaced by `{institute}-{dataset}`.
- The fix files in `esmvalcore/cmor/_fixes/cordex/` will be renamed to only `{rcm_name}.py` (instead of `{institute}-{rcm_name}.py`).

If different fixes are required for datasets with the same driver/RCM depending on which institute produced them, this should be handled in the RCM fix file, and the institute could be passed using extra facets. However, as discussed, this should be a rare event and should be dealt with if/when it occurs.
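A rough sketch of the `config-developer.yml` part of this plan, assuming the local data follow the layout reported for DKRZ. The template strings are the ones quoted earlier in this thread; the dictionary `cordex_templates` and the use of `str.replace`/`str.format` are illustrative only and do not reflect how ESMValCore processes the file.

```python
# Current CORDEX templates, as quoted earlier in this thread.
cordex_templates = {
    "spec": ("{domain}/{institute}/{driver}/{exp}/{ensemble}/"
             "{dataset}/{rcm_version}/{mip}/{short_name}"),
    "BADC": ("{domain}/{institute}/{driver}/{exp}/{ensemble}/"
             "{dataset}/{rcm_version}/{mip}/{short_name}/{version}"),
    "input_file": ("{short_name}_{domain}_{driver}_{exp}_{ensemble}_"
                   "{dataset}_{rcm_version}_{mip}*.nc"),
}

# Planned change: every {dataset} becomes {institute}-{dataset}.
updated = {
    key: template.replace("{dataset}", "{institute}-{dataset}")
    for key, template in cordex_templates.items()
}

# The recipe now carries only the RCM name in `dataset`.
recipe_facets = {
    "domain": "EUR-11", "institute": "SMHI", "driver": "ICHEC-EC-EARTH",
    "exp": "historical", "ensemble": "r12i1p1", "dataset": "RCA4",
    "rcm_version": "v1", "mip": "mon", "short_name": "tas",
}

resolved_dir = updated["spec"].format(**recipe_facets)
resolved_file = updated["input_file"].format(**recipe_facets)
print(resolved_dir)
print(resolved_file)
```

The resolved directory and file pattern still contain `SMHI-RCA4`, so data stored under the institute-prefixed name stays reachable while recipes and fix modules use the bare RCM name.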
Sounds good to me. Note that this change will not be backward compatible for users who have a custom `config-developer.yml` file: they will need to update it themselves.
A note regarding `config-developer.yml`: for the CORDEX `ESGF` entry I will not use `{institute}-{dataset}` but only `{dataset}` (i.e. the `rcm_name`), so that the default download behaviour works. ESMValCore builds the path for a downloaded dataset from the `dataset_id` in the ESGF search result, which has the following template, with only `rcm_name`:
dataset_id_template_ = cordex.%(product)s.%(domain)s.%(institute)s.%(driving_model)s.%(experiment)s.%(ensemble)s.%(rcm_name)s.%(rcm_version)s.%(time_frequency)s.%(variable)s
So, even if the path on the ESGF node file system uses `{institute}-{rcm_name}`, a dataset downloaded by ESMValCore will be put in a path using only `{rcm_name}`. A bit confusing, but I hope it makes sense.
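For illustration, the template above can be filled with Python's `%` formatting. The facet values below are example values taken from the SMHI/RCA4 case discussed in this thread, with `output` as the product as in the `ESGF` directory template. This only shows what the resulting `dataset_id` looks like; how ESMValCore turns it into a local path is not reproduced here.

```python
# dataset_id template as quoted above.
DATASET_ID_TEMPLATE = (
    "cordex.%(product)s.%(domain)s.%(institute)s.%(driving_model)s."
    "%(experiment)s.%(ensemble)s.%(rcm_name)s.%(rcm_version)s."
    "%(time_frequency)s.%(variable)s"
)

# Example facet values (illustrative, based on the SMHI/RCA4 case).
esgf_facets = {
    "product": "output",
    "domain": "EUR-11",
    "institute": "SMHI",
    "driving_model": "ICHEC-EC-EARTH",
    "experiment": "historical",
    "ensemble": "r12i1p1",
    "rcm_name": "RCA4",
    "rcm_version": "v1",
    "time_frequency": "mon",
    "variable": "tas",
}

dataset_id = DATASET_ID_TEMPLATE % esgf_facets
print(dataset_id)

# The eighth dot-separated component is the bare RCM name,
# not the institute-prefixed form.
rcm_component = dataset_id.split(".")[7]
print(rcm_component)
```

Since the institute appears only as its own component, any path derived from this id cannot contain `SMHI-RCA4`, which is why the `ESGF` entry keeps the plain `{dataset}`.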
A question (including @zklaus): Can anyone verify whether the `BADC` and `spec` entries will require `{institute}-{dataset}`?
There seems to be an inconsistency regarding what the `dataset` key should contain for a CORDEX project dataset.
- For download, the `dataset` value must contain only the name of the RCM (e.g. `RCA4`), which is mapped to `rcm_name` in the facet used for download.
- For the fixes (for the module in `_fixes/cordex` to be imported), `dataset` must contain the institute id and the RCM name concatenated with a dash (e.g. `SMHI-RCA4`).

Compare the `dataset` value in the tests for the two cases, which shows the difference:
- Download test (look for the CORDEX test): `tests/integration/esgf/test_search_download.py`
- Any of the CORDEX `_fixes` tests, e.g. `tests/integration/cmor/_fixes/cordex/test_ichec_ec_earth.py`

Running the following recipe completes as a successful run, but the program never enters `TimeLongName.fix_metadata(...)` in `esmvalcore/cmor/_fixes/cordex/cordex_fixes.py`, which should be used by this dataset: