Create an 'EC-Earth CMIP6 data request' json for each MIP experiment

treerink commented 5 years ago

By 'EC-Earth CMIP6 data request' I mean the subset of CMIP6 requested variables for a certain MIP experiment that can actually be produced by EC-Earth3.

If this 'EC-Earth CMIP6 data request' is written to a json file it can be easily used as the data request file at time of cmorization, it can be easily diffed and it can be copied in the namelist subdir of each MIP experiment and thus archived at the EC-Earth svn repository. The latter wouldn't be a good idea with the *.xlsx data request files.
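To illustrate why a json file diffs well, here is a minimal sketch that renders a request with sorted keys and one entry per line. The layout (model component, then table, then variable names), the function name and the example content are assumptions made for this illustration, not an agreed file format.

```python
import json


def render_request(request):
    """Render a {component: {table: [variables]}} mapping as stable json text."""
    # sort_keys and indent give a fixed key order and one entry per line,
    # which keeps line-based diffs (e.g. svn diff) small and readable.
    return json.dumps(request, indent=4, sort_keys=True) + "\n"


# Hypothetical content, only meant to show the shape of such a file.
example = {"nemo": {"Omon": ["tos"]}, "ifs": {"Amon": ["tas", "pr"]}}

if __name__ == "__main__":
    print(render_request(example))
```

Because the key order is fixed, two runs with the same request produce identical text regardless of the order in which the components were collected.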

treerink commented 5 years ago

I think it is easiest to create this file with checkvars.py, because all model components are considered there.

zklaus commented 5 years ago

Hi @treerink, what is the situation here? At SMHI we are in the process of settling on on-the-fly generated xlsx files for the data request. Basically we want to use

drq -m _all_ -e piControl --xls

where we change the experiment, of course, but keep -m _all_ for all runs.

That means we need one data request file per experiment, regardless of the involved mips, multiplied by the configurations.

I guess we should make a decision one way or the other (perhaps in the TWG?) and then document this so that everyone can approach this in the same way. What do you think?

treerink commented 5 years ago

@zklaus the original idea would be the most convenient: produce a json variant of the data request that only includes the variables which are requested for a certain experiment AND which can be produced by the EC-Earth3 model configuration in use, and archive it in the control output subdirectory of each experiment. The difficulty here, which kept us from implementing this quickly, is again the "preference" issue (also referred to here as the double-counting issue).

The whole bunch of original xlsx CMIP6 data request files is of course produced by genecec at the moment I produce the control output files, so I have them and in principle I could easily share them. However, xlsx files are not nice to archive under svn, because they won't give an svn diff (they are difficult to diff anyway, though possible to a certain extent) and because of their size. The latter would not be nice because there are quite a lot of experiments.

aearamos commented 5 years ago

I've been thinking about this issue and we also discussed the xlsx files here at BSC. It would be nice to have the xlsx tables and/or the .json files that should be used by ece2cmor3 to cmorize each one of the MIPs in the ctrl folder. We could use this file as a reference for that MIP, assuming it was generated by the Data Request and has the correct variables. Right now our idea was to have the ppt/xml files in runtime/ctrl and the tables somewhere else, but I'm not sure this is the best approach. If we had a reliable table inside each folder, for DCPP, piControl, OMIP, etc., we can just point ece2cmor.py to that file.

treerink commented 5 years ago

See also the discussion in #224. The solution of this issue to provide json data request files depends on a solution for the double-counting variables with a preference file.

treerink commented 5 years ago

We just discussed the general design: whether and how we will create the json data request file, and where it will be archived.

We noted that for a joint data request, such as the one for the Core MIP experiments run by the AOGCM version (the joined request of these 10 MIPs), the activity_id is CMIP, which means we can jointly upload this joined CMIP data for each EC-Earth model configuration. The same applies to MIPs that only request data, like CORDEX: if they request data within e.g. ScenarioMIP, then the activity_id is ScenarioMIP. In a third case, in which experiments are shared across MIPs, I understand the MIPs can be listed in a certain order in the activity_id, separated by a single space.

An additional script will be created (and called for each experiment by genecec) which reads the general (joined) .xlsx data request file (as created by drq while running genecec) and uses the taskloader to omit the variable-table combinations that are in the ignored list for EC-Earth3. The tasks will then be matched against a preference file in order to account for the double-counting variables (#224). This new script will thus need two arguments: 1. The .xlsx data request file 2. The EC-Earth3 model configuration (e.g. EC-Earth3-AOGCM). The name of the generated json data request file will be labeled by the EC-Earth3 model configuration, and in the few cases where a MIP is run by more than one EC-Earth3 model configuration, there will be more than one json data request file in the control output directory. Note however that for the Core MIP there is already a separation per EC-Earth3 model configuration, so only one json data request file will end up in these directories.
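The planned steps (drop ignored combinations, then resolve double-counted variables through a preference) could be sketched roughly as follows. This is not the script discussed here: it works on plain Python data instead of the .xlsx file and the taskloader, and all names (build_varlist, producers, choose) are hypothetical.

```python
from collections import defaultdict


def build_varlist(requested, ignored, producers, choose, config):
    """Group requested (table, variable) pairs by the component that delivers them.

    requested: iterable of (table, variable) pairs from the data request
    ignored:   set of (table, variable) pairs on the ignored list
    producers: dict mapping (table, variable) to the components able to produce it
    choose:    hypothetical preference callback, called only for double-counted pairs
    config:    name of the model configuration, passed through to choose
    """
    varlist = defaultdict(lambda: defaultdict(list))
    for table, variable in requested:
        if (table, variable) in ignored:
            continue  # explicitly ignored for this model
        candidates = producers.get((table, variable), [])
        if not candidates:
            continue  # no component can produce this combination
        if len(candidates) == 1:
            component = candidates[0]
        else:
            component = choose(candidates, table, variable, config)
        varlist[component][table].append(variable)
    # Convert to plain dicts so the result can be written as json directly.
    return {comp: dict(tables) for comp, tables in varlist.items()}
```

Keeping the preference as a callback leaves the choice between components in one place, which is where feedback on the double-counted variables would end up.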

The control output files themselves won't be made preference (i.e. EC-Earth3 model configuration) specific, in order to keep the design clear, at the cost of a tiny bit of additional (useless) output.

zklaus commented 5 years ago

> We noted that for a joint data request, such as the one for the Core MIP experiments run by the AOGCM version (the joined request of these 10 MIPs), the activity_id is CMIP, which means we can jointly upload this joined CMIP data for each EC-Earth model configuration. The same applies to MIPs that only request data, like CORDEX: if they request data within e.g. ScenarioMIP, then the activity_id is ScenarioMIP. In a third case, in which experiments are shared across MIPs, I understand the MIPs can be listed in a certain order in the activity_id, separated by a single space.

This sounds good. Indeed, the activity_id only depends on the experiment_id.

> An additional script will be created (and called for each experiment by genecec) which reads the general (joined) .xlsx data request file (as created by drq while running genecec) and uses the taskloader to omit the variable-table combinations that are in the ignored list for EC-Earth3. The tasks will then be matched against a preference file in order to account for the double-counting variables (#224).

Sounds good.

> This new script will thus need two arguments: 1. The .xlsx data request file 2. The EC-Earth3 model configuration (e.g. EC-Earth3-AOGCM).

Wrt the configurations, note that this is CMIP6 controlled vocabulary as source_id. Hence we should stick to the exact spelling of the official list which is

EC-Earth3
EC-Earth3-AerChem
EC-Earth3-CC
EC-Earth3-GrIS
EC-Earth3-HR
EC-Earth3-LR
EC-Earth3-Veg
EC-Earth3-Veg-LR

Note the capitalization, the presence of the 3, the absence of an explicit -AOGCM version (which is the version without a suffix) and the spelling of GrIS.
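A small check against this list can catch spelling variants early. The sketch below uses exactly the eight names listed above; the function name and the "did you mean" hint are assumptions for illustration and not part of any existing script.

```python
# The official source_id spellings as listed above.
SOURCE_IDS = (
    "EC-Earth3",
    "EC-Earth3-AerChem",
    "EC-Earth3-CC",
    "EC-Earth3-GrIS",
    "EC-Earth3-HR",
    "EC-Earth3-LR",
    "EC-Earth3-Veg",
    "EC-Earth3-Veg-LR",
)


def check_source_id(name):
    """Return name if it is an official source_id, otherwise raise ValueError."""
    if name in SOURCE_IDS:
        return name
    message = "unknown source_id %r" % name
    # Offer a hint when only the capitalization differs.
    by_lowercase = dict((s.lower(), s) for s in SOURCE_IDS)
    hint = by_lowercase.get(name.lower())
    if hint is not None:
        message += ", did you mean %r?" % hint
    raise ValueError(message)
```

With this, check_source_id("EC-Earth3-Veg") passes, while "EC-EARTH3-GRIS" or "EC-Earth3-AOGCM" raise a ValueError.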

> The name of the generated json data request file will be labeled by the EC-Earth3 model configuration, and in the few cases where a MIP is run by more than one EC-Earth3 model configuration, there will be more than one json data request file in the control output directory. Note however that for the Core MIP there is already a separation per EC-Earth3 model configuration, so only one json data request file will end up in these directories. The control output files themselves won't be made preference (i.e. EC-Earth3 model configuration) specific, in order to keep the design clear, at the cost of a tiny bit of additional (useless) output.

I'm not sure I understand how treating the CMIP experiments differently from the others simplifies things, but I guess you are in the better position to judge that.

treerink commented 5 years ago

Subtasks for this issue

[x] Add prefs file to resources
[x] Add script drq2varlist to repo
[x] Adapt genecec to generate varlists
[x] Integrate prefs file in drq2varlist
[x] Adapt ece2cmor script(s) to use component-wise varlists

treerink commented 5 years ago

When running:

./drq2varlist.py --drq cmip6-data-request/cmip6-data-request-m\=CMIP.DCPP.LS3MIP.PAMIP.RFMIP.ScenarioMIP.VolMIP.CORDEX.DynVar.SIMIP.VIACSAB-e\=piControl-t\=1-p\=1/cmvme_cm.co.dc.dy.ls.pa.rf.sc.si.vi.vo_piControl_1_1.xlsx --ececonf nemo,ifs

I get the following additions when changing from e46cc12949e40f83dfe01de1edfc757afb5a0f98 to the latest version 3ae9a712f0e06dfb52d051abc400a64a49d1d712:

<             "zg500",

<         ],
<         "AERmon": [
<             "ua"
<         ],
<         "AERmonZ": [
<             "ta"

treerink commented 5 years ago

Hi Gijs,

I also get quite a few differences in the output of genecec, i.e. differences in the output control files and the volume estimates, between running genecec in the master (where the output still matches my previous benchmark run) and in the latest version 3ae9a712f0e06dfb52d051abc400a64a49d1d712 in the task-load-prefs branch. I guess this is due to dd454263417cc6b88e81cf42cce0453db48d81c5?

goord commented 5 years ago

Hi @treerink yes I changed the task loader, so this is expected to impact the genecec script. I do expect that it generates more 'double counted' variables, because the realm check was there to prevent such variables. I inserted a new warning whenever a duplicate variable is encountered:

Multiple models found for variable %s, table %s...choosing first but preference needed

so searching for this message may pinpoint where the script is behaving differently...
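A small helper along these lines could collect the affected variables from a log. It assumes the two %s placeholders are filled with a bare variable name and table name without spaces; the function name is made up for this sketch.

```python
import re

# Matches the warning text quoted above, capturing the variable and the table.
PATTERN = re.compile(
    r"Multiple models found for variable (\S+), table (\S+?)\.\.\.choosing first"
)


def find_unresolved(lines):
    """Return the sorted, unique (table, variable) pairs named in the warnings."""
    found = set()
    for line in lines:
        match = PATTERN.search(line)
        if match:
            found.add((match.group(2), match.group(1)))
    return sorted(found)


if __name__ == "__main__":
    import sys

    # Usage: pipe a log file into this script.
    for table, variable in find_unresolved(sys.stdin):
        print(table, variable)
```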

treerink commented 5 years ago

The creation of the json cmip6 data request files with drq2varlist.py has been added to genecec, which means these files are now created for all MIP experiments. If a MIP experiment is carried out by more than one EC-Earth3 model configuration, a cmip6 data request json file is created for each of these configurations in the control output file subdirectory of this MIP experiment.

The json cmip6 data request files are also properly produced for the joined CMIP requests.

These json cmip6 data request files will be added in a new branch for the control output file updates and will end up hopefully soon in the trunk of the EC-Earth3 svn repository.

Closing this issue. Please open a new issue if some issue arises with these new json cmip6 data request files.

treerink commented 5 years ago

Note that these new json cmip6 data request files do not include ignored variables, that the preferences for "double counting variables" are applied, and that the file is ordered by model component to make it easy to inspect the files.

zklaus commented 5 years ago

@treerink, @goord great! Thanks a bunch! :+1:

@ufladrich, maybe this can help?

ufladrich commented 5 years ago

Hi @treerink , I'm afraid I'm still confused about the usage of drq2varlist. I have applied it to the xls data request that I was using to cmorise before and then I used --vars instead of --drq when running ece2cmor. However, I get a number of errors like

ERROR:ece2cmor3.taskloader: Found duplicate target mrsos in table 3hr for models lpjg and ifs

and then

CRITICAL:ece2cmor3.taskloader: Duplicate requested variables were found, dismissing all cmorization tasks

No output is produced. (As a side note, the IFS job still goes on doing all the time-consuming grib filtering.) When I manually remove all the duplicated targets and duplicated output names from the varlist json file, I get at least a non-empty task list. What am I doing wrong/misunderstanding?

goord commented 5 years ago

Hi Uwe you aren't doing anything wrong, this is a signal that our "preference" script is incomplete, since it doesn't make a choice between ifs or lpjg for e.g. mrsos.

I will make the preferences complete and add a check for ifs variables before entering the grib filtering

goord commented 5 years ago

Hi @ufladrich or @tommibergman can you post the list of duplicate variables that were reported?

tommibergman commented 5 years ago

I got these:

mrsos mrro mrsol mrso mrros evspsblsoi mrsos

Some of them are doubly mentioned through different tables, but maybe that doesn't matter.

goord commented 5 years ago

Ok I committed a fix in which the above variables will be removed from the lpjguess variable list.

ufladrich commented 5 years ago

I have yet to understand what "preference" means in the context of this issue. @goord when you say above that the "preference script is incomplete", do you mean drq2varlist? And if that is the case, does it mean that the preference logic is built into drq2varlist? What I mean is, how does drq2varlist know that the above variables should be taken from IFS, not LPJG?

ufladrich commented 5 years ago

There are two more duplicated targets:

ERROR:ece2cmor3.taskloader: Found duplicate target tsl in table Lmon for models lpjg and ifs
ERROR:ece2cmor3.taskloader: Found duplicate target tsl in table 6hrPlevPt for models lpjg and ifs

and some duplicated output names:

ERROR:ece2cmor3.taskloader: Found duplicate output name for targets ua, ua7h in table 6hrPlevPt for model ifs
ERROR:ece2cmor3.taskloader: Found duplicate output name for targets va, va7h in table 6hrPlevPt for model ifs
ERROR:ece2cmor3.taskloader: Found duplicate output name for targets ta, ta7h in table 6hrPlevPt for model ifs
ERROR:ece2cmor3.taskloader: Found duplicate output name for targets zg7h, zg27 in table 6hrPlevPt for model ifs

I'm not sure what to think about the latter, according to the CMIP6-CMOR tables the duplication is okay.

treerink commented 5 years ago

It means that resources/prefs.py does not yet cover all duplicate variables. The infrastructure is there, but we still have to make sure all duplicate variables are covered in the prefs.py file, and for that we usually need feedback from the scientists.

goord commented 5 years ago

Hi @ufladrich the preference script is here. It is just a python function that determines which variables to keep for which configurations and which to dismiss.

Yes the preference logic is called from drq2varlist. This script gathers all variables that any EC-Earth component could produce, and then runs all of them through the preference function that determines whether to keep it or not. This procedure is supposed to yield a unique set of variables for all data requests and all EC-Earth configurations.
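As a rough picture of what such a preference function does (this is not the actual prefs.py), the sketch below drops the lpjg copy of the soil moisture and runoff variables that were reported as duplicates earlier in this thread. The function names and the rule set are illustrative assumptions.

```python
# Variables reported above as duplicated between lpjg and ifs; in this sketch
# the ifs version is kept. The real preference file may decide differently.
IFS_OVER_LPJG = frozenset(["mrsos", "mrro", "mrsol", "mrso", "mrros"])


def keep_variable(component, table, variable):
    """Return True if this component should deliver the variable."""
    if component == "lpjg" and variable in IFS_OVER_LPJG:
        return False
    return True


def apply_preferences(candidates):
    """Filter a list of (component, table, variable) tuples."""
    return [c for c in candidates if keep_variable(*c)]
```

For example, given both ("ifs", "Lmon", "mrsos") and ("lpjg", "Lmon", "mrsos"), only the ifs entry survives, while lpjg-only variables pass through untouched.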

Whenever you call ece2cmor with the --drq option, it does a drq2varlist first and then a cmorization with the component-wise variable set. It performs a check on the latter to ensure there are no duplicates, because that may give rise to files being overwritten.
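The duplicate check can be pictured like this: a minimal sketch, assuming the component-wise variable set is a {component: {table: [variables]}} mapping. It is not the taskloader code, and the function name is hypothetical.

```python
from collections import defaultdict


def find_duplicate_targets(varlist):
    """Return {(table, variable): [components]} for targets claimed more than once."""
    seen = defaultdict(list)
    for component in sorted(varlist):
        for table, variables in varlist[component].items():
            # set() so a variable listed twice within one component counts once
            for variable in set(variables):
                seen[(table, variable)].append(component)
    return dict((key, comps) for key, comps in seen.items() if len(comps) > 1)
```

Any non-empty result means two components would write the same table and variable combination, which is the situation the check is meant to catch before files get overwritten.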

BTW whenever calling ece2cmor with the --drq option, or drq2varlist, it is best to also give a target EC-Earth configuration (use --help to get a list of those), because that can be used to determine the preference and hence reduces the chance of ending up with duplicates.

goord commented 5 years ago

The duplication of ua, ua7h etc. is a problem because it will cause overwritten variables since the output file names for these variables are identical (see issue #334 ). I believe they have different priorities, and we should decide which ones to keep.
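To see which targets collide, one can group them by output name. In the sketch below the mapping from target to output name is typed in by hand for illustration (in practice it would come from the CMOR tables), and the function name is hypothetical.

```python
from collections import defaultdict


def find_output_clashes(out_names):
    """out_names: {target: output name} for one table and one component.

    Returns {output name: [targets]} for names shared by several targets.
    """
    by_name = defaultdict(list)
    for target, name in sorted(out_names.items()):
        by_name[name].append(target)
    return dict((name, targets) for name, targets in by_name.items() if len(targets) > 1)


# Hand-typed example for table 6hrPlevPt, following the errors reported above.
example = {"ua": "ua", "ua7h": "ua", "zg7h": "zg", "zg27": "zg", "ps": "ps"}
```

Each clash then needs a decision on which target to keep, for instance based on priority.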

treerink commented 5 years ago

@goord the changes in e0e8dc576098f8a066a36c2088798e00894fcafe, i.e. the extension of prefs.py, do change the json data request files. Part of that is as expected, but I am also somewhat surprised by the rather long lists of changes.

goord commented 5 years ago

Hi @treerink the biggest change is the removal of variables for components that are not in the ec-earth configuration. I figured that e.g. AOGCM experiments should not be bothered with duplicates from e.g. land-surface or tm5, right? This will give a lot of removed variables I guess; I would expect entire blocks of component variables to be removed for certain configurations.

treerink commented 5 years ago

Hi @goord,

Ok, that indeed seems to be the case. I just show one example below; can you check this diff _latest_ _previous_ and agree?

71a72
>             "evspsblsoi",
116c117,199
<     "lpjg": {},
---
>     "lpjg": {
>         "Amon": [
>             "fco2antt",
>             "fco2nat"
>         ],
>         "Emon": [
>             "cSoil",
>             "mrsol",
>             "treeFracNdlDcd",
>             "treeFracBdlEvg",
>             "treeFracBdlDcd",
>             "grassFracC3",
>             "grassFracC4",
>             "pastureFracC3",
>             "pastureFracC4",
>             "nep",
>             "fLuc",
>             "cWood",
>             "nwdFracLut",
>             "fracLut",
>             "vegFrac",
>             "treeFracNdlEvg",
>             "cropFracC3",
>             "cropFracC4"
>         ],
>         "Eyr": [
>             "treeFrac",
>             "grassFrac",
>             "shrubFrac",
>             "cropFrac",
>             "vegFrac",
>             "baresoilFrac",
>             "fracOutLut",
>             "fracInLut",
>             "fracLut"
>         ],
>         "Lmon": [
>             "mrsos",
>             "mrso",
>             "mrros",
>             "mrro",
>             "prveg",
>             "evspsblveg",
>             "evspsblsoi",
>             "tran",
>             "tsl",
>             "treeFrac",
>             "grassFrac",
>             "shrubFrac",
>             "cropFrac",
>             "pastureFrac",
>             "baresoilFrac",
>             "residualFrac",
>             "cVeg",
>             "cLitter",
>             "cProduct",
>             "lai",
>             "gpp",
>             "ra",
>             "npp",
>             "rh",
>             "fFire",
>             "fGrazing",
>             "fHarvest",
>             "nbp",
>             "fVegLitter",
>             "fLitterSoil",
>             "cLeaf",
>             "cRoot",
>             "cCwd",
>             "cLitterAbove",
>             "cLitterBelow",
>             "cSoilFast",
>             "cSoilMedium",
>             "cSoilSlow",
>             "landCoverFrac",
>             "rGrowth",
>             "rMaint"
>         ],
>         "day": [
>             "mrso"
>         ]
>     },
282c365,378
<     "tm5": {}
---
>     "tm5": {
>         "AERmon": [
>             "abs550aer",
>             "od550aer"
>         ],
>         "Amon": [
>             "o3",
>             "o3Clim",
>             "ch4",
>             "ch4Clim",
>             "ch4global",
>             "ch4globalClim"
>         ]
>     }

goord commented 5 years ago

So this is for the AOGCM configuration I assume? Yes evspsblsoi was removed from the ifs parameters (Andrea pointed out it cannot be produced by ifs) and the other ones are not in the AOGCM configuration, so I expect them to be gone.
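For long lists of changes, comparing the two json files as sets can be easier to read than a line diff. A minimal sketch, assuming both files have the component, table, variable layout seen in the diff above; the function names are made up here.

```python
import json


def flatten(varlist):
    """Turn {component: {table: [variables]}} into a set of triples."""
    return set(
        (component, table, variable)
        for component, tables in varlist.items()
        for table, variables in tables.items()
        for variable in variables
    )


def diff_varlists(old, new):
    """Return (removed, added) as sorted lists of (component, table, variable)."""
    old_set, new_set = flatten(old), flatten(new)
    return sorted(old_set - new_set), sorted(new_set - old_set)


def load(path):
    """Read one json data request file."""
    with open(path) as handle:
        return json.load(handle)
```

Calling diff_varlists(load(previous), load(latest)) then lists every removed combination once, independent of where it sits in the file.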

goord commented 5 years ago

So @ufladrich and @tommibergman if you run drq2varlist or ece2cmor with the --drq option and you don't want to be bothered with duplicates from other submodels than your targeted EC-Earth configuration, you have to provide your configuration, e.g.

drq2varlist --drq <something.xlsx> --ececonf EC-EARTH-AOGCM

to remove all variables not in ifs or nemo.
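In effect, the configuration acts as a filter on components. The sketch below only encodes the one mapping stated here (EC-EARTH-AOGCM keeps ifs and nemo); the table of configurations, the function name and the choice to leave inactive components as empty entries are assumptions for illustration.

```python
# Only the mapping mentioned in this comment is filled in; other
# configurations would need their own entries.
CONFIG_COMPONENTS = {"EC-EARTH-AOGCM": frozenset(["ifs", "nemo"])}


def restrict_to_config(varlist, config):
    """Empty the variable blocks of components that are not part of config."""
    active = CONFIG_COMPONENTS[config]
    return dict(
        (component, tables if component in active else {})
        for component, tables in varlist.items()
    )
```

Inactive components are kept as empty blocks, similar to the "lpjg": {} and "tm5": {} entries in the diff earlier in this thread.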

goord commented 5 years ago

@treerink I removed tsl from lpjguess in the prefs.py and fixed a bug concerning EC-EARTH-CC so you may want to regenerate the json files...

ufladrich commented 5 years ago

I had --ececonf EC-EARTH-Veg in my earlier tests.

treerink commented 5 years ago

Done, the current latest version of the control output files in the r6705-control-output-files branch do contain these changes.

treerink commented 5 years ago

I think we can (nearly) close this issue.

The only sub issue I am not sure whether it is solved by now is this one about "duplication of ua, ua7h etc. which is a problem because variables will be overwritten".

treerink commented 5 years ago

A separate issue is created in #422 for the last sub issue mentioned above.

Closing this issue.

EC-Earth / ece2cmor3

Create an 'EC-Earth CMIP6 data request' json for each MIP experiment #253