Closed: treerink closed this issue 5 years ago.
I think it is easiest to create this file with checkvars.py, because all model components are considered there.
Hi @treerink, what is the situation here? At SMHI we are in the process of settling on on-the-fly generated xlsx files for the data request. Basically we want to use
drq -m _all_ -e piControl --xls
where we change the experiment, of course, but keep -m _all_ for all runs.
That means we need one data request file per experiment, regardless of the MIPs involved, multiplied by the number of configurations.
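A minimal sketch of how the per-experiment calls could be scripted, assuming the drq executable is on the PATH and takes exactly the flags shown above. The helper names and the experiment list are illustrative only.

```python
import subprocess


def drq_command(experiment):
    """Build the drq call for one experiment, always requesting all MIPs."""
    return ["drq", "-m", "_all_", "-e", experiment, "--xls"]


def generate_requests(experiments, dry_run=True):
    """Run one drq call per experiment, or only print it when dry_run is set."""
    for experiment in experiments:
        cmd = drq_command(experiment)
        if dry_run:
            print(" ".join(cmd))
        else:
            subprocess.run(cmd, check=True)


if __name__ == "__main__":
    # Example experiment names; replace with the experiments actually run.
    generate_requests(["piControl", "historical"])
```

With dry_run left on, the script only prints the commands, which is a cheap way to check the list before producing any xlsx files.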
I guess we should make a decision one way or the other (perhaps in the TWG?) and then document this so that everyone can approach this in the same way. What do you think?
@zklaus the original idea was to produce a json variant of the data request that only includes the variables which are requested for a certain experiment AND which can be produced by the EC-Earth3 model configuration in use, and to archive it in the control output subdirectory of each experiment. That would be the most convenient. The difficulty here, which kept us from implementing this quickly, is again the "preference" issue (also referred to here as the double counting issue).
The whole bunch of original xlsx CMIP6 data request files is of course produced by genecec at the moment I produce the control output files, so I have those and in principle I could easily share them. However, xlsx files are not nice to archive under svn, because they won't give an svn diff (they are difficult to diff anyway, though it is possible to a certain extent) and because of their size. The latter matters because there are quite a lot of experiments.
I've been thinking about this issue and we also discussed the xlsx files here at BSC. It would be nice to have the xlsx tables and/or the .json files that should be used by ece2cmor3 to cmorize each one of the MIPs in the ctrl folder. We could use this file as a reference for that MIP, assuming it was generated by the Data Request and has the correct variables. Right now our idea was to have the ppt/xml files in runtime/ctrl and the tables somewhere else, but I'm not sure this is the best approach. If we had a reliable table inside each folder, for DCPP, piControl, OMIP, etc., we can just point ece2cmor.py to that file.
See also the discussion in #224. Solving this issue, i.e. providing json data request files, depends on a solution for the double-counting variables with a preference file.
We just discussed the general design: whether and how we will create the json data request file, and where it will be archived.
We noted that for a joint data request like the one for the Core MIP experiments run by the AOGCM version (the joined request of these 10 MIPs) the activity_id is CMIP, and that this means we can jointly upload this joined CMIP data for each EC-Earth model configuration. The same applies to MIPs that only request data, like CORDEX: if they request data within e.g. ScenarioMIP, then the activity_id is ScenarioMIP. In a third case, in which experiments are shared across MIPs, I understand the MIPs can be listed in a certain order in the activity_id, separated by a single space.
An additional script will be created (which will be called for each experiment by genecec). It reads the general (joined) .xlsx data request file (as created by drq when running genecec) and uses the taskloader to omit the variable - table combinations which are in the ignored list for EC-Earth3. The tasks will then be matched against a preference file in order to account for the double counting variables (#224). This new script will thus need two arguments:
1. The .xlsx data request file
2. The EC-Earth3 model configuration (e.g. EC-Earth3-AOGCM)
The name of the generated json data request file will be labeled by the EC-Earth3 model configuration, and in the few cases where a MIP is run by more than one EC-Earth3 model configuration, there will be more than one json data request file in the control output directory. Note however that for the Core MIP there is already a separation per EC-Earth3 model configuration, so only one json data request file will end up in these directories.
The control output files themselves won't be made preference (i.e. EC-Earth3 model configuration) specific, in order to keep the design clear, at the cost of a tiny bit of additional (useless) output.
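The ignore-and-write part of such a script could look roughly like the sketch below. It is not the actual implementation: the input triples, the ignored set and the file naming scheme are all hypothetical, and the preference matching is left out.

```python
import json


def build_request(candidates, ignored):
    """Group (component, table, variable) triples by component and table,
    skipping every (variable, table) pair found in the ignored set."""
    request = {}
    for component, table, variable in candidates:
        if (variable, table) in ignored:
            continue
        request.setdefault(component, {}).setdefault(table, []).append(variable)
    return request


def request_filename(experiment, ececonf):
    """Hypothetical naming scheme: label the file by the model configuration."""
    return "cmip6-data-request-{}-{}.json".format(experiment, ececonf)


def write_request(request, path):
    """Write the request as indented json, with components in sorted order."""
    with open(path, "w") as handle:
        json.dump(request, handle, indent=4, sort_keys=True)
```

Keeping the configuration in the file name is what allows several json files to sit side by side in one control output directory.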
> We noted that for a joint data request like for the Core MIP experiments run by the AOGCM version (the joined request of these 10 MIPs) the activity_id is CMIP and that this means we can jointly upload this joined CMIP data for each EC-Earth model configuration. The same applies for only data requesting MIPs like CORDEX if they request data within e.g. ScenarioMIP, then the activity_id is ScenarioMIP. In a third case, in which experiments are shared across MIPs, I understand the MIPs can be listed in a certain order in the activity_id, separated by a single space.
This sounds good. Indeed, the activity_id only depends on the experiment_id.
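Since the activity_id follows from the experiment_id alone, it can be derived with a plain lookup. The sketch below uses a small illustrative table; the authoritative mapping is the CMIP6 controlled vocabulary, and the "shared-experiment" entry with its MIP names is made up to show the space-separated case.

```python
# Illustrative subset only, not the full CMIP6 controlled vocabulary.
ACTIVITIES = {
    "piControl": ["CMIP"],
    "historical": ["CMIP"],
    "ssp245": ["ScenarioMIP"],
    "shared-experiment": ["MIP-A", "MIP-B"],  # made-up example of a shared experiment
}


def activity_id(experiment_id, activities=ACTIVITIES):
    """Return the activity_id string: MIPs in listed order, single-space separated."""
    return " ".join(activities[experiment_id])
```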
> There will be created an additional script (which will be called for each experiment by genecec) which reads the general (joined) .xlsx data request file (as created by drq during running genecec) and uses the taskloader to omit the variable - table combination which are in the ignored list for EC-Earth3 and the tasks will be matched against a preference file in order to account for the double counting variables #224.
Sounds good.
> This new script will thus need two arguments: 1. The .xlsx data request file 2. The EC-Earth3 model configuration (e.g. EC-Earth3-AOGCM).
Wrt the configurations, note that this is CMIP6 controlled vocabulary as source_id. Hence we should stick to the exact spelling of the official list, which is
EC-Earth3
EC-Earth3-AerChem
EC-Earth3-CC
EC-Earth3-GrIS
EC-Earth3-HR
EC-Earth3-LR
EC-Earth3-Veg
EC-Earth3-Veg-LR
Note the capitalization, the presence of the 3, the absence of an explicit -AOGCM version (which is the version without a suffix) and the spelling of GrIS.
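A small check against this list could catch spelling slips early. The sketch below hard-codes the eight names quoted above; the function name and the case-insensitive hint are illustrative additions.

```python
OFFICIAL_SOURCE_IDS = frozenset([
    "EC-Earth3",
    "EC-Earth3-AerChem",
    "EC-Earth3-CC",
    "EC-Earth3-GrIS",
    "EC-Earth3-HR",
    "EC-Earth3-LR",
    "EC-Earth3-Veg",
    "EC-Earth3-Veg-LR",
])


def check_source_id(name):
    """Return name if it is spelled exactly as in the list, else raise ValueError."""
    if name in OFFICIAL_SOURCE_IDS:
        return name
    # Offer a hint when only the capitalization differs.
    by_lower = {source_id.lower(): source_id for source_id in OFFICIAL_SOURCE_IDS}
    suggestion = by_lower.get(name.lower())
    hint = "; did you mean {!r}?".format(suggestion) if suggestion else ""
    raise ValueError("{!r} is not an official source_id{}".format(name, hint))
```

For example, "EC-Earth3-GRIS" is rejected with a pointer to "EC-Earth3-GrIS", while "EC-Earth3-AOGCM" is rejected without a suggestion.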
> The name of generated json data request file will be labeled by the Earth3 model configuration, and in a few cases where a MIP is run by more than one Earth3 model configuration, there will be more than one json data request file in the control output directory. Note however that for the Core MIP there is already a separation per Earth3 model configuration, so only one json data request file will end up in these directories. The control output files themselves won't be made preference (i.e. Earth3 model configuration) specific, in order to keep the design clear, on costs of a very limited tiny bit of additional (useless) output.
I'm not sure I understand how treating the CMIP experiments differently from the others simplifies things, but I guess you are in the better position to judge that.
Subtasks for this issue
When running:
./drq2varlist.py --drq cmip6-data-request/cmip6-data-request-m\=CMIP.DCPP.LS3MIP.PAMIP.RFMIP.ScenarioMIP.VolMIP.CORDEX.DynVar.SIMIP.VIACSAB-e\=piControl-t\=1-p\=1/cmvme_cm.co.dc.dy.ls.pa.rf.sc.si.vi.vo_piControl_1_1.xlsx --ececonf nemo,ifs
I get the following additions when changing from e46cc12949e40f83dfe01de1edfc757afb5a0f98 to the latest version 3ae9a712f0e06dfb52d051abc400a64a49d1d712:
< "zg500",
< ],
< "AERmon": [
< "ua"
< ],
< "AERmonZ": [
< "ta"
Hi Gijs,
I also get quite some differences in the output of genecec, i.e. differences in the output control files and the volume estimates, when running genecec in the master (there it is still the same as my previous benchmark run) and in the latest version 3ae9a712f0e06dfb52d051abc400a64a49d1d712 in the task-load-prefs branch. I guess this is due to dd454263417cc6b88e81cf42cce0453db48d81c5?
Hi @treerink yes I changed the task loader, so this is expected to impact the genecec script. I do expect that it generates more 'double counted' variables, because the realm check was there to prevent such variables. I inserted a new warning whenever a duplicate variable is encountered:
Multiple models found for variable %s, table %s...choosing first but preference needed
so searching for this message may pinpoint where the script is behaving differently...
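One way to search for it is to scan the log for the message and collect the variable and table names. This sketch assumes the %s placeholders are filled with names that contain no whitespace and that the rest of the message appears verbatim; the function name is illustrative.

```python
import re

# Mirrors the warning template quoted above; (\S+) stands in for each %s.
WARNING = re.compile(
    r"Multiple models found for variable (\S+), table (\S+?)\.\.\.choosing first"
)


def duplicate_warnings(lines):
    """Return the (variable, table) pairs named in matching warning lines."""
    found = []
    for line in lines:
        match = WARNING.search(line)
        if match:
            found.append((match.group(1), match.group(2)))
    return found
```

Feeding it the lines of a genecec log, e.g. via open(logfile), gives the list of variables that still need an explicit preference.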
The creation of the json cmip6 data request files with drq2varlist.py has been added to genecec, which means these files are now created for all MIP experiments. If a MIP experiment is carried out by more than one EC-Earth3 model configuration, then such a cmip6 data request json file is created for each EC-Earth3 model configuration in the control output file subdirectory of this MIP experiment.
The json cmip6 data request files are also properly produced for the joined CMIP requests.
These json cmip6 data request files will be added in a new branch for the control output file updates and will end up hopefully soon in the trunk of the EC-Earth3 svn repository.
Closing this issue. Please open a new issue if some issue arises with these new json cmip6 data request files.
Note that these new json cmip6 data request files do not include ignored variables, that the preferences for "double counting variables" are applied, and that the file is ordered by model component to make it easy to inspect the files.
@treerink, @goord great! Thanks a bunch! :+1:
@ufladrich, maybe this can help?
Hi @treerink,
I'm afraid I'm still confused about the usage of drq2varlist. I have applied it to the xls data request that I was using to cmorise before and then I used --vars instead of --drq when running ece2cmor. However, I get a number of errors like
ERROR:ece2cmor3.taskloader: Found duplicate target mrsos in table 3hr for models lpjg and ifs
and then
CRITICAL:ece2cmor3.taskloader: Duplicate requested variables were found, dismissing all cmorization tasks
No output is produced. (As a side note, the IFS job still goes on doing all the time-consuming grib filtering.) When I manually remove all the duplicated targets and duplicated output names from the varlist json file, I get at least a non-empty task list. What am I doing wrong or misunderstanding?
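To list such duplicated targets before starting a cmorization, the varlist json can be inspected directly. This is a sketch that assumes the json layout is component, then table, then a list of variable names; the function name is illustrative.

```python
from collections import defaultdict


def duplicate_targets(varlist):
    """Map each (table, variable) claimed by several components to those components."""
    owners = defaultdict(set)
    for component, tables in varlist.items():
        for table, variables in tables.items():
            for variable in variables:
                owners[(table, variable)].add(component)
    return {key: sorted(comps) for key, comps in owners.items() if len(comps) > 1}
```

An empty result means no target is requested from more than one component.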
Hi Uwe you aren't doing anything wrong, this is a signal that our "preference" script is incomplete, since it doesn't make a choice between ifs or lpjg for e.g. mrsos.
I will make the preferences complete and add a check for ifs variables before entering the grib filtering
Hi @ufladrich or @tommibergman can you post the list of duplicate variables that were reported?
I got these:
mrsos mrro mrsol mrso mrros evspsblsoi mrsos
Some of them are doubly mentioned through different tables, but maybe that doesn't matter.
Ok I committed a fix in which the above variables will be removed from the lpjguess variable list.
I have yet to understand what "preference" means in the context of this issue. @goord when you say above that the "preference script is incomplete", do you mean drq2varlist? And if that is the case, does it mean that the preference logic is built into drq2varlist? What I mean is, how does drq2varlist know that the above variables should be taken from IFS, not LPJG?
There are two more duplicated targets:
ERROR:ece2cmor3.taskloader: Found duplicate target tsl in table Lmon for models lpjg and ifs
ERROR:ece2cmor3.taskloader: Found duplicate target tsl in table 6hrPlevPt for models lpjg and ifs
and some duplicated output names:
ERROR:ece2cmor3.taskloader: Found duplicate output name for targets ua, ua7h in table 6hrPlevPt for model ifs
ERROR:ece2cmor3.taskloader: Found duplicate output name for targets va, va7h in table 6hrPlevPt for model ifs
ERROR:ece2cmor3.taskloader: Found duplicate output name for targets ta, ta7h in table 6hrPlevPt for model ifs
ERROR:ece2cmor3.taskloader: Found duplicate output name for targets zg7h, zg27 in table 6hrPlevPt for model ifs
I'm not sure what to think about the latter, according to the CMIP6-CMOR tables the duplication is okay.
It means that resources/prefs.py does not yet cover all duplicate variables. The infrastructure is there, but we still have to make sure all duplicate variables are covered in the prefs.py file, and for that we usually need the feedback of the scientists.
Hi @ufladrich the preference script is here. It is just a python function that determines which variables to keep for which configurations and which to dismiss.
Yes the preference logic is called from drq2varlist. This script gathers all variables that any EC-Earth component could produce, and then runs all of them through the preference function that determines whether to keep it or not. This procedure is supposed to yield a unique set of variables for all data requests and all EC-Earth configurations.
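As a rough illustration of the idea (not the actual prefs.py), a preference step can be written as a filter over all candidate variables. The rule table below is hypothetical and only reflects the ifs over lpjg choice mentioned in this thread.

```python
# Hypothetical rule table: variable name -> component that should provide it.
PREFERRED = {
    "mrsos": "ifs",
    "mrro": "ifs",
    "mrsol": "ifs",
    "mrso": "ifs",
    "mrros": "ifs",
}


def apply_preferences(candidates, preferred=PREFERRED):
    """Drop (component, table, variable) triples that lose against a preferred
    component offering the same variable in the same table."""
    kept = []
    for component, table, variable in candidates:
        winner = preferred.get(variable)
        rivals = {c for c, t, v in candidates if (t, v) == (table, variable)}
        if winner in rivals and component != winner:
            continue
        kept.append((component, table, variable))
    return kept
```

A variable without a rule is kept for every component that offers it, which is exactly the case that later shows up as a duplicate and signals a missing preference.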
Whenever you call ece2cmor with the --drq option, it does a drq2varlist first and then a cmorization with the component-wise variable set. It performs a check on the latter to ensure there are no duplicates, because that may give rise to files being overwritten.
BTW whenever calling ece2cmor with --drq option or drq2varlist, it is best to give also a target EC-Earth configuration (use --help to get a list of those), because that can be used to determine the preference and hence reduces the chance of ending up with duplicates.
The duplication of ua, ua7h etc. is a problem because it will cause overwritten variables since the output file names for these variables are identical (see issue #334 ). I believe they have different priorities, and we should decide which ones to keep.
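Such clashes can be found by grouping targets on their table and output name. The sketch below takes (table, target, out_name) triples as input; that input shape is an assumption, as is the out_name "ua" used for both ua and ua7h in the usage example.

```python
from collections import defaultdict


def output_name_clashes(targets):
    """Group (table, target, out_name) triples and return the groups in which
    more than one target would be written under the same output name."""
    groups = defaultdict(list)
    for table, target, out_name in targets:
        groups[(table, out_name)].append(target)
    return {key: names for key, names in groups.items() if len(names) > 1}


# Usage with assumed out_name values:
clashes = output_name_clashes([
    ("6hrPlevPt", "ua", "ua"),
    ("6hrPlevPt", "ua7h", "ua"),
    ("6hrPlevPt", "va", "va"),
])
```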
@goord the changes in e0e8dc576098f8a066a36c2088798e00894fcafe, i.e. the extension of prefs.py, do change the json data request files. Part of that is as expected, but I am also somewhat surprised by the rather long lists of changes.
Hi @treerink the biggest change is the removal of variables for components that are not in the ec-earth configuration. I figured that e.g. AOGCM experiments should not be bothered with duplicates from e.g. land-surface or tm5, right? This will give a lot of removed variables I guess; I would expect entire blocks of component variables to be removed for certain configurations.
Hi @goord,
Ok, that seems indeed the case. I just show one example below, can you check this diff _latest_ _previous_ and agree?
71a72
> "evspsblsoi",
116c117,199
< "lpjg": {},
---
> "lpjg": {
> "Amon": [
> "fco2antt",
> "fco2nat"
> ],
> "Emon": [
> "cSoil",
> "mrsol",
> "treeFracNdlDcd",
> "treeFracBdlEvg",
> "treeFracBdlDcd",
> "grassFracC3",
> "grassFracC4",
> "pastureFracC3",
> "pastureFracC4",
> "nep",
> "fLuc",
> "cWood",
> "nwdFracLut",
> "fracLut",
> "vegFrac",
> "treeFracNdlEvg",
> "cropFracC3",
> "cropFracC4"
> ],
> "Eyr": [
> "treeFrac",
> "grassFrac",
> "shrubFrac",
> "cropFrac",
> "vegFrac",
> "baresoilFrac",
> "fracOutLut",
> "fracInLut",
> "fracLut"
> ],
> "Lmon": [
> "mrsos",
> "mrso",
> "mrros",
> "mrro",
> "prveg",
> "evspsblveg",
> "evspsblsoi",
> "tran",
> "tsl",
> "treeFrac",
> "grassFrac",
> "shrubFrac",
> "cropFrac",
> "pastureFrac",
> "baresoilFrac",
> "residualFrac",
> "cVeg",
> "cLitter",
> "cProduct",
> "lai",
> "gpp",
> "ra",
> "npp",
> "rh",
> "fFire",
> "fGrazing",
> "fHarvest",
> "nbp",
> "fVegLitter",
> "fLitterSoil",
> "cLeaf",
> "cRoot",
> "cCwd",
> "cLitterAbove",
> "cLitterBelow",
> "cSoilFast",
> "cSoilMedium",
> "cSoilSlow",
> "landCoverFrac",
> "rGrowth",
> "rMaint"
> ],
> "day": [
> "mrso"
> ]
> },
282c365,378
< "tm5": {}
---
> "tm5": {
> "AERmon": [
> "abs550aer",
> "od550aer"
> ],
> "Amon": [
> "o3",
> "o3Clim",
> "ch4",
> "ch4Clim",
> "ch4global",
> "ch4globalClim"
> ]
> }
So this is for the AOGCM configuration I assume? Yes, evspsblsoi was removed from the ifs parameters (Andrea pointed out it cannot be produced by ifs) and the other ones are not in the AOGCM configuration, so I expect them to be gone.
So @ufladrich and @tommibergman, if you run drq2varlist or ece2cmor with the --drq option and you don't want to be bothered with duplicates from submodels outside your targeted EC-Earth configuration, you have to provide your configuration, e.g.
drq2varlist --drq <something.xlsx> --ececonf EC-EARTH-AOGCM
to remove all variables not in ifs or nemo.
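The effect on the json file can be pictured as follows. This is a sketch, not the ece2cmor3 code: the mapping contains only the one configuration whose components are stated here, and the function name is illustrative.

```python
# Only the entry stated in this thread; other configurations are not listed here.
CONFIG_COMPONENTS = {"EC-EARTH-AOGCM": {"ifs", "nemo"}}


def restrict_to_configuration(varlist, ececonf, config_components=CONFIG_COMPONENTS):
    """Empty the table dictionary of every component outside the configuration."""
    active = config_components[ececonf]
    return {
        component: (tables if component in active else {})
        for component, tables in varlist.items()
    }
```

Inactive components stay in the result with an empty dictionary, which matches the "lpjg": {} and "tm5": {} entries in the diff above.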
@treerink I removed tsl from lpjguess in the prefs.py and fixed a bug concerning EC-EARTH-CC so you may want to regenerate the json files...
I had --ececonf EC-EARTH-Veg in my earlier tests.
Done, the current latest version of the control output files in the r6705-control-output-files branch does contain these changes.
I think we can (nearly) close this issue.
The only sub issue of which I am not sure whether it is solved by now is the one about "duplication of ua, ua7h etc. which is a problem because variables will be overwritten".
A separate issue is created in #422 for the last sub issue mentioned above.
Closing this issue.
With 'EC-Earth CMIP6 data request' I mean the subset of CMIP6 requested variables for a certain MIP experiment which indeed can be produced by EC-Earth3.
If this 'EC-Earth CMIP6 data request' is written to a json file it can be easily used as the data request file at time of cmorization, it can be easily diffed and it can be copied in the namelist subdir of each MIP experiment and thus archived at the EC-Earth svn repository. The latter wouldn't be a good idea with the *.xlsx data request files.
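As an illustration of how easily such json files can be compared, the sketch below flattens two requests into sets and reports what was added and removed. It assumes the component, table, variable-list layout; the function names are illustrative.

```python
import json


def load_request(path):
    """Read a json data request file into a nested dictionary."""
    with open(path) as handle:
        return json.load(handle)


def flatten(request):
    """Turn the nested request into a set of (component, table, variable) triples."""
    return {
        (component, table, variable)
        for component, tables in request.items()
        for table, variables in tables.items()
        for variable in variables
    }


def diff_requests(old, new):
    """Return two sorted lists: triples added in new, triples removed from old."""
    old_set, new_set = flatten(old), flatten(new)
    return sorted(new_set - old_set), sorted(old_set - new_set)
```

Unlike a textual diff, this comparison ignores the ordering of variables within a table.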