Check data request priority rules

aearamos commented 5 years ago

Hi everyone,

I just updated my tables and drq (1.00.27) and ran some tests to generate the ppt and xml files for different MIPs.

When running ./generate-ec-earth-namelists.sh CMIP piControl 1 1 I get the following error right after drq2ppt:

Traceback (most recent call last):
  File "./drq2ppt.py", line 172, in <module>
    main()
  File "./drq2ppt.py", line 162, in main
    taskloader.load_targets(args.vars, active_components={"ifs": True, "nemo": False})
  File "/home/Earth/aamaral/cmorize/ece2cmor3/ece2cmor3/taskloader.py", line 55, in load_targets
    targetlist = load_targets_excel(varlist)
  File "/home/Earth/aamaral/cmorize/ece2cmor3/ece2cmor3/taskloader.py", line 129, in load_targets_excel
    priority_index = row.index(priority_colname)
ValueError: 'Priority' is not in list

After some debugging, I noticed that the script runs ./drq2ppt.py --vars cmip6-data-request/cmip6-data-request-m=CMIP-e=piControl-t=1-p=1/cmvmm_CMIP_TOTAL_1_1.xlsx, and that one of the column labels in that table is "Default Priority" rather than "Priority", which causes the error. After I changed the label in each tab by hand, the ppt and xml files were generated just fine.

How can this issue be fixed? I only noticed it now and didn't have this problem before.

treerink commented 5 years ago

@aearamos Indeed, this error occurs after updating to drq version 1.00.27; ece2cmor3 currently uses version 1.00.26 (so you were a bit too fast). However, I will try to update ece2cmor3 soon. The easy, and hopefully correct, fix is to change the following line in taskloader.py

priority_colname = "Priority"

to

priority_colname = "Default Priority"
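A more defensive variant, sketched here under assumptions, would accept either header so that the loader works with spreadsheets from both drq 1.00.26 and 1.00.27. The helper name `find_priority_index` and the `PRIORITY_COLNAMES` tuple are hypothetical and not part of ece2cmor3; only the `row.index(...)` lookup on the header row is taken from the traceback above, and the sample header rows are invented.

```python
# Hypothetical helper, not part of ece2cmor3: locate the priority column
# regardless of which header label the spreadsheet uses.
PRIORITY_COLNAMES = ("Priority", "Default Priority")


def find_priority_index(header_row, candidates=PRIORITY_COLNAMES):
    """Return the index of the first candidate label found in header_row."""
    for name in candidates:
        if name in header_row:
            return header_row.index(name)
    raise ValueError("none of %r found in header row %r" % (candidates, header_row))


# Invented header rows, standing in for the two spreadsheet flavours.
old_header = ["Priority", "Long name", "units"]
new_header = ["Default Priority", "Long name", "units"]
print(find_priority_index(old_header), find_priority_index(new_header))
```

Trying the labels in a fixed order keeps the behaviour predictable if a sheet ever carries both columns, and the explicit ValueError names the header row that failed.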

I am only a bit puzzled by the drq 01.00.27 release notes:

improved spreadsheets provided on web site: the "cmvme" tables have a priority in column one: previously this was the CMOR variable default priority, which caused some confusion. Now changed to be the priority set by the requesting MIP for that experiment.

aearamos commented 5 years ago

@treerink That's how I fixed my program and generated the files. Thanks!

About their issue, as I pointed out before, the same variable can have two different priorities depending on the MIP. e.g. I think salinity has a Default Priority of 1, but for DCPP its priority is 2. I think that's why it's different.

I also noticed that in generate-ec-earth-namelists.sh, drq2ppt uses the cmvmm_TOTAL file, while drq2file_def-nemo uses the cmvme_experiment file. I mention this because the first (cmvmm_TOTAL) has a Default Priority column and the second (cmvme_experiment) has a Priority column. One fix is to use the same Excel file for both scripts and set the column name in taskloader.py accordingly. I think the result will be the same, right? Thanks

treerink commented 5 years ago

Hi, I posted a question at their open issue, as the strategy to identify all variables relies on this. So I hope it will become really clear. This issue is separate for some reason but related to this closed one.

zklaus commented 5 years ago

The original issue here (changed name across drq versions) seems to be resolved. Shall we close this?

treerink commented 5 years ago

Actually I kept it open because it is rather relevant to have an answer from the data request people.

I will however change the subject, as I myself also wonder every time why this one is not closed.

zklaus commented 5 years ago

Hi Thomas,

I see. Since that ticket hasn't gotten any response since August last year, and indeed no response to your comment at all, let's see if we can update the question with the developments since then. After that I will prod Martin about it again. Let me check that I understand your questions correctly, summarizing them from over there:

The "aggregated spreadsheet" are the files labeled by cmvmm_, correct?

I don't know, but the cmvme files also seem to carry aggregated information.

The column labeled "Default Priority" is a kind of default priority of a certain variable for this MIP if I understand correctly. This priority however can differ within a MIP for different experiments I understand.

Ok, a couple of things to unpack. When talking about variables in the data request, we have to distinguish at least three different entities in the dreqML: MIP Variable [var], CMOR Variable [CMORvar], and Request variable [requestVar]. None of them has anything directly to do with the netCDF files, i.e. the variable names; that information comes later from the CMIP tables.

vars are very general. They really only fix cf standard name and units, and give some textual information. Despite the name (MIP Variable) they are not connected to any mip, except that they might give a textual clue about which mip came up with them originally, which is, in general, not an activity in the sense of CMIP6 controlled vocabulary, but can be something like CMIP5. They are also independent of tables.
Next we have CMORvars. They link to a var and add a lot of information. At this level we find the connection with the tables, the grids (both spatial and temporal), crucial processing instructions (like here), and the defaultPriority. Note that this is simply the priority that was deemed appropriate by the creator of the entry. At this level there is no connection to activities (vulgo mips) yet, and there is no aggregation of their priorities going on.
Finally we have requestVars. These link the CMORvars with mips and requestVarGroups which are collections of variables that the mip thinks are connected. On top of that requestVars also give the priority that the mip thinks this variable should have in this requestVarGroup.
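To make the three levels concrete, here is a minimal sketch of how they link to each other. The class and field names are simplified stand-ins chosen for illustration and do not reproduce the actual dreqML schema; the example values (table, group name, priorities) are invented.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Var:
    """MIP variable: essentially CF standard name and units."""
    label: str
    standard_name: str
    units: str


@dataclass(frozen=True)
class CMORVar:
    """CMOR variable: a Var placed in a table, with a default priority."""
    var: Var
    table: str
    default_priority: int  # set by the entry's creator, independent of any mip


@dataclass(frozen=True)
class RequestVar:
    """Request variable: a CMORVar as requested by one mip in one group."""
    cmor_var: CMORVar
    mip: str
    group: str  # stands in for a requestVarGroup
    priority: int  # priority this mip assigns within this group


# Invented example values, only to show the links between the levels.
so = Var("so", "sea_water_salinity", "0.001")
so_omon = CMORVar(so, "Omon", default_priority=1)
request = RequestVar(so_omon, mip="DCPP", group="example-group", priority=2)
print(request.cmor_var.default_priority, request.priority)
```

The point of the layering is that `default_priority` lives on the CMOR variable and never changes, while each request carries its own `priority`.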

What does it all mean? Well, I think the concept of default priority is independent of mip. Variables have default priorities before they are assigned to mips. The good news is that default priority does not vary with mips. It is simply an indicator of how important the variable has been considered overall by someone. On the other hand, we have the concept of priorities within requestVarGroups (and yes, the same variable in the same table can have different priorities in different groups within the same mip; check out (Lmon, baresoilFrac)).

Noting further that the cmvmm files talk of Default Priority with a comment of

Default priority (generally overridden by settings in "requestVar" record)

whereas the cmvme files talk about Priority with a comment of

Lowest priority value set in request for this variable for this experiment

seems to suggest that the cmvme files carry the more relevant aggregate priority.
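Read literally, the cmvme comment describes a minimum over the requested priorities. Below is a minimal sketch of that reading, assuming each request for a single experiment is reduced to a (variable, priority) pair. The function name and the sample pairs are hypothetical, and this is not a claim about how dreqPy computes the column.

```python
def aggregate_priorities(requests):
    """Map each variable to the lowest priority value (1 = most important)
    found among its requests."""
    result = {}
    for variable, priority in requests:
        result[variable] = min(priority, result.get(variable, priority))
    return result


# Invented (variable, priority) pairs for one experiment.
requests = [("baresoilFrac", 2), ("baresoilFrac", 1), ("c4PftFrac", 3)]
print(aggregate_priorities(requests))
```

Under this reading, a variable requested at priority 1 by any group ends up at priority 1 for the experiment, whatever its default priority is.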

I hope/expect then that the default priority gives the highest occurring priority for this certain variable which is encountered among the experiments within one MIP (where 1 is the highest priority and 3 is the lowest priority), is this correct?

No, see e.g. c4PftFrac. I can't say for sure if this is intentional, but it might be and is almost certainly going to occur for some variable. As said above: Default priority as a concept is not applicable within a mip.

If this is not the case, I at least hope that the variables ending up in this cmvmm_ files are selected on this criterion?

Yes, this seems to be the case.

I have often seen higher numbers (2 and 3) in the "Priority" column in the cmvmm_ files in data request up to 01.00.26 while I requested for priority 1.

Probably that changed at some point (at 01.00.27?) in the sense that the cmvmm files don't contain a Priority column anymore; only a Default Priority column.

At SMHI we use the cmvme_ae.c4.cd.cf.cm.co.da.dc.dy.fa.ge.gm.hi.is.ls.lu.om.pa.pm.rf.sc.si.vi.vo_historical_1_1.xlsx files at the moment, and all things considered, I don't see a reason to change that. However, I also don't see why the cmvmm files would be any worse. The requestVol files are certainly not useful for this purpose.

I hope this long dribble helps a bit in the clarification. Cheers Klaus

zklaus commented 5 years ago

After a good night's sleep and another look at the dreqPy documentation (5.8), it seems clear that the cmvmm files aggregate by mip, whereas the cmvme files aggregate by experiment. I think we want the latter, because we want to cmorize the output of one experiment when it is finished, rather than partially cmorize a number of experiments according to the mips involved.
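The difference between the two aggregations can be illustrated with a small sketch. It assumes the request is flattened into (mip, experiment, variable, priority) records and that the lowest priority value per variable is kept in both cases; the latter is an assumption made for the sketch, and the function name and records are invented rather than taken from dreqPy.

```python
from collections import defaultdict


def aggregate(records, key):
    """Group (mip, experiment, variable, priority) records by `key`
    ("mip" or "experiment") and keep the lowest priority value per variable."""
    position = {"mip": 0, "experiment": 1}[key]
    tables = defaultdict(dict)
    for record in records:
        variable, priority = record[2], record[3]
        group = tables[record[position]]
        group[variable] = min(priority, group.get(variable, priority))
    return {name: dict(variables) for name, variables in tables.items()}


# Invented records: (mip, experiment, variable, priority).
records = [
    ("CMIP", "piControl", "tas", 1),
    ("CMIP", "historical", "tas", 1),
    ("DCPP", "historical", "so", 2),
    ("CMIP", "historical", "so", 1),
]
by_mip = aggregate(records, "mip")
by_experiment = aggregate(records, "experiment")
print(by_experiment["historical"])
```

Grouping by experiment yields one variable list per finished run, which is the unit we actually cmorize; grouping by mip spreads the same run over several lists.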

For the same reason I suggest using the cmvme file that contains all the mips, i.e. the long filename mentioned above. Then the last question is about tier and priority.

Do we want to consider also lower tier and priority variables or only 1 and 1?

zklaus commented 5 years ago

It seems we have moved on from this discussion. @treerink, do you think this could be closed now?

treerink commented 5 years ago

Closing this issue after adding it to the Cold case issues.

EC-Earth / ece2cmor3

Check data request priority rules #206