What about double-counting variables

goord commented 6 years ago

It may happen that variables can be produced by more than one component (especially in the case of tm5-ifs or lpjg-ifs). We should come up with a mechanism to give precedence to certain models for certain variables.

goord commented 6 years ago

Currently, a prioritization is made based upon the realms (see output of taskloader)

tommibergman commented 6 years ago

For the TM5-IFS part, we would mostly like TM5 to take precedence for the AER tables and IFS for the A tables. I am sure there are a few exceptions, but this would be a first-order suggestion.

Precedence of IFS over TM5 is true especially for the meteorological variables (these are mainly in Amon), since anyone using the data can always regrid to lower resolution.

tommibergman commented 5 years ago

We decided to produce a file listing the double-counted variables, with rules on which component should produce each variable in which case. The format is: variable name, table, list of components in order of preference. So, for example, a line pfull AERmon [tm5,ifs] would mean that the pfull variable for the AERmon table is produced from tm5 if tm5 is present, and otherwise from ifs.

Actually the table column could also be a list, since more than one table, but not all, can have the same preference. Or what do others think?

Attached is a list for TM5 double-counting.txt
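
The line format described above can be sketched in a few lines of Python. This is only an illustration under assumptions: the function names parse_preferences and choose_component are hypothetical, the attached file may differ in detail, and the handling of "#" comments is an addition that is not part of the proposal.

```python
import re

# Assumed line format, following the example in the comment above:
#   <variable> <table> [<component>,<component>,...]
LINE_RE = re.compile(r"^(\S+)\s+(\S+)\s+\[([^\]]*)\]$")


def parse_preferences(text):
    """Return {(variable, table): [components in preferred order]}."""
    prefs = {}
    for raw in text.splitlines():
        # Assumption: blank lines and '#' comments are allowed and skipped.
        line = raw.split("#", 1)[0].strip()
        if not line:
            continue
        match = LINE_RE.match(line)
        if match is None:
            raise ValueError("Cannot parse preference line: %r" % raw)
        variable, table, components = match.groups()
        prefs[(variable, table)] = [
            c.strip() for c in components.split(",") if c.strip()
        ]
    return prefs


def choose_component(prefs, variable, table, active_components):
    """Pick the first preferred component that is active, else None."""
    for component in prefs.get((variable, table), []):
        if component in active_components:
            return component
    return None


prefs = parse_preferences("pfull AERmon [tm5,ifs]\n")
print(choose_component(prefs, "pfull", "AERmon", {"ifs", "tm5"}))
print(choose_component(prefs, "pfull", "AERmon", {"ifs", "nemo"}))
```

With tm5 active the first call selects tm5; without it the second call falls back to ifs, which is the behaviour the example line is meant to encode.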

goord commented 5 years ago

It should also be noted that the user will have to give the 'model configuration' (i.e. list of components) that has produced the data, even though one is only cmorizing variables for one component at a time...

goord commented 5 years ago

Hi @tommibergman and @treerink, after some thought I came to the following conclusion: it may be more appropriate to write a separate script that splits the input data request into json variable list files, one per EC-Earth component. In this way, it becomes more traceable and transparent which variables are being produced by which component; the files can even be archived or put under version control together with the model configuration files. This script will of course make use of the preferences file proposed above.

treerink commented 5 years ago

@goord would the same idea be possible, but then with these component json files merged again into one json file at the end, for each mip experiment for a given ece model configuration? This makes the archiving more compact, but also the cmorisation more straightforward, because otherwise one has to specify several jsons and pick the right ones when cmorising. Or does this break your idea?

goord commented 5 years ago

@treerink we can also make a single json file with an extra level denoting the components, e.g.

{
 "ifs": {"Amon": ["ua", "va", "tos"]},
 "nemo": {"Omon": ["sos", "tos"], "3hr": ["tos"]}
}

If one specifies such a json, it can be crystal clear for the task loader and the user which variables will be omitted when processing for a single component.
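
A minimal sketch of how a loader could use such a file to report what is processed and what is omitted for a single component. The layout is the one from the example above, written as nested JSON objects; the function name split_for_component is hypothetical and this is not the actual ece2cmor3 task loader.

```python
import json

# Illustrative request in the component-keyed layout discussed above.
REQUEST = json.loads("""
{
  "ifs": {"Amon": ["ua", "va", "tos"]},
  "nemo": {"Omon": ["sos", "tos"], "3hr": ["tos"]}
}
""")


def split_for_component(request, component):
    """Return (selected, omitted) lists of (table, variable) pairs."""
    selected, omitted = [], []
    for comp, tables in sorted(request.items()):
        for table, variables in sorted(tables.items()):
            pairs = [(table, var) for var in variables]
            # Variables assigned to other components are reported, not lost.
            (selected if comp == component else omitted).extend(pairs)
    return selected, omitted


selected, omitted = split_for_component(REQUEST, "ifs")
print("processed for ifs:", selected)
print("left to other components:", omitted)
```

Because the component is an explicit key, the same variable name (tos here) can appear under several components without any hidden decision by the loader.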

treerink commented 5 years ago

Yes, sounds like a plan. So let's try this for #253.

zklaus commented 5 years ago

This plan of having one json file for each job sounds good!

But I'd like to comment a bit on what a job is: First, the use of MIP together with Experiment seems to me to be quite misplaced. The two are separate entities and there is not a particularly strong connection. Indeed, I think we can and should essentially ignore MIP now that the experiments are designed.

Wrt "model configuration", in common parlance this does not refer merely to a collection of components, but to what are separate models from the point of view of cmip6, eg EC-Earth-Veg, EC-Earth-CC, etc.

These two things are the only two that we should consider for organizing the json files. This would give us a directory structure like

<model configuration>/<Experiment>/data-request.json

eg

EC-Earth-Veg/piControl/data-request.json

The data-request.json file should contain all variables requested by all mips, ie it should be based on the file cmvme_ae.c4.cd.cf.cm.co.da.dc.dy.fa.ge.gm.hi.is.ls.lu.om.pa.pm.rf.sc.si.vi.vo_<Experiment>_3_3.xlsx produced with the -m _all_ switch to drq.

Do you agree?

treerink commented 5 years ago

@zklaus actually I was not after setting up any new directory infrastructure for these data request json files; the idea is just to add them to the existing control output sub directories so they form a set with the control output files for each experiment.

zklaus commented 5 years ago

@treerink fair enough, that should work! What do you think about the -m _all_ thing?

treerink commented 5 years ago

Well, each experiment has its own data request. In some cases (the Core MIP cases) this is a joined data request, because we want to be efficient and run the experiment only once for all the MIPs run by a certain model configuration (EC-Earth3-AOGCM, EC-Earth3-Veg etc.). But genecec accounts for all of this (all these data request files are already generated to produce the control output files, but I did not share them because they are xlsx files), and as soon as we have added the automatic creation of the json data request files this will all be ready for the end user. Note that the only content-wise difference between the xlsx data request files and the json data request files will be that the json ones do not contain variables which are requested but which cannot be produced by EC-Earth, i.e. the ones in the ece2cmor3 ignored list (#253).

As a cmorizer you don't need to do anything with drq -m _all_. In fact, in the identification steps in the background in ece2cmor3 I do use such things, but I noted that "all" in the python dreq package is different from "all" in drq, so I actually prefer to explicitly list the MIPs which I need to include.

In fact, it would also be useful to generate a metadata template file for each experiment and add those to the control output sub directories as well. Several MIP- and experiment-dependent variables could be set by genecec, but the cmorizer always has to modify some things, like the ensemble member label; that is why they will always be called metadata file templates (#214). I am just not sure whether this will be ready in time for the Core MIP cmorization.

treerink commented 5 years ago

By the way, an example of how to create an xlsx data request file in the current situation (as long as the json data request files are not there) is given in the step-by-step wiki.

klauswyser commented 5 years ago

Please have a look at the newly created issue 615 on the EC-Earth dev portal.

The problem is not only the varlists for the different MIPs that are run with the same model configuration, but also the activity_id that is given by the MIP. Most experiments belong uniquely to one MIP so this is not a problem, but what to do with the "historical" experiment? How do we make sure that the variables are saved correctly for each MIP?

treerink commented 5 years ago

Ok, concerning joined Core MIP experiments, your point is that at the time of cmorising you actually don't want to provide a joined cmorised set of variables, but instead want to split out, for each MIP, the list of variables requested by this MIP for the experiment, and then provide the correct activity_id, which then becomes obvious (though it is one of the things the cmorizer has to adjust in the metadata template). You might be right that this is the way we have to provide the cmorized data; it will be a painful amount of identical data with just slightly different metadata. Anyway, it means that I then have to produce with genecec, for each joined Core MIP experiment, a set of json data request files, one for each MIP (which is in itself not too difficult, I think).

klauswyser commented 5 years ago

it will be a painful amount of identical data with just slightly different metadata.

Are you sure about that? Do you think the same variable is in the drq for say SIMIP and CMIP? It would be nice if only variables that are exclusively in SIMIP are processed when running ece2cmor with activity_id=SIMIP, but I don't know if this is the case. Otherwise you are right, and the amount of duplicates would be prohibitive. In that case it would be better to process everything with activity_id=CMIP and then just hope that data users find the data that were produced for SIMIP.

it means that I then have to produce with genecec, for each joined Core MIP experiment, a set of json data request files, one for each MIP (which is in itself not too difficult, I think).

That could be a reason to not produce json files but stick with the xls files that are produced by drq, or?

zklaus commented 5 years ago

I was also pondering these issues, but I have come to the conclusion that the activity_id in the metadata-template.json is always the mip that "owns" the experiment, not the one that requested the variable. This is not clearly spelled out in the CMIP6 documents, the evidence is circumstantial, but substantial. A lot of it comes from [1]:

In [1, Table 3] it is specified that the global attribute activity_id comes from CMIP6_experiment_id.json. In there, activity_id only lists the owners of the experiment.
Looking at the data that is already published on the ESGF, spot checks suggest that this is the reading of the standard by other groups.
In [1] the filename template does not contain the mip, suggesting that files from the same experiment will not be assigned to different mips. This is also supported by the directory template where the mip appears only above the experiment, never below.

There could be a few more of these hints; I didn't find anything supporting the reading that the files should carry the mip that requested the variable. If you don't find this convincing, let me know and I will hunt for some more evidence. Otherwise we can seek clarification from Karl Taylor or maybe one of the other authors of [1].

In the case that actually multiple mips are relevant, [1, Table 1, activity_id row] and [1, Table 1, footnote 3] specify that the mips should be listed together, separated by a single space.

This seems to be applicable only in the case of jointly owned experiments; the complete list of these is:

piClim-aer
piClim-control
ssp370
land-hist
dcppC-forecast-addPinatubo

zklaus commented 5 years ago

@treerink wrt the drq -m _all_ business, you write that as a cmorizer I don't need to deal with that, but the step-by-step process that you link seems to suggest that I do need to do something like

drq -m CMIP,DCPP,LS3MIP,PAMIP,RFMIP,ScenarioMIP,VolMIP,CORDEX,DynVar,SIMIP,VIACSAB -e piControl -t 1 -p 1 --xls --xlsDir ece2cmor3/scripts/cmip6-data-request/cmip6-data-request-m=CMIP.DCPP.LS3MIP.PAMIP.RFMIP.ScenarioMIP.VolMIP.CORDEX.DynVar.SIMIP.VIACSAB-e=piControl-t=1-p=1

Is this correct? In other words, you don't mean that I don't have to do the drq, but just that I can list the applicable mips instead of _all_, right?
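
To make the explicit-MIP-list variant less error-prone, the command above could be composed from a list, roughly as follows. This is a sketch under assumptions: build_drq_command is a hypothetical helper, it uses only the flags that appear in the command above, and the output directory naming simply mirrors that command.

```python
import shlex


def build_drq_command(mips, experiment, tier=1, priority=1,
                      base_dir="ece2cmor3/scripts/cmip6-data-request"):
    """Compose the drq call for an explicit list of MIPs."""
    # Directory name mirrors the pattern of the command quoted above.
    xls_dir = "%s/cmip6-data-request-m=%s-e=%s-t=%d-p=%d" % (
        base_dir, ".".join(mips), experiment, tier, priority)
    return ["drq", "-m", ",".join(mips), "-e", experiment,
            "-t", str(tier), "-p", str(priority),
            "--xls", "--xlsDir", xls_dir]


MIPS = ["CMIP", "DCPP", "LS3MIP", "PAMIP", "RFMIP", "ScenarioMIP",
        "VolMIP", "CORDEX", "DynVar", "SIMIP", "VIACSAB"]

cmd = build_drq_command(MIPS, "piControl")
# Print a shell-safe version; it could also be passed to subprocess.
print(" ".join(shlex.quote(arg) for arg in cmd))
```

Keeping the MIP list in one place means the -m argument and the directory name cannot drift apart.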

goord commented 5 years ago

Ok there are 2 discussions going on here:

The problem with variables being generated by multiple components should be resolved by a 'drq2varlist' script that calls the 'complex' (current) taskloader and also classifies the variables according to the preferred submodel. The ece2cmor3 taskloader will then become straightforward, without hidden decisions: what you see is what you get, and if you request something that doesn't exist it will report an error or maybe even abort the entire cmorization process...
The problem with the activity_id for experiments serving multiple MIPs is something that should have been decided on by the CMIP6 data request or CMOR people. Maybe we can even give multiple activity ids in the metadata, Klaus? We should definitely raise this question with the WCRP people.

zklaus commented 5 years ago

@goord you are right that we kind of derailed the original discussion which was about the same variable being available from different ec-earth components.

But wrt your second point, I think the situation is clear enough: The activity_id has to be the mip owning the experiment, not the one requesting the variable. In the five experiments that share custody between two mips, both must be listed in the metadata, separated by a single space, and @ufladrich informs me that the directory component should be the first mip listed in CMIP6_experiment_id.json.

ufladrich commented 5 years ago

[...] In the case that actually multiple mips are relevant, [1, Table 1, activity_id row] and [1, Table 1, footnote 3] specify that the mips should be listed together, separated by a single space.

And for that case the same reference states, on page 17, for the directory structure template:

If multiple activities are listed in the global attribute, the first one is used in the directory structure.
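
The two rules quoted above (space-separated global attribute, first activity in the directory) can be written down as a short sketch. The function name and the EXPERIMENT_OWNERS dictionary are hypothetical; the entries are illustrative only and should be checked against CMIP6_experiment_id.json, which remains the authoritative source.

```python
def activity_attributes(owning_mips):
    """Return (activity_id global attribute, directory component)."""
    if not owning_mips:
        raise ValueError("an experiment needs at least one owning MIP")
    # Owners are joined by a single space; the first one names the directory.
    return " ".join(owning_mips), owning_mips[0]


# Illustrative entries only (assumption): verify against the
# controlled vocabulary before relying on them.
EXPERIMENT_OWNERS = {
    "piControl": ["CMIP"],
    "ssp370": ["ScenarioMIP", "AerChemMIP"],
}

for experiment in sorted(EXPERIMENT_OWNERS):
    attribute, directory = activity_attributes(EXPERIMENT_OWNERS[experiment])
    print(experiment, "| activity_id:", attribute, "| directory:", directory)
```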

treerink commented 5 years ago

Is this correct? In other words, you don't mean that I don't have to do the drq, but just that I can list the applicable mips instead of _all_, right?

Yes, correct: you need to run drq to create the data request file for the cmorization as long as it isn't provided by us.

zklaus commented 5 years ago

Ok, in that case it seems to be a good idea to go with -m _all_. Advantages include

No need to figure out the applicable mips as intersection of mips requesting variables for the given experiment and mips EC-Earth is participating in
No risk to accidentally add or omit the wrong mip
The same filename for all cmorizations, simplifying scripting

So all in all it makes the job of the cmorizer much easier.

Are there any disadvantages that I am overlooking?

treerink commented 5 years ago

Using one data request file including everything is indeed a pragmatic option, but it will cause a lot of error messages because you are asking to cmorise many variables which are not in your data set (and this will differ per experiment). So you lose a bit of control: if a variable which should have been produced is, for whatever reason, not in your data set, its error message is hard to distinguish from the others. On the other hand, yes, it is quite a shortcut.

treerink commented 5 years ago

As described in #253 we aim for json data request files which are based on the xlsx file as created by drq, but with the ignored list applied on top and with directly applying the EC-Earth3 model configuration dependent preferences.

treerink commented 5 years ago

Note that the preference file might contain a key such as "omit". For instance, the chemical tracers CFC12, cfc13, sf6 and c14, as discussed in the ece portal issue 609-26, will only be cmorised for the EC-Earth3-CC configuration.
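
Putting the pieces from this thread together (request, ignored list, component preferences, and an "omit" rule), the intended filtering could look roughly like the sketch below. All names and data here are assumptions made for illustration: build_varlist is hypothetical, the producers mapping stands in for whatever tells the tool which components can deliver a variable, and the sample entries (including "not_producible") are made up.

```python
def build_varlist(requested, ignored, producers, preferences, active):
    """Return {component: {table: [variables]}} for one configuration.

    requested   -- iterable of (table, variable) pairs from the data request
    ignored     -- set of (table, variable) pairs that cannot be produced
    producers   -- {(table, variable): [components able to produce it]}
    preferences -- {(table, variable): [components in order] or "omit"}
    active      -- set of components in the model configuration
    """
    result = {}
    for table, variable in requested:
        key = (table, variable)
        if key in ignored:
            continue
        rule = preferences.get(key)
        if rule == "omit":
            continue
        candidates = [c for c in producers.get(key, []) if c in active]
        if rule:
            # Keep the rule's ordering, restricted to active producers.
            candidates = [c for c in rule if c in candidates]
        if candidates:
            tables = result.setdefault(candidates[0], {})
            tables.setdefault(table, []).append(variable)
    return result


# Made-up sample data, for illustration only.
requested = [("AERmon", "pfull"), ("Amon", "ua"),
             ("Omon", "cfc12"), ("Amon", "not_producible")]
ignored = {("Amon", "not_producible")}
producers = {("AERmon", "pfull"): ["ifs", "tm5"],
             ("Amon", "ua"): ["ifs"],
             ("Omon", "cfc12"): ["nemo"]}
preferences = {("AERmon", "pfull"): ["tm5", "ifs"],
               ("Omon", "cfc12"): "omit"}

varlist = build_varlist(requested, ignored, producers, preferences,
                        active={"ifs", "tm5", "nemo"})
print(varlist)
```

In this sketch the preferences passed in are assumed to be already specific to the model configuration, so an "omit" entry simply drops the variable.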

aearamos commented 5 years ago

Hi @treerink I'm trying to get daily zg500 from the cmorisation, but I can only find it in table AERday, with a "modeling_realm" = aerosol. So, I'm assuming this is from tm5, right?

Is there a way to cmorise zg500 assuming it comes from ifs, or it falls in this "desired" list we're creating in this issue? I think this is a good candidate for double-counting variables.

Thanks!

tommibergman commented 5 years ago

Yes aerosol realm is mainly from TM5 but there are exceptions. I agree also that this is one for the desired list.

treerink commented 5 years ago

@aearamos So if TM5 is not active in the used model configuration (for instance for EC-EARTH3-AOGCM) you want zg500 from IFS? Do you have a grib code (or expression) for it already? If so, I can add it to the preference file in the dedicated branch we have for it now.

aearamos commented 5 years ago

Yes, I'd want zg500 daily from IFS. I ran a test using a modified version of the AERday table, where I changed the modeling_realm variable from aerosol to atmo. It worked for me, but I know it's not ideal because it's not the right table. Would it be possible to add this variable to some other table? It is requested by DCPP and we'll start the runs pretty soon.

aearamos commented 5 years ago

Hi @treerink When testing this new branch, I used a simple varlist.json that was in resources and got the following error for all the tables: ERROR:ece2cmor3.taskloader: Cannot interpret day as an EC-Earth model component

Can you provide a varlist in the model that we can use to test? Or how can we generate the varlists now?

Thanks

goord commented 5 years ago

Hi Arthur, there is a script drq2vars that does that

aearamos commented 5 years ago

I'm just checking that. When I do: ./drq2varlist.py --drq cmvmm_DCPP_TOTAL_1_1.xlsx

I get
2019-03-15 19:49:14 INFO:ece2cmor3.cmor_target: CMOR tables data_specs_version : 01.00.29
2019-03-15 19:49:14 INFO:ece2cmor3.cmor_target: CMOR tables cmor_version : 3.4
2019-03-15 19:49:14 INFO:ece2cmor3.cmor_target: CMOR tables Conventions : CF-1.7 CMIP-6.2
2019-03-15 19:49:14 INFO:ece2cmor3.cmor_target: CMOR tables table_date : 08 March 2019
Traceback (most recent call last):
  File "./drq2varlist.py", line 38, in <module>
    main()
  File "./drq2varlist.py", line 34, in main
    json.dump(result, ofile, indent=4, separators=(',', ': '), sort_keys=True)
  File "/shared/earth/software/Python/2.7.9-foss-2015a/lib/python2.7/json/__init__.py", line 189, in dump
    for chunk in iterable:
  File "/shared/earth/software/Python/2.7.9-foss-2015a/lib/python2.7/json/encoder.py", line 431, in _iterencode
    for chunk in _iterencode_list(o, _current_indent_level):
  File "/shared/earth/software/Python/2.7.9-foss-2015a/lib/python2.7/json/encoder.py", line 332, in _iterencode_list
    for chunk in chunks:
  File "/shared/earth/software/Python/2.7.9-foss-2015a/lib/python2.7/json/encoder.py", line 408, in _iterencode_dict
    for chunk in chunks:
  File "/shared/earth/software/Python/2.7.9-foss-2015a/lib/python2.7/json/encoder.py", line 332, in _iterencode_list
    for chunk in chunks:
  File "/shared/earth/software/Python/2.7.9-foss-2015a/lib/python2.7/json/encoder.py", line 442, in _iterencode
    o = _default(o)
  File "/shared/earth/software/Python/2.7.9-foss-2015a/lib/python2.7/json/encoder.py", line 184, in default
    raise TypeError(repr(o) + " is not JSON serializable")
TypeError: <ece2cmor3.cmor_target.cmor_target object at 0x7fea5f48e910> is not JSON serializable

Do I have to add some more flags?

goord commented 5 years ago

Hmm, not sure; that seems like a bug in the branch. You could use the data request Excel file with ece2cmor, but you have to pass it with the --drq option.

aearamos commented 5 years ago

Could this be because of the version of CMOR? I'm using CMOR/3.3.3 now.

goord commented 5 years ago

No I think drq2vars is broken in the branch
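
For context, the TypeError in the traceback above is what the json module raises when the structure being dumped contains arbitrary Python objects. The sketch below reproduces that with a hypothetical stand-in class (Target is not the real cmor_target, and its attributes are assumptions) and shows one possible way around it, namely converting the objects to plain table and variable names before dumping. It is not the fix that was applied in the branch.

```python
import json


class Target(object):
    """Hypothetical stand-in for an object such as cmor_target."""

    def __init__(self, table, variable):
        self.table = table
        self.variable = variable


result = {"ifs": [Target("Amon", "ua"), Target("Amon", "va")]}

try:
    json.dumps(result)
    failed = False
except TypeError as err:
    # Same kind of failure as in the traceback: objects are not serializable.
    failed = True
    print("json.dumps failed:", err)


def to_plain(request):
    """Convert {component: [Target, ...]} to {component: {table: [vars]}}."""
    plain = {}
    for component, targets in request.items():
        tables = plain.setdefault(component, {})
        for target in targets:
            tables.setdefault(target.table, []).append(target.variable)
    return plain


text = json.dumps(to_plain(result), indent=4, separators=(",", ": "),
                  sort_keys=True)
print(text)
```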

treerink commented 5 years ago

Yes drq2vars.py is broken, also in the master where it has been merged in. If working again, an example of calling it is:

./drq2varlist.py --drq cmip6-data-request/cmip6-data-request-m=CMIP-e=CMIP-t=1-p=1/cmvme_CMIP_piControl_1_1.xlsx --ececonf EC-EARTH-AOGCM

And indeed the cmorisation itself is also broken in the master; we aim to have it all fixed next Wednesday. If working again, an example of calling it is:

ece2cmor cmip6-ec-earth-output/t306/001/ --exp t306 --nemo --conf ece2cmor3/resources/metadata-templates/cmip6-CMIP-piControl-metadata-template.json --drq ece2cmor3/scripts/cmip6-data-request/cmip6-data-request-m=CMIP.DCPP.LS3MIP.PAMIP.RFMIP.ScenarioMIP.VolMIP.CORDEX.DynVar.SIMIP.VIACSAB-e=piControl-t=1-p=1/cmvme_cm.co.dc.dy.ls.pa.rf.sc.si.vi.vo_piControl_1_1.xlsx --ececonf EC-EARTH-AOGCM --odir cmor-nemo-CMIP-piControl-AOGCM-306 >& log-cmip6-cmorizing-nemo-cmor-CMIP-piControl-AOGCM-t306 &

aearamos commented 5 years ago

Hi @treerink I'm trying to get daily zg500 from the cmorisation, but I can only find it in table AERday, with a "modeling_realm" = aerosol. So, I'm assuming this is from tm5, right?

Is there a way to cmorise zg500 assuming it comes from ifs, or it falls in this "desired" list we're creating in this issue? I think this is a good candidate for double-counting variables.

Thanks!

So, regarding this variable (zg500) from table AERday: by using the new varlist files, I would still have the modeling_realm as "aerosol", and the variable will be cmorised as an ifs variable? Will ece2cmor be able to cmorise it even though the realm doesn't match one of the realms expected for ifs? In this case I'd be using only ifs.

goord commented 5 years ago

Hi @aearamos we can add it to ifspar.json. After speaking to @tommibergman , it looks like we will let TM5 generate the model-level meteorological variables (u, v, t, zg, w) in the AER* tables and IFS the rest (such as zg500, which is on pressure levels).

goord commented 5 years ago

Yes, zg500 will be cmorized regardless of the realms; they no longer play a role in the new task loading strategy.

treerink commented 5 years ago

Closing this issue.

EC-Earth / ece2cmor3

What about double-counting variables #224