WCRP-CMIP / CMIP6_CVs

Controlled Vocabularies (CVs) for use in CMIP6
Creative Commons Attribution 4.0 International

Dealing with experiments with multiple sponsoring MIPs #951

Closed durack1 closed 3 years ago

durack1 commented 4 years ago

The discussion in #937 has diverged from the issue title and is now focused on clarifying how modeling groups need to identify data contributed to an experiment when their contribution is made under a MIP other than the primary sponsoring MIP.

To attempt to keep discussions contained in a suitably identified discussion, I have opened this new issue with an appropriate title. @MartinaSt @nedavid @taylor13 @wachsylon @sjmarsland @martinjuckes @sol1105 pinging you here for discussion continuity.

The example which led to #937 being submitted was the ssp370 experiment, which is sponsored by ScenarioMIP and AerChemMIP.

Just for context, in the experiment_id CV of the CMIP6_CVs, the experiments that list more than a single activity are:

| experiment_id | activity_id (primary) | activity_id (additional) |
| --- | --- | --- |
| dcppC-forecast-addPinatubo | DCPP | VolMIP |
| esm-1pct-brch-1000PgC | C4MIP | CDRMIP |
| esm-1pct-brch-2000PgC | C4MIP | CDRMIP |
| esm-1pct-brch-750PgC | C4MIP | CDRMIP |
| esm-1pctCO2 | C4MIP | CDRMIP |
| esm-bell-1000PgC | C4MIP | CDRMIP |
| esm-bell-2000PgC | C4MIP | CDRMIP |
| esm-bell-750PgC | C4MIP | CDRMIP |
| land-hist | LS3MIP | LUMIP |
| piClim-aer | RFMIP | AerChemMIP |
| piClim-control | RFMIP | AerChemMIP |
| ssp370 | ScenarioMIP | AerChemMIP |
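
A list like the one above could be extracted programmatically. The following is a minimal sketch, assuming the experiment_id CV is a JSON dictionary keyed by experiment name in which each entry carries an ordered "activity_id" list (primary sponsor first). The excerpt and the function name `multi_sponsor_experiments` are illustrative, not part of any CMIP6 tool.

```python
# Hypothetical excerpt mirroring the assumed layout of CMIP6_experiment_id.json:
# {"experiment_id": {<name>: {"activity_id": [<primary>, <additional>, ...]}}}
CV_EXCERPT = {
    "experiment_id": {
        "historical": {"activity_id": ["CMIP"]},
        "piClim-aer": {"activity_id": ["RFMIP", "AerChemMIP"]},
        "ssp370": {"activity_id": ["ScenarioMIP", "AerChemMIP"]},
    }
}


def multi_sponsor_experiments(cv):
    """Return {experiment_id: (primary, [additional, ...])} for co-sponsored experiments."""
    result = {}
    for name, entry in cv["experiment_id"].items():
        activities = entry["activity_id"]
        if len(activities) > 1:
            # The first entry is treated as the primary sponsor.
            result[name] = (activities[0], activities[1:])
    return result


for name, (primary, additional) in sorted(multi_sponsor_experiments(CV_EXCERPT).items()):
    print(name, primary, " ".join(additional))
```

Run against the full CV file instead of the excerpt, the same loop should reproduce the rows listed above, provided the file really has the assumed layout.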

For reference, https://github.com/WCRP-CMIP/CMIP6_CVs/issues/937#issuecomment-644243564 is where discussion about this separate issue begins. I will close #937 as that particular issue has been resolved.

taylor13 commented 4 years ago

The above is not really a statement of an "issue"; it's simply extracting the special cases from the CV. The issue(s) relate apparently to confusion resulting from improper handling or interpretation of these special cases. I'll try to enumerate these issues:

  1. The modeling groups have recorded their aspiration to participate in a MIP in their entry labeled activity_participation in the CMIP6_source_id.json
  2. Even though the activities they listed there were originally not meant to be definitive, the citation service relies on activity_participation in generating a potential list of citations that is needed once data have been published before the service can actually automatically create a citation. [@MartinaST please check that I've described this correctly.] So, as I understand it, if the primary sponsor of an experiment (i.e., the MIP that is supposed to be recorded first in activity_id in the file, and as activity_drs on ESGF, and used in generating directory structures, and communicating throughout the infrastructure) has not been included in the activity_participation list, then a citation cannot be created.
  3. [In the following sentence a correction was made 8/20/20 in which the following phrase was added: "Unless PrePARE is included as part of the publication procedure".] Unless PrePARE is included as part of the publication procedure, the ESGF publisher does not check, apparently, whether the first MIP listed in the activity_id is consistent with the first MIP listed for the experiment_id CV; it just checks that the activity_id value(s) correspond with those found in the experiment_id CV (and if only one of multiple values is found, the data is published anyway). This means data (not written through CMOR, which enforces the correct number of entries and their ordering) can get published with the wrong activity_drs value. In the future, I think we should guard against this happening. Also some data providers think that if they are not participating in the MIP that is the primary sponsor of an experiment, then they should publish the data using activity_id (and activity_drs) set to the secondary MIP (which they do participate in). This is not correct. They need to publish their experiment under the primary sponsoring MIP even if it isn't included in their list of activity_participation. I think we may have steered @nedavid wrong in https://github.com/WCRP-CMIP/CMIP6_CVs/issues/937#issuecomment-629993922 .
  4. The use of activity_id in the ESGF search is meant to allow users to narrow consideration to only the experiments that are sponsored (or co-sponsored) by an activity. It is not meant to allow users to narrow consideration of models that are participating in a particular MIP. But it is understandable that some have misinterpreted this search facet. In any case if a dataset has properly recorded all the activities responsible for an experiment in the activity_id, then these datasets will be found searching for any of the activities (even though only the first activity is used as part of the DRS).
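
The search behavior described in point 4 can be illustrated with a small sketch. It is not the ESGF search implementation; the records and the `label` field are hypothetical, and only the field names activity_drs and activity_ids are taken from ESGF metadata. The point is that a co-sponsored dataset is found under either activity, while only the first activity is used for the DRS.

```python
# Illustrative records only; "label" is a hypothetical field used for readability.
DATASETS = [
    {"label": "co-sponsored", "experiment_id": "ssp370",
     "activity_drs": "ScenarioMIP", "activity_ids": ["ScenarioMIP", "AerChemMIP"]},
    {"label": "single-sponsor", "experiment_id": "historical",
     "activity_drs": "CMIP", "activity_ids": ["CMIP"]},
]


def search_by_activity(datasets, activity):
    """Return labels of datasets whose activity_ids include the requested activity."""
    return [d["label"] for d in datasets if activity in d["activity_ids"]]


# The ssp370 record is found under both sponsoring activities.
print(search_by_activity(DATASETS, "ScenarioMIP"))
print(search_by_activity(DATASETS, "AerChemMIP"))
```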

In summary,

  1. no one should rely on the activity_participation for definitive information. Modeling groups may indicate in activity_participation that they are participating in a MIP that is a secondary sponsor of a particular experiment, but that same group must publish the data under the activity designated as the primary sponsor. If I understand the present state of things, this somehow prevents the citation service from producing a citation. Is that true? If so, the proper remedy is to 1) make sure the data lists the primary MIP sponsor first in the global attribute activity_id, and 2) ask the modeling group to add the primary sponsor's name to their activity_participation list, even though they may not really be participating in that activity. Alternatively, the citation service could be altered so that it doesn't consult activity_participation for the purpose of creating citations.

  2. [Correction made on 8/20/20: PrePARE does in fact check the ordering, so the following statement is wrong. See https://github.com/WCRP-CMIP/CMIP6_CVs/issues/951#issuecomment-677730580.] In the global attribute activity_id the string can be a list of activities and it was meant to be a list in the same order as it appears in activity_id of the experiment_id CV. CMOR ensures that it will be, but PrePARE and the ESGF publisher do not enforce the ordering, so some data has gotten published with the wrong activity being recorded in activity_drs (and presumably in the directory structure and elsewhere in the CMIP6 infrastructure). PrePARE and/or the publisher should implement a check that all (co-)sponsoring activities are listed in the correct order in activity_id. The data from all models that have run a given experiment will be found regardless of which sponsoring MIP one filters against. [We can consider whether to change the approach for CMIP7.]

MartinaSt commented 4 years ago

@taylor13 Thanks for writing this up. Two comments:

No, the citation service cannot be altered this way, as it checks against master_id and not activity. It relies on the unique ESGF master_id or the DRS_id of the data to connect the citation to the right data collection. This DRS_id makes the citation independent of the current portal solution, which is essential in the long term. These citation entries with their DRS_ids are defined by the modeling center in their registration of activity_participations for a source_id. The modeling center prepares the data for the registered activities and publishes them. The citation connects its metadata using the unique ESGF master_id (DRS_id). The master_id is also used to create the citation_url in the ESGF index. If a modeling center decides to prepare its data for an activity they do not intend to participate in (in terms of global attributes in the files and DRS), the ESGF publishes the data with these wrong master_ids, for which no citation entry exists (and, according to the modeling center's registration, should not exist).

taylor13 commented 4 years ago

@MartinaSt @martinjuckes @durack1 @sashakames For CMIP6, will the following solution work for ES-DOC, CMOR, ESGF, data citation services, the data request, the CV's etc. (at least going forward)?

  1. For any given experiment, the ESGF publisher (perhaps relying on PrePARE?) should check that the correct primary activity (defined to be the activity that is listed first in the experiment_id CV under "activity_id") has been recorded by the data provider in the file's activity_id. If it doesn't, we could
    • reject the file as being non-compliant (my choice), or
    • accept the file, but in the ESGF catalog enter a corrected activity_id and activity_drs, consistent with the experiment_id CV.

  2. If the activity_id contains entries that are ordered incorrectly, but otherwise consistent with the list in the experiment_id CV, we could
    • accept the file, but in the ESGF catalog, set activity_drs to the first activity listed for the experiment in the experiment_id CV, ignoring the ordering that the activities appear in the activity_id global attribute (my choice), or
    • reject the file.

  3. The citation service should invariably associate an experiment with the activity that is the primary sponsor of the experiment (i.e., listed as the first activity for that experiment in the experiment_id CV). If the citation service relies on activity_participation to generate citations, then we would need to insist that the modeling group list in activity_participation all activities that are the "primary" activities for experiments they have performed (whether or not they actually are carrying out all the tier-1 experiments called for by that activity).
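
To make the cases in this proposal concrete, here is a minimal sketch of how a file's activity_id could be triaged against the experiment_id CV. The function name `triage_activity_id` and the category labels are hypothetical; this is not how the ESGF publisher or PrePARE is implemented, and it assumes the global attribute is a space-separated string.

```python
def triage_activity_id(file_activity_id, cv_activities):
    """Classify a file's activity_id string against the CV list for its experiment.

    file_activity_id: space-separated global attribute, e.g. "ScenarioMIP AerChemMIP"
    cv_activities:    ordered list from the experiment_id CV, primary sponsor first
    """
    found = file_activity_id.split()
    if found == cv_activities:
        return "ok"
    if sorted(found) == sorted(cv_activities):
        # Same activities, different order.
        return "wrong-order"
    if cv_activities[0] not in found:
        # The primary sponsor is not recorded at all.
        return "primary-missing"
    # Primary sponsor present, but the list is incomplete or has extra entries.
    return "incomplete"


cv = ["ScenarioMIP", "AerChemMIP"]
for value in ("ScenarioMIP AerChemMIP", "AerChemMIP ScenarioMIP", "AerChemMIP", "ScenarioMIP"):
    print(repr(value), "->", triage_activity_id(value, cv))
```

Whether "wrong-order" leads to rejection or to a corrected catalog entry is exactly the policy choice under discussion; the sketch only separates the cases.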

A question is: how difficult would it be for the ESGF publisher to implement the above (i.e., check and possibly correct activity_id)?

As I understand it, nothing would have to be done by @MartinaSt , but how would we (she) find out that a group had failed to include a needed primary activity in their activity_participation list? Would ESGF throw an error? Would Martina receive some sort of notice that her service was unable to associate a citation with a data set? or what?

Please say if the above won't work and why.
thanks.

wachsylon commented 4 years ago

If the activity_id contains entries that are ordered incorrectly, but otherwise consistent with the list in the experiment_id CV, we could

  • accept the file, but in the ESGF catalog, set activity_drs to the first activity listed for the experiment in the experiment_id CV, ignoring the ordering that the activities appear in the activity_id global attribute (my choice), or

Does "ESGF catalog" mean that the data path changes? After all the effort to integrate the part of DRS into CMORization, I don't like the idea of moving parts of the path creation to the publisher side.

but how would we (she) find out that a group had failed to include a needed primary activity in their activity_participation list?

Proposal: ESGF Publisher checks it and creates a pull request in this CV which may be confirmed by someone.

I.e., entry for activity_participation would not be part of a registration of a model anymore. It seems to me that at the time of registration, many modeling groups do not know what activities they will support anyway.

durack1 commented 4 years ago

@wachsylon the intention of activity_participation was to collect intentions (to participate in satellite MIPs), and so we don't want to lose that at the point that the model is registered.

I am also not keen on the idea of a new issue being created automagically, rather, an error would be thrown, providing the URL to create a new issue so that whoever is trying to publish data is very aware of what is causing the hold-up and takes ownership over this registration step.

We have been trying to ascertain how much checking we can get the publication step to do, and I think that validation, not registration, is the only role of the publisher checks.

taylor13 commented 4 years ago

@sashakames I have 2 questions:

  1. Does the ESGF publisher check the directory structure for consistency with the CMIP6 specs? If so, does it check that the "activity" is the same as the first entry in the activity_id global attribute?
  2. When the ESGF publisher attempts to publish a dataset, does it know if @MartinaSt's citation service is able to link it to a citation, or does that happen later? If the publisher learns that the citation information is missing, does it currently error exit or what?

thanks.

matthew-mizielinski commented 4 years ago

Hi All, some observations (I've only just seen this conversation);

I've published MOHC data [1] via the CEDA esgf node for the two shared RFMIP-AerChemMIP experiments noted above which have the activity_id AerChemMIP in their dataset ids [2]. If I search for these on ESGF they all have links to file paths within THREDDS that use RFMIP for the activity_id, i.e. some part of ESGF uses the primary activity_id for the construction of directory paths, which is in conflict with the dataset id.

Searching back through the standard documentation (CMIP6 guidance and global attributes) the only reference to this issue I can find is in a bullet point in the global attributes document under the description of the DRS structure. My assumption was that if you haven't registered a model against an activity_id (UKESM1-0-LL wasn't initially registered for submission to RFMIP) that you should not attempt to submit to it.

I'd be tempted to suggest that where a group has published data using the additional activity_id that a "won't fix" errata is issued (it would be a little disruptive to correct this now). From what I can see on ESGF [3] there are three institutes that have done this for the two RFMIP-AerChemMIP experiments; MOHC, BCC and NOAA-GFDL. Similar searches should quickly highlight where else this has happened.

[1] with activity_id attribute set to RFMIP AerChemMIP in each file by CMOR.
[2] search for id:CMIP6.AerChemMIP.MOHC.UKESM1-0-LL.piClim-control.* or id:CMIP6.AerChemMIP.MOHC.UKESM1-0-LL.piClim-aer.* on ESGF (typo now removed from experiment_id)
[3] search for id:CMIP6.AerChemMIP.*.piClim-control.*

taylor13 commented 4 years ago

@matthew-mizielinski : Thanks for engaging on this. We need your perspective/advice. I'm confused about the terminology being used. From the ESGF page, when I select one of your piClim-control files and ask ESGF to "show metadata", some of the information shown is:

id = CMIP6.RFMIP.MOHC.HadGEM3-GC31-LL.piClim-control.r1i1p1f3.Amon.hus.gn.v20191113|esgf-data3.ceda.ac.uk
version = 20191113
_timestamp = 2019-11-13T14:59:58.401Z
access = HTTPServer, GridFTP, OPENDAP, Globus
activity_drs = RFMIP
activity_ids = RFMIP, AerChemMIP
cf_standard_name = specific_humidity
citation_url = http://cera-www.dkrz.de/WDCC/meta/CMIP6/CMIP6.RFMIP.MOHC.HadGEM3-GC31-LL.piClim-control.r1i1p1f3.Amon.hus.gn.v20191113.json
data_node = esgf-data3.ceda.ac.uk
data_specs_version = 01.00.29
dataset_id_template_ = %(mip_era)s.%(activity_drs)s.%(institution_id)s.%(source_id)s.%(experiment_id)s.%(member_id)s.%(table_id)s.%(variable_id)s.%(grid_label)s
datetime_start = 1850-01-16T00:00:00Z
datetime_stop = 1879-12-16T00:00:00Z
directory_format_template_ = %(root)s/%(mip_era)s/%(activity_drs)s/%(institution_id)s/%(source_id)s/%(experiment_id)s/%(member_id)s/%(table_id)s/%(variable_id)s/%(grid_label)s/%(version)s
.
.
.
instance_id = CMIP6.RFMIP.MOHC.HadGEM3-GC31-LL.piClim-control.r1i1p1f3.Amon.hus.gn.v20191113
institution_id = MOHC
master_id = CMIP6.RFMIP.MOHC.HadGEM3-GC31-LL.piClim-control.r1i1p1f3.Amon.hus.gn 
.
.
.

The only place I see AerChemMIP is as the second MIP listed in activity_id. I think this is done correctly and is exactly what I hoped to see. You mention a dataset i.d. and as an example give id:CMIP6.AerChemMIP.MOHC.UKESM1-0-LL.piClim-aerl.*. How do you get this i.d.?

To be sure, the ESGF search is smart enough to find piClim-aerl experiment data under either of the two activities listed in activity_id, but I don't think it uses the second activity in defining any fundamental i.d. To go further, do you know where the dataset i.d. you refer to is stored and what it is used for?

matthew-mizielinski commented 4 years ago

Let me get my terms correct; I'm used to using "dataset id" for what is recorded in ESGF as the instance_id (my fault) and I'll try to correct this forthwith.

Apologies for the typo in the instance_id/master_id search pattern above (now corrected); it should have included piClim-aer not piClim-aerl. The syntax for searching by this id is from the search page of esgf (under more search options).

The dataset you pick out is from the HadGEM3-GC31-LL "physical" model, which was processed and published with the primary activity_id RFMIP in the instance_id. For our earth system model UKESM1-0-LL things were done differently as it (initially*) wasn't registered in the CVs to contribute to RFMIP, so we used AerChemMIP for the shared experiments that were wanted by the AerChemMIP community.

I'll pick out one dataset from the UKESM1-0-LL submission;

id = CMIP6.AerChemMIP.MOHC.UKESM1-0-LL.piClim-control.r1i1p1f4.AERday.ua10.gn.v20200214|esgf-data3.ceda.ac.uk
version = 20200214
_timestamp = 2020-02-21T17:35:51.735Z
access = HTTPServer, GridFTP, OPENDAP, Globus
activity_drs = AerChemMIP
activity_ids = RFMIP, AerChemMIP
cf_standard_name = eastward_wind
citation_url = http://cera-www.dkrz.de/WDCC/meta/CMIP6/CMIP6.AerChemMIP.MOHC.UKESM1-0-LL.piClim-control.r1i1p1f4.AERday.ua10.gn.v20200214.json
data_node = esgf-data3.ceda.ac.uk
data_specs_version = 01.00.29
dataset_id_template_ = %(mip_era)s.%(activity_drs)s.%(institution_id)s.%(source_id)s.%(experiment_id)s.%(member_id)s.%(table_id)s.%(variable_id)s.%(grid_label)s
.
.
.
instance_id = CMIP6.AerChemMIP.MOHC.UKESM1-0-LL.piClim-control.r1i1p1f4.AERday.ua10.gn.v20200214
institution_id = MOHC
master_id = CMIP6.AerChemMIP.MOHC.UKESM1-0-LL.piClim-control.r1i1p1f4.AERday.ua10.gn
.
.
.

Note that AerChemMIP is used in the id, activity_drs, citation_url and instance_id. If you follow the links to the THREDDS catalogue you'll eventually get to this page which has links to files that use RFMIP in the path name rather than AerChemMIP, e.g.

https://esgf-data3.ceda.ac.uk/thredds/fileServer/esg_cmip6/CMIP6/RFMIP/MOHC/UKESM1-0-LL/piClim-control/r1i1p1f4/AERday/ua10/gn/v20200214/ua10_AERday_UKESM1-0-LL_piClim-control_r1i1p1f4_gn_18500101-18941230.nc

Data is also stored on disk under RFMIP rather than AerChemMIP; files can be found under RFMIP in the CEDA archive, but not under AerChemMIP.

So I think what I was trying to communicate is that we correctly have included both activity ids in the file metadata and that THREDDS appears to host the data using the primary activity_id RFMIP, but in all other aspects these datasets have AerChemMIP as their activity_id.

*UKESM1 was registered to take part in RFMIP via #731 a year ago, but we never got round to changing the processing configuration for the piClim-control and piClim-aer experiments as I had already published the first versions using AerChemMIP.
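
The mismatch described above can be checked mechanically. The sketch below assumes that the activity is the second dot-separated component of the instance_id (as in the dataset_id_template_ shown earlier) and the directory immediately after the mip_era component in the THREDDS file path. The function names are hypothetical.

```python
from urllib.parse import urlparse

INSTANCE_ID = ("CMIP6.AerChemMIP.MOHC.UKESM1-0-LL.piClim-control."
               "r1i1p1f4.AERday.ua10.gn.v20200214")
FILE_URL = ("https://esgf-data3.ceda.ac.uk/thredds/fileServer/esg_cmip6/CMIP6/RFMIP/"
            "MOHC/UKESM1-0-LL/piClim-control/r1i1p1f4/AERday/ua10/gn/v20200214/"
            "ua10_AERday_UKESM1-0-LL_piClim-control_r1i1p1f4_gn_18500101-18941230.nc")


def activity_from_instance_id(instance_id):
    """Activity is the second component: mip_era.activity_drs.institution_id..."""
    return instance_id.split(".")[1]


def activity_from_file_url(url, mip_era="CMIP6"):
    """Activity is the path element directly after the mip_era directory."""
    parts = urlparse(url).path.split("/")
    return parts[parts.index(mip_era) + 1]


id_activity = activity_from_instance_id(INSTANCE_ID)
path_activity = activity_from_file_url(FILE_URL)
if id_activity != path_activity:
    print(f"mismatch: instance_id uses {id_activity}, file path uses {path_activity}")
```

For the UKESM1-0-LL dataset quoted above, the two functions disagree (AerChemMIP in the instance_id, RFMIP in the path), which is the inconsistency being reported.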

taylor13 commented 4 years ago

Thanks for pointing me to the model where the "problem" is. I'm pretty sure I wouldn't recommend republishing the above dataset, since it seems to be discoverable and citable. The question is, should we (for future publications) guard against publishing datasets with the secondary activity used to define activity_drs, instance_id, master_id, citation_url, and the like?
I thought that the publisher keyed on activity_id, and extracted the first activity to use as activity_drs, which then got subsequently used to define the master_id, instance_id and citation_url. From the example above, however, that doesn't seem to be the case.
Does anyone know how activity_drs gets defined?

MartinaSt commented 4 years ago

@taylor13 Sorry for the delayed answers.

For CMIP6, will the following solution work for ES-DOC, CMOR, ESGF, data citation services, the data request, the CV's etc. (at least going forward)?

  1. For any given experiment, the ESGF publisher (perhaps relying on PrePARE?) should check that the correct primary activity (defined to be the activity that is listed first in the experiment_id CV under "activity_id") has been recorded by the data provider in the file's activity_id. If it doesn't, we could

    • reject the file as being non-compliant (my choice), or
    • accept the file, but in the ESGF catalog enter a corrected activity_id and activity_drs, consistent with the experiment_id CV.

My choice is the same: reject.

  2. If the activity_id contains entries that are ordered incorrectly, but otherwise consistent with the list in the experiment_id CV, we could

    • accept the file, but in the ESGF catalog, set activity_drs to the first activity listed for the experiment in the experiment_id CV, ignoring the ordering that the activities appear in the activity_id global attribute (my choice), or
    • reject the file.

Both are fine, but for the accept option we need to ensure that the activity_drs in instance_id and master_id is also corrected in the ESGF index.

Regarding the relation of ESGF Publisher and citation: The ESGF Publisher constructs the citation_url based on the instance_id, e.g.:

instance_id = CMIP6.RFMIP.MPI-M.MPI-ESM1-2-LR.piClim-control.r1i1p1f1.Amon.hur.gn.v20190710
citation_url = http://cera-www.dkrz.de/WDCC/meta/CMIP6/CMIP6.RFMIP.MPI-M.MPI-ESM1-2-LR.piClim-control.r1i1p1f1.Amon.hur.gn.v20190710.json

The citation is independent of the ESGF Publisher. Thus, the correctness of the connection between data and data citation relies on DRS (instance_id and that all its components are registered for this source_id in the CV).
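
The relationship between instance_id, master_id and citation_url can be sketched as follows. The URL pattern is inferred from the examples quoted in this thread rather than from the publisher's source code, so the base URL constant and both function names should be read as assumptions.

```python
# Base URL inferred from the citation_url values shown in this thread (assumption).
CITATION_BASE = "http://cera-www.dkrz.de/WDCC/meta/CMIP6/"


def master_id_from_instance_id(instance_id):
    """Drop the trailing version component (e.g. ".v20190710") from an instance_id."""
    return instance_id.rsplit(".", 1)[0]


def citation_url_from_instance_id(instance_id):
    """Build the citation_url by appending the instance_id and ".json" to the base."""
    return f"{CITATION_BASE}{instance_id}.json"


instance_id = "CMIP6.RFMIP.MPI-M.MPI-ESM1-2-LR.piClim-control.r1i1p1f1.Amon.hur.gn.v20190710"
print(master_id_from_instance_id(instance_id))
print(citation_url_from_instance_id(instance_id))
```

Because the activity is embedded in the instance_id, a dataset published under a secondary activity yields a citation_url under that activity as well.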

wachsylon commented 4 years ago

There is a plan for a ScenarioMIP community paper which includes analysis of the ssp370 experiment with data of modeling groups that actually do not "want" to participate in ScenarioMIP but AerChemMIP. Related to this, it seems to be important to understand what activity_participation means.

@matthew-mizielinski Should UKESM1-0-LL be for example excluded from RFMIP analysis? In that case i.e. if modeling groups did explicitly not want to be part of other MIP analysis/papers, it would be clear to me why the data should not get citation for ScenarioMIP.

I myself have had a different definition of the term activity_participation.

matthew-mizielinski commented 4 years ago

@matthew-mizielinski Should UKESM1-0-LL be for example excluded from RFMIP analysis? In that case i.e. if modeling groups did explicitly not want to be part of other MIP analysis/papers, it would be clear to me why the data should not get citation for ScenarioMIP.

My take on it is that anyone is welcome to analyse any experiment, but that the RFMIP community are principally interested in physical rather than earth system models, so our RFMIP leads are generally only running experiments to support their core analyses (i.e. RFMIP). The UKESM1 simulations submitted for RFMIP were run by our AerChemMIP group to support AerChemMIP activities.

Looking back on the MIPs structure I suspect we will need to think again about shared experiments for CMIP7 (perhaps we should create SharedMIP) and consider the purpose of the activity_participation field here (splitting it into something that means intending to submit and something that means involved in analyses).

taylor13 commented 4 years ago

I agree we should review the treatment of activity_id for CMIP7. For CMIP6,

  1. a single activity owns an experiment (and has responsibility for it). That activity has usually defined the experiment conditions, and for each experiment in the experiment_id CV, this primary activity is listed first. It should also appear first in the global attribute, activity_id.
  2. a second activity may have contributed to an experiment's design and that activity may in fact require the experiment results for its analysis. That activity should be listed (second) in the activity_id global attribute. In contrast (and as an exception), although the DECK and historical experiments may be essential to many of the activities, only CMIP appears under activity_id.

The activity_participation descriptors in the source_id CV were meant to be indications of interest, not as a way of constructing DRS identifiers for CMIP6. It makes some sense for @MartinaSt to construct potential citation references based on the activity_id information in the CV, but it is a mistake to expect that the various ESGF identifiers (e.g., instance_id, master_id) will include the activity_id of interest to a modeling group. The primary activity_id is uniquely determined by the experiment_id, as specified in the experiment_id CV, and this is the i.d. that should appear in instance_id, master_id, etc.

So I think it is o.k. that @MartinaSt has generated her potential citations using activity_participation, but in communicating with the rest of the infrastructure (mainly the publisher), the primary activity_id should be used, not a group's intention of participating in a MIP.

As someone noted already, activity_id is not needed to uniquely identify a dataset (and it doesn't appear in filenames), so software could ignore activity_id as specified in files and catalogs and indexes, and simply determine the activity_id used by ESGF in creating instance_id, master_id, etc. from the experiment_id CV (i.e., given an experiment_id, simply extract the first item appearing in activity_id in the CMIP6_experiment_id.json dictionary).
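
The lookup suggested in the last paragraph could look like the sketch below. It assumes the CV layout described there (an ordered activity_id list per experiment) and reuses the dataset_id_template_ string shown in the ESGF metadata earlier in this thread. The CV excerpt and the function names are hypothetical.

```python
# Template copied from the dataset_id_template_ field shown in ESGF metadata above.
DATASET_ID_TEMPLATE = (
    "%(mip_era)s.%(activity_drs)s.%(institution_id)s.%(source_id)s."
    "%(experiment_id)s.%(member_id)s.%(table_id)s.%(variable_id)s.%(grid_label)s"
)

# Hypothetical excerpt of the experiment_id CV (primary sponsor listed first).
CV_EXCERPT = {"experiment_id": {"piClim-control": {"activity_id": ["RFMIP", "AerChemMIP"]}}}


def primary_activity(experiment_id, cv=CV_EXCERPT):
    """Return the first activity listed for the experiment in the CV."""
    return cv["experiment_id"][experiment_id]["activity_id"][0]


def build_master_id(facets, cv=CV_EXCERPT):
    """Fill the template, taking activity_drs from the CV rather than from the file."""
    filled = dict(facets, activity_drs=primary_activity(facets["experiment_id"], cv))
    return DATASET_ID_TEMPLATE % filled


facets = {
    "mip_era": "CMIP6", "institution_id": "MOHC", "source_id": "UKESM1-0-LL",
    "experiment_id": "piClim-control", "member_id": "r1i1p1f4",
    "table_id": "AERday", "variable_id": "ua10", "grid_label": "gn",
    "activity_drs": "AerChemMIP",  # value recorded at publication; overridden below
}
print(build_master_id(facets))
```

With this approach the identifier depends only on the experiment_id and the CV, so whatever activity ordering a file happens to carry has no influence on it.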

taylor13 commented 4 years ago

Some input was provided offline, which I partially copy here so that it doesn't get forgotten:


The HAMMOZ-Consortium has reconfirmed that they want to publish the experiment "ssp370" as contribution to "AerChemMIP" (see https://github.com/WCRP-CMIP/CMIP6_CVs/issues/937). According to the Global Attributes Paper, the primary and DRS-relevant activity is specified first in the global attribute activity_id="AerChemMIP ScenarioMIP".

Currently, PrePARE prevents this data from being published in the ESGF.

As it is the wish of the data provider to publish their ssp370 data as part of AerChemMIP, a change to PrePARE is required. Any work-around solution outside of PrePARE will potentially cause problems in other infrastructure components and in the long term.

We need a short-term solution for the HAMMOZ-Consortium.

Do you see anything that speaks against this PrePARE change?

taylor13 commented 4 years ago

This is a response to the above (https://github.com/WCRP-CMIP/CMIP6_CVs/issues/951#issuecomment-672393353):


Is the issue the PrePARE software itself (some intrinsic logic), or how the metadata is structured in the CV? Meaning the CV could be updated to accommodate the change (whether of course this is an acceptable change)?

Did you capture the specific error message generated by PrePARE?

taylor13 commented 4 years ago

And a subsequent response:


Right now, the order of the values of activities for experiments with co-sponsoring MIPs is strict in PrePARE/CMOR and must be followed (e.g., https://github.com/PCMDI/cmip6-cmor-table/blob/c1810416ec67f48dca454b9ea69c5c8dbab856a9/Tables/CMIP6_CV.json#L9567 ). This is, as far as we can see, not part of the CMIP6 document ( https://goo.gl/v1drZl ). The note for handling multiple activity_ids does not strictly determine that there is one primary activity for experiments which cannot be changed.

The value for the activity within the DRS is used to link the data to the citation entry.

Since the first value of the activity_id from the CMOR/PrePARE CV is used to create the DRS, it is not possible to link a citation entry for a MIP listed as "secondary" in the CV with corresponding data because the data is saved under the activity listed first.

We propose allowing the order of the values for activity_id to be changed when more than one MIP is defined for any experiment in the CV. This results in data for one experiment being in multiple MIP directories; however, it better aligns with the wishes and intentions of the data providers.

Is the issue the PrePARE software itself (some intrinsic logic), or

We could change the activity_id values in the CV of CMIP6-CMOR-tables to be a regex that allows both orders, and we may need to tell CMOR/PrePARE to interpret it as a regex.
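
For illustration only, the regex idea in this proposal might look like the sketch below. It does not describe existing CMOR/PrePARE behavior (the proposal itself notes that the tools would first have to be told to interpret the value as a regex), and the helper name `any_order_pattern` is hypothetical.

```python
import re
from itertools import permutations


def any_order_pattern(activities):
    """Compile a pattern matching the space-separated activities in any order."""
    alternatives = (" ".join(re.escape(a) for a in perm) for perm in permutations(activities))
    return re.compile("(?:" + "|".join(alternatives) + ")")


pattern = any_order_pattern(["ScenarioMIP", "AerChemMIP"])

# fullmatch requires the whole attribute value to match one of the orderings.
for value in ("ScenarioMIP AerChemMIP", "AerChemMIP ScenarioMIP", "ScenarioMIP"):
    print(repr(value), bool(pattern.fullmatch(value)))
```

Such a pattern accepts both orderings but still requires every co-sponsoring activity to be present, so a file listing only one of the two MIPs would not match.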

taylor13 commented 4 years ago

I want to correct what I said in https://github.com/WCRP-CMIP/CMIP6_CVs/issues/951#issuecomment-658480977. It turns out that for any experiment, PrePARE will prevent publication of data if the activity_id is different from what is specified in the experiment_id CV. When multiple activities appear, they must be in the correct order. Here is a summary of the tests performed by Chris (@mauzey1):

 I downloaded some of the files from the ssp370 experiment to test them with PrePARE.

I tested a file that had the exact activity_id with PrePARE, and it passed.  I created a new version of this file with the activity_id reversed, which failed when processed by PrePARE.

C Traceback:
! In function: _CV_checkExperiment
!

!!!!!!!!!!!!!!!!!!!!!!!!!
!
! Error: Your input attribute "activity_id" with value
! "AerChemMIP ScenarioMIP" needs to be replaced with value "ScenarioMIP AerChemMIP"
! as defined for experiment_id "ssp370".
!
!  See Control Vocabulary JSON file.(Tables/CMIP6_CV.json)
!
!
!!!!!!!!!!!!!!!!!!!!!!!!!

└──> :: CV FAIL    :: /Users/mauzey1/Desktop/github/cmor/ssp370_test/reversed_activity_ids/emibc_AERmon_CanESM5_ssp370_r15i1p1f1_gn_201501-210012.nc

I then tried another experiment_id="ssp370" file with activity_id="ScenarioMIP", which also failed when processed by PrePARE.

C Traceback:
! In function: _CV_checkExperiment
!

!!!!!!!!!!!!!!!!!!!!!!!!!
!
! Error: Your input attribute "activity_id" with value
! "ScenarioMIP" needs to be replaced with value "ScenarioMIP AerChemMIP"
! as defined for experiment_id "ssp370".
!
!  See Control Vocabulary JSON file.(Tables/CMIP6_CV.json)
!
!
!!!!!!!!!!!!!!!!!!!!!!!!!

└──> :: CV FAIL    :: /Users/mauzey1/Desktop/github/cmor/ssp370_test/emibc_AERmon_CESM2-WACCM_ssp370_r1i1p1f1_gn_206501-210012.nc

Yes, both examples will generate an error when processed by PrePARE.

durack1 commented 4 years ago

@taylor13 @mauzey1 I believe this is the correct behavior, so there is nothing to fix on the CMOR/PrePARE end

taylor13 commented 4 years ago

I agree. Will close.

durack1 commented 4 years ago

This thread is relevant for documenting the CMIP6 status, and consideration of changes for CMIP6+:

From: Taylor, Karl E.
Sent: 27 August 2020 05:50
To: david.neubauer
Cc: WIP List; Martin Schupfner; Fabian Wachsmann
Subject: publishing ssp370 experiments on ESGF and linking to citations 

Dear David, 

As co-chair of the WGCM Infrastructure Panel (WIP) supporting CMIP6, I am writing to seek clarification of your reluctance to publish the ssp370 experiment  with identifying descriptors that indicate that the ScenarioMIP activity (MIP) specified the experiment design.   First I want to provide a bit of background, and then I have a couple of specific questions.

There is likely some misunderstanding of the purpose of the data set identifiers used throughout the CMIP6 infrastructure.  These identifiers (based on what is known as the Data Reference Syntax, DRS) are unique collections of descriptors that include the name of the model (i.e. “source”), the institution responsible for running the experiments, the name of the experiment and the activity that has primary responsibility for that experiment.  Driven by scientific investigation, each CMIP activity has designed experiments to meet their research goals, and they have described these in a GMD CMIP paper.  Except for the DECK and historical runs, the experiments designed by an activity will usually be of little interest to other MIPs.  In CMIP6 there are fewer than a dozen cases where a second MIP has enough interest in an experiment that they request that the modeling groups participating in their MIP run that other MIP’s experiment, regardless of their intent to participate in that activity.  In these cases, the activity_id contains the names of both MIPs, with the “owner” of the experiment listed first.  An example of dual interest is experiment ssp370, which was designed by ScenarioMIP (the “owner”) but is also of interest to AerChemMIP.    By including the names of both MIPs in the activity_id, users accessing CMIP data through the ESGF portals can search either for experiments in the ScenarioMIP activity or the AerChemMIP activity, and they will find all the ssp370 simulations.  [They could just as well simply search directly for all the ssp370 experiments and get the same collection.]

The unique identifiers of experiments, which are used in various forms across the entire CMIP6 infrastructure, are simply descriptions of who designed the experiment, what the experiment is, who performed the experiment and what model they used. These identifiers are not meant to imply anything about a modeling group’s interest in a particular activity. For example, “CMIP6.ScenarioMIP.BCC.BCC-CSM2-MR.ssp370” indicates that under the CMIP6 umbrella, the ScenarioMIP-designed experiment named ssp370 was performed by the BCC institution using the BCC-CSM2-MR model. This says nothing about BCC’s interest or lack thereof in ScenarioMIP.
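
To make the structure of such an identifier concrete, here is a minimal sketch that splits the five-part, dot-separated form quoted above into named fields. The field names and the `parse_dataset_id` helper are illustrative assumptions, not part of any CMIP6 tool, and the sketch does not handle longer identifiers with additional components.

```python
from typing import NamedTuple


class DatasetId(NamedTuple):
    """Illustrative field names for the five-part identifier (assumed labels)."""
    mip_era: str         # umbrella project, e.g. "CMIP6"
    activity_id: str     # the MIP that designed ("owns") the experiment
    institution_id: str  # who performed the experiment
    source_id: str       # the model that was used
    experiment_id: str   # what the experiment is


def parse_dataset_id(identifier: str) -> DatasetId:
    """Split a dot-separated identifier into its five named parts."""
    parts = identifier.split(".")
    if len(parts) != 5:
        raise ValueError(f"expected 5 dot-separated parts, got {len(parts)}")
    return DatasetId(*parts)


ds = parse_dataset_id("CMIP6.ScenarioMIP.BCC.BCC-CSM2-MR.ssp370")
print(ds.activity_id, ds.institution_id, ds.source_id, ds.experiment_id)
```

Nothing in the parsed fields records which MIPs the institution chose to take part in; the activity field only names the experiment's designer.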

The citation reads:
Xin, Xiaoge; Wu, Tongwen; Shi, Xueli; Zhang, Fang; Li, Jianglong; Chu, Min; Liu, Qianxia; Yan, Jinghui; Ma, Qiang; Wei, Min (2019). BCC BCC-CSM2MR model output prepared for CMIP6 ScenarioMIP ssp370. Version YYYYMMDD[1]. Earth System Grid Federation. https://doi.org/10.22033/ESGF/CMIP6.3035

This accurately reflects the information provided by the data set identifier. “CMIP6 ScenarioMIP ssp370” should be considered the full name of the experiment (i.e. its extended name, including both CMIP6 and ScenarioMIP).  “Prepared for CMIP6 ScenarioMIP ssp370” is meant to be an abbreviation of “produced by a simulation conforming to the CMIP6 ScenarioMIP ssp370 experiment design.”    It does not imply that the modeling group actually participated in ScenarioMIP.

Another aspect of CMIP6 that is also at first difficult to understand is what is meant by the “activity_participation” descriptor that is created when a model is registered. In registering a model, a group indicates its potential interest in participating in a number of MIPs. Subsequently, they may add additional names to the list if they decide to participate in some other MIPs. Sometimes a group’s intention to participate in a particular MIP isn’t realized. Thus, at any point in time the activity_participation list cannot be read as a commitment to, or a definitive statement of interest in, the MIPs listed there.

The citation service must not slow down the publication process, so the citations for all potential datasets expected in the CMIP6 archive must be created prior to publication of the dataset. The citation service creates its potential list of citations based on what modeling groups have recorded in their “activity_participation” lists. [It would also not be practical to produce the list assuming all models will perform all experiments.] If data from an experiment is published but the modeling group has not included in its activity_participation list the activity that “owns” that experiment, then the dataset will not be citable. This means that modeling groups must include in their activity_participation list all activities that are responsible for the experiments they have performed. In the case of ssp370, a modeling group must include ScenarioMIP in its activity_participation list. When this system of creating citations was originally set up, we did not anticipate that there would be more than one activity asking its participants to perform a particular experiment. If we had, we would have probably changed the name of the descriptor from “activity_participation” to something like “activities_with_primary_responsibility_for_the_experiments_expected_to_be_performed_with_this_model”. I would note that very few users or contributors to CMIP are even aware of the activity_participation lists, so there should be very little concern that by including ScenarioMIP in this list, anyone might assume the modeling group will actually participate in ScenarioMIP. It is essential, however, to include ScenarioMIP if you want to publish one of ScenarioMIP’s experiments: for example, ssp370. If you do not include the activity, then there will be no proper citation or DOI created.
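
The rule in the preceding paragraph can be sketched as a small check. This is only an illustration of the rule as stated there, not the citation service's actual implementation; the `EXPERIMENT_ACTIVITIES` mapping reproduces a few rows of the table at the top of this issue, and the function names and the example participation lists are hypothetical.

```python
# Experiment -> (primary "owner" activity, additional activity), copied from
# a few rows of the table at the top of this issue.
EXPERIMENT_ACTIVITIES = {
    "ssp370": ("ScenarioMIP", "AerChemMIP"),
    "piClim-aer": ("RFMIP", "AerChemMIP"),
    "piClim-control": ("RFMIP", "AerChemMIP"),
    "land-hist": ("LS3MIP", "LUMIP"),
}


def owner_of(experiment_id):
    """Return the activity with primary responsibility for the experiment."""
    return EXPERIMENT_ACTIVITIES[experiment_id][0]


def is_citable(experiment_id, activity_participation):
    """Apply the stated rule: a citation is prepared only if the owning
    activity appears in the model's activity_participation list."""
    return owner_of(experiment_id) in activity_participation


# Hypothetical participation lists for two imaginary models.
print(is_citable("ssp370", ["CMIP", "AerChemMIP"]))                 # False
print(is_citable("ssp370", ["CMIP", "AerChemMIP", "ScenarioMIP"]))  # True
```

The check looks only at the first (owning) activity, so listing AerChemMIP alone is not enough for ssp370 under this rule.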

With the above background (apologies for the length, but CMIP6 is complex and involves lots of infrastructure parts that must communicate efficiently), I have a couple of questions.

1.  Would you still be reluctant to publish your ssp370 datasets and have them identified as being run in conformance with the ScenarioMIP  design?
2.  What would you recommend we do differently for CMIP7?  (in that eventuality).  

Any other thoughts/comments/suggestions would be most welcome.

Best regards,
Karl

MartinaSt commented 4 years ago

This is the view of the group who published ssp370 under AerChemMIP.

Subject:    Re: Questions about BCC ssp370 simulations
Date:   Fri, 28 Aug 2020 10:09:00 +0800 (GMT+08:00)
From:   Dr. Jin Ba
To:     Karl E.
CC:     twwu@cma, shixl@cma, yanjh@cma, wgcm-wip

Dear Karl,

I am the liaison for DOC. About your questions I asked my colleagues Jinghui and Xueli.

The answers are as below,

1. Yes, the documentation was not clear about this, because the ssp370 experiment was listed in both ScenarioMIP (O'Neill et al., GMD 2016) and AerChemMIP (Table 4 in Collins et al., GMD 2016, attached). ssp370 and ssp370-lowNTCF were compared in Table 4 of AerChemMIP, which stated "ssp370 is also specified as Tier 1 in ScenarioMIP"; this was very easy to misunderstand.

Additionally, in the CMIP6_CV.json of the Climate Model Output Rewriter (CMOR) system, the activity_ids of ssp370 include ScenarioMIP and AerChemMIP, which might mean that either activity is acceptable. Therefore we finished the datasets under AerChemMIP in 2019.

Besides, as also stated in the CMIP6_CV.json, BCC-CSM2-MR is the major model version participating in ScenarioMIP, while the activity_participation of BCC-ESM1 is AerChemMIP and CMIP. Therefore we put the BCC-ESM1 simulated ssp370 under AerChemMIP.

2. As in our response to question 1, we know that ssp370 was one of the experiments in ScenarioMIP. Actually, we've published the datasets of another model version, i.e., BCC-CSM2-MR, for different scenarios (ssp126, ssp245, ssp370 and ssp585). But there is only one scenario experiment conducted with BCC-ESM1, and several comparison experiments (ssp370-lowNTCF, ssp370SST, ssp370SST-lowNTCF) are under the AerChemMIP activity. Therefore it should be more convenient to put the ssp370 data under the same MIP.

We wouldn't object to publishing the BCC-ESM1 ssp370 data under ScenarioMIP. If you think it is really necessary, we could modify the user-input information and process the data again for the experiment.

3. No, we haven't done anything special to CMORize and publish the data. Because the data was processed in 2019, we need to check the related information about publishing the data.

4. PrePARE was not updated in order to ensure the data release.

Hopefully these answers are clear for you.

Best regards,

Jin

    -----Original Message-----
    From: "Taylor, Karl E."
    Sent: 2020-08-27 04:09:10 (Thursday)
    To: bajin@cma, twwu@cma, janjh@cma.
    Cc: "WIP List"
    Subject: Questions about BCC ssp370 simulations

    Dear Jin, Tongwen, and Jinghui,

    I’m writing on behalf of the WGCM Infrastructure Panel (WIP).  We seek your input on one of the confusing aspects of CMIP6 metadata, specifically assignment of the activity_id.  Except for the DECK and historical runs, most CMIP6 experiments are of direct interest to a single activity.  There are a few experiments, however, that are of interest to multiple activities.  One example is the ScenarioMIP ssp370 experiment, which is also of interest to the AerChemMIP activity.  When you published your BCC-ESM1 ssp370 output on ESGF, you identified it with the AerChemMIP activity, rather than the ScenarioMIP activity, which has primary responsibility for this experiment and is therefore considered its “owner”.    In contrast, when you published your BCC-CSM2-MR ssp370 output, you identified it with the ScenarioMIP activity.   Recent versions of the ESGF publisher that use PrePARE to check the metadata will not permit ssp370 data to be published under the AerChemMIP activity.

    So we want to understand what led to your choosing to publish the BCC-ESM1 ssp370 output under the AerChemMIP activity, rather than the ScenarioMIP activity. We do not want you to retract or republish this data because that would be confusing to those who have already accessed it and would involve work from lots of people, including your scientists and those of us supporting CMIP6 infrastructure.   What we would like to find out though is:

        1. When you published the ssp370 data, why did you choose to publish under the AerChemMIP rather than ScenarioMIP?  Although our documentation was neither explicit nor clear about this, ssp370 is supposed to be invariably published under the ScenarioMIP activity.  Were you aware of this?  I’m guessing not.
        2. Had you been aware that ScenarioMIP is considered the owner of ssp370, would you have objected to publishing your BCC-ESM1 data under that activity_id?
        3. When you specified AerChemMIP as the activity for ssp370, did you have to do anything special to successfully publish?  Do you recall any errors raised in your first attempts?
        4. Do you know whether PrePARE is being run as part of the publication procedure at your data node?

    Thanks for all your help with this.  We hope to improve the clarity of our documentation and consider for CMIP7 alternative ways of handling experiments of interest to multiple activities.

    best regards,

    Karl on behalf of the WIP.
--
Dr. Jin Ba
Division of Climate System Modelling
National Climate Center
China Meteorological Administration
No.46 Zhongguancun Nandajie, Haidian
100081, Beijing, China
Tel. +86 10 6840 8605
Email: bajin@cma
[Attached image: Table 4.png (Table 4 from Collins et al., GMD 2016)]

durack1 commented 4 years ago

@MartinaSt thanks for that, I have just edited the above to remove complete email addresses, just so we're not the source of complete email addresses for spambots

MartinaSt commented 4 years ago

@durack1 Sure, thanks!

taylor13 commented 4 years ago

And here is a further correspondence from David Neubauer, responding to https://github.com/WCRP-CMIP/CMIP6_CVs/issues/951#issuecomment-682099506:

Dear Karl,

sorry for my late response. Thank you for the clarifications. I can only provide my personal perspective as a
 member of a modelling group taking part in CMIP6.

It's good that all MIPs are included in the activity_id and that an experiment (e.g. ssp370) can be found
 under any activity_id that requests the experiment.

For AerChemMIP ssp370 and variants e.g. ssp370-lowNTCF, ssp370SST have to be performed. The time
 period requested for ssp370 differs between ScenarioMIP (2015-2100) and AerChemMIP (2015-2055). 
Also the diagnostics requested by AerChemMIP and ScenarioMIP differ. Therefore I would find it more 
intuitive for data users if the ssp370 experiment were published under AerChemMIP. If CMIP would rather 
have the MIP that designed the experiment in the filename and citation, then that's fine with me too. My primary 
interest is that our data is published and findable. And that there is a clear process how data needs to be 
CMORized and uploaded.

In answer to your questions:
1. We are a small modelling group and already had to re-process and re-publish ssp370 data because of this 
inconsistency in id's. We followed the instructions that were given to us. We don't have the resources to 
re-publish the data again.
2. If a modelling group registers for a MIP it should automatically be registered for all MIPs that designed 
experiments for this MIP. E.g. in CMIP6 registration for AerChemMIP would automatically lead to registration 
for ScenarioMIP and RFMIP. Or the modelling groups need to be made aware that registration for the 
MIPs designing an experiment is required. In my understanding this was not the case for CMIP6 and led to 
the issue we have had. It would be great if also smaller modelling groups can take part in CMIP, that don't 
have the resources to take part in all the MIPs, without unnecessary struggles.

Best regards,
David

taylor13 commented 4 years ago

Reviewing the current ESGF archive, I found only a few cases where models had incorrectly published data when 2 activities were supposed to be listed as the activity_id:

  1. For the ssp370 experiment, the BCC-ESM1 model omitted the primary activity, ScenarioMIP, and incorrectly published under the secondary activity, AerChemMIP. Their data is citable under AerChemMIP.

  2. For the ssp370 experiment, the MPI-ESM-1-2-HAM model republished their data correctly with activity_id="ScenarioMIP AerChemMIP", but the ESGF search interface link pointing to the citation information is broken. The reference, however, is found in the citation database by replacing in the citation_url "ScenarioMIP" with "AerChemMIP".

  3. For the esm-1pct-brch-1000PgC experiment, the CESM2 model omitted the secondary activity, CDRMIP. Their data is citable under C4MIP.

  4. For the land-hist experiment, the BCC-CSM2-MR model and the GFDL-ESM4 model omitted the primary activity, LS3MIP, and published incorrectly under the secondary activity, LUMIP. Their data is citable under LUMIP.

  5. For the land-hist experiment, the CESM2 model omitted the secondary activity, LUMIP. Their data is citable under LS3MIP.

I didn't check whether the activity_ids were listed in the correct order when both were listed in activity_ids (and so I didn't check whether the correct activity was used in generating the DRS i.d.'s.).
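
The kinds of deviation listed above, plus the ordering question left unchecked, could be distinguished with a check along these lines. This is a rough sketch and not what PrePARE does; the `CV_ACTIVITIES` mapping copies a few rows from the table at the top of this issue, and `check_activity_id` and its return strings are hypothetical.

```python
# Expected activities per experiment, primary ("owner") first, copied from
# a few rows of the table at the top of this issue.
CV_ACTIVITIES = {
    "ssp370": ["ScenarioMIP", "AerChemMIP"],
    "esm-1pct-brch-1000PgC": ["C4MIP", "CDRMIP"],
    "land-hist": ["LS3MIP", "LUMIP"],
}


def check_activity_id(experiment_id, activity_id_attr):
    """Compare a space-separated activity_id global attribute with the CV."""
    expected = CV_ACTIVITIES[experiment_id]
    found = activity_id_attr.split()
    if found == expected:
        return "ok"
    if sorted(found) == sorted(expected):
        return "wrong order"
    if expected[0] not in found:
        return "missing primary activity"
    if any(activity not in found for activity in expected):
        return "missing additional activity"
    return "unexpected activity"


print(check_activity_id("ssp370", "ScenarioMIP AerChemMIP"))
print(check_activity_id("ssp370", "AerChemMIP"))
print(check_activity_id("esm-1pct-brch-1000PgC", "C4MIP"))
print(check_activity_id("land-hist", "LUMIP LS3MIP"))
```

The second and third calls correspond to cases 1 and 3 in the list; the last call illustrates the ordering problem that was not checked in the archive review.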

taylor13 commented 4 years ago

I suggest that nothing be done about the current deviations from the data requirements described in https://github.com/WCRP-CMIP/CMIP6_CVs/issues/951#issuecomment-701711008 . In all cases the data can be found on ESGF under at least one of the activity_id's associated with an experiment; if a user enters the name of the experiment (without specifying activity_id), all available output will be listed. With one exception all citation_url's can be successfully followed to the citation service, although in a few cases the citation will indicate the wrong activity (i.e., the secondary activity) is responsible for the experiment. I don't think this is a big deal.

The one exception to the above is that for the ssp370 experiment, the MPI-ESM-1-2-HAM model data is now correctly republished with activity_id="ScenarioMIP AerChemMIP", but the citation_url recorded in the ESGF catalog points to a non-existent reference. To get to a citable reference to this data a user must replace "ScenarioMIP" with "AerChemMIP" in the citation_url.

I don't think it worth the effort to correct this small blemish. If we adopt this course, HAMMOZ won't have to republish their data (David's main concern expressed in https://github.com/WCRP-CMIP/CMIP6_CVs/issues/951#issuecomment-701673162 above) and no one involved in the infrastructure will have to do anything.

We should, however, make sure that in the future everyone includes PrePARE checks as part of their publication procedure and all groups include all activities in their activity_participation list in the source_id registration. How can we make sure that's the case?

MartinaSt commented 4 years ago

@taylor13 The citation_url published in the ESGF index is a tiny part of the citation service and thus cannot be discussed individually but requires the discussion of the citation service to ensure its consistency. The citation_url is completely consistent with the DRS uniquely identifying the cited data collection. Any other citation_url would make the citation service inconsistent. Thus I would not describe it as wrong, where there is an incompatibility between the data citation concept and the co-sponsoring concept.

As there is a strong preference within the WIP for the co-sponsoring solution, let's discuss how to deal with it in the citation service. For this case, the data citation is broken and there is no way to mend it. We just have two options, how to deal with it. As there are no other cases among the CMIP6 data where the citation is broken, we can treat the Hammoz data case as a special case.

Option 1 - add two more DOIs for the ScenarioMIP data contribution and for the ssp370 experiment: These two DOIs will point to exactly the same data collections, which is bad practice and makes any impact study difficult or less meaningful. Moreover the relation between ScenarioMIP and ssp370 data is meaningless as well. Therefore a note should be added on the landing pages:

It is not guaranteed that data users will read these notes. We provide different sources to find data references, even script-based possibilities. Therefore these notes on the landing pages will not always be read, as data users expect that the data reference for Hammoz's ScenarioMIP contains Hammoz's contribution to this MIP same as for all other ScenarioMIP data references. Thus, we will have wrong and missing data citations regardless of the documentation.

Option 2 - leave this one experiment ssp370 without a data citation: This option avoids the bad practice of having two DOIs on the same data collection as well as the problems related to this: in terms of documentation, data impact studies, wrong data citation usage (data citation of Hammoz's non-existing ScenarioMIP contribution) etc., which I have described in Option 1. This option also means less effort for everyone. The disadvantage is that the ssp370 data remains non-citable.

Which option would you prefer, @nedavid ? What is your preference, @taylor13 ?

Notes:

taylor13 commented 4 years ago

I would favor option 2. Under the downside you mentioned ssp370 data can't be specifically cited for this model. However, to give credit to HAMMOZ in a publication, couldn't an author include the following reference available from the citation service? The DOI is a valid one.

[Image: cite]

taylor13 commented 4 years ago

With regard to DKRZ possibly assisting HAMMOZ in reprocessing and republishing the data, I don't recall why that is needed? The ssp370 data has already been published with the correct global attribute activity_id = "ScenarioMIP AerChemMIP" and all the dataset identifiers are correct on ESGF.

nedavid commented 4 years ago

Thank you for your help @MartinaSt and @taylor13. What @taylor13 suggests seems reasonable. A note on the AerChemMIP landing-page to explain that counter-intuitively one experiment namely ssp370 is not included and has to be cited as: https://doi.org/10.22033/ESGF/CMIP6.1621 would be very helpful.

So far this only concerns the ssp370 data. But we will publish also data for the piClim-aer and piClim-control experiments, which are hosted by RFMIP and co-sponsored by AerChemMIP (see the comment by @durack1 above).

What is your recommendation to publish these two experiments? As I understand we should register for RFMIP. But will it be possible to get a citation for these two experiments?

MartinaSt commented 4 years ago

Thanks @taylor13 and @nedavid for your feedback.

If we go with option 2, no registration for ScenarioMIP nor RFMIP is required. The piClim-aer and piClim-control experiments will add another two exceptions with broken data citations, which will not be citable. Sorry, David, this is the same case as ssp370.

@taylor13 I have mentioned the support that DKRZ could provide for re-processing and re-publication of the data to make clear that the resource constraint mentioned by David is not an obstacle to the solution, which Michael favored in his email from 08.09.2020:

... But I think my position to open the ESGF publication restrictions is clear from the related email discussion. Modelling groups which only participate for an experiment in the co-sponsoring MIP should have the opportunity to publish and to cite in this MIP.

So, for completeness of all possible options, there is the option to break the strict co-sponsoring rule and republish ssp370 under AerChemMIP. DKRZ could do the required data processing. This option would not break the data citation (all experiments and MIPs provided are citable in an intuitive way), and this would also make transparent to data users that the ssp370 data from Hammoz was created with AerChemMIP diagnostics. Severe problems in other infrastructure parts like the data replication have not been raised, so far, at least I am not aware of such a problem.

We all have spent quite some time on this issue already, so we should come to a final decision on Tuesday between the possibilities:

As Michael wrote: We at DKRZ shall accept the WIP decision.

durack1 commented 4 years ago

@taylor13 I am a little confused if there is something to be done here in the CMIP6_CVs, or whether this has been reopened so the citation tweaks can be completed, and once this is done, closed?

taylor13 commented 4 years ago

I reopened this so that we could capture some of the discussion that affects what activities should be recorded in source_id CV. There is no need for any changes to the CV processing, although I think HAMMOZ may need to add an activity or two to their activity_participation list.

So let's keep it open until the way forward has been decided.

MartinaSt commented 4 years ago

@taylor13 @durack1 No, there is no need to register additional activity_participations for MIPs, in which HAMMOZ does not participate, for both remaining possibilities:

taylor13 commented 3 years ago

The WIP provided its final guidance on this in a recent email to @wachsylon and @nedavid:

1) Your ScenarioMIP experiments should remain unmodified in the CMIP6 archive and will remain uncitable.

2) Your forthcoming piClim-aer and piClim-control experiments should be published and will be citable following the current CMIP6 approach, which requires

   a) assigning activity_id = "RFMIP AerChemMIP" to the global attribute in each of your output files.

   b) adopting the recommended directory structure for your output files, which uses "RFMIP" rather than "AerChemMIP" as the activity_id sub-directory. (This may not be absolutely necessary, but it is safer, more convenient, and consistent with what most other modeling groups have done, so we urge you to adopt that structure.)
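
As a small illustration of how the two requirements in the guidance relate, the sketch below derives the directory component from the global attribute by taking the first listed activity. It assumes, as described earlier in this thread, that the owning MIP is listed first; `activity_subdirectory` is a hypothetical helper, not a CMOR or ESGF function.

```python
def activity_subdirectory(activity_id_attr):
    """Return the activity name to use as the sub-directory: the first
    (primary) activity in the space-separated global attribute."""
    activities = activity_id_attr.split()
    if not activities:
        raise ValueError("activity_id attribute is empty")
    return activities[0]


# The global attribute keeps both MIPs; the directory uses only the owner.
print(activity_subdirectory("RFMIP AerChemMIP"))
```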

I'll close this issue now.