WCRP-CMIP / CMIP6Plus_CVs

Controlled Vocabularies (CVs) for use in CMIP6Plus
Creative Commons Attribution 4.0 International

remove "collection" from individual CVs #57

Open taylor13 opened 5 months ago

taylor13 commented 5 months ago

I think we can avoid lots of headaches if we adopt the following terms: a dataset collection (DScollection) is a collection of datasets that all rely on a common collection of CVs (CVcollection). I introduced the concept of dataset collections a few years ago, and I have become convinced it is essential in thinking about the various WCRP-related datasets.

And why not define a CVcollection in a JSON file (rather than including the information in every CV included in a collection)? That is, remove the collection information from each of the CVs and instead list all the CVs that belong to each CV collection in a separate JSON file. A contributing CV could be included in multiple CV collections. Each time any of the contributing CVs was modified, a new CV collection would be defined, identical to the previous collection except for the presumably few contributing CVs that have changed.

Schematic of CVcollection file:

"CVcollection": {
    "CMIP6plus_CVcollection": {
        "collection_version": "6.5.1.0",
        "CVcollection_modified": "2023-11-20T16:32:10Z",
        "CVcollection_release": "??",
        "contents": {
            "CMIP6plus_DRS": "v6.5.0.8",  (could name "MIPs_DRS" and indicate 6plus by the version: 6.5.?.?)
            "MIPs_product": "v1.1.1.1",   ("MIPs" prefix indicates that this uses a CV for product that is
                                           likely useful across MIP phases)
            .
            .
    "CMIP6plus_CVcollection": {
        "collection_version": "6.5.1.1",
        "CVcollection_modified": "2024-02-20T10:32:00Z",
        "CVcollection_release": "??",
        "contents": {
            "CMIP6plus_DRS": "v6.5.0.8",  (version indicates CMIP6plus)
            "MIPs_product": "v1.1.1.2",   (version indicates product is same
                                           as version 1.1.1.1, but with additional options for product)
            .
            .

This way of doing things makes it clear what has changed from one CV collection version to the next. Also, different CV collections can draw on a common set of CVs (e.g., obs4MIPs and input4MIPs might rely on the same "frequency" CV as CMIP).
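To illustrate how such a file would expose what changed between two collection versions, here is a minimal Python sketch. It assumes the "contents" block of each CVcollection has been loaded as a plain dict mapping CV name to CV version; the function name `changed_cvs` is hypothetical and not part of any existing tooling.

```python
def changed_cvs(old_contents, new_contents):
    """Return {cv_name: (old_version, new_version)} for every CV whose
    version differs between two CVcollection "contents" blocks.
    A CV present in only one collection is reported with None on the other side."""
    names = sorted(set(old_contents) | set(new_contents))
    return {
        name: (old_contents.get(name), new_contents.get(name))
        for name in names
        if old_contents.get(name) != new_contents.get(name)
    }


# "contents" of the two collection versions shown in the schematic
# (only the two entries that the schematic spells out).
contents_6_5_1_0 = {"CMIP6plus_DRS": "v6.5.0.8", "MIPs_product": "v1.1.1.1"}
contents_6_5_1_1 = {"CMIP6plus_DRS": "v6.5.0.8", "MIPs_product": "v1.1.1.2"}

print(changed_cvs(contents_6_5_1_0, contents_6_5_1_1))
# {'MIPs_product': ('v1.1.1.1', 'v1.1.1.2')}
```

Because each collection only lists CV names and versions, the comparison never has to open the individual CV files.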

This also makes it easy for us to clearly indicate which CVs apply to each proposed "dataset collection" (DScollection). Each dataset in a particular DScollection would have to conform to the specifications found in a single CVcollection. We could, however, allow the least significant digits of the CVcollection version to differ. For example, adding a new source_id to a CV wouldn't disrupt datasets already published, because the new CVcollection would be backward compatible with the old one, only including additional options for source_id. Thus, datasets conforming with CVcollection versions 6.0.2.8 and 6.0.2.9 could be included together in a single DScollection.
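The version rule could be checked along these lines. This is only a sketch: it assumes four-part dotted versions and that "least significant digits" means everything after the first three components, which the proposal does not pin down. The helper names are hypothetical, and the check compares version strings only; it does not verify that the newer CVs really are backward compatible.

```python
def parse_version(version):
    """Turn '6.0.2.8' (or 'v6.0.2.8') into a tuple of ints: (6, 0, 2, 8)."""
    return tuple(int(part) for part in version.lstrip("v").split("."))


def same_dscollection(version_a, version_b, significant=3):
    """True if two CVcollection versions agree on their leading components.

    `significant` is an assumption: the number of leading components that
    must match for datasets to share one DScollection.
    """
    return parse_version(version_a)[:significant] == parse_version(version_b)[:significant]


print(same_dscollection("6.0.2.8", "6.0.2.9"))  # True
print(same_dscollection("6.0.2.8", "6.0.3.0"))  # False
```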

The individual CVs included in a CVcollection would not record what collections they belong to. So the "collection" portion of the "header" currently found in each CV would be omitted. Each individual CV then would be independently versioned.

Of course, we could copy from the master CV repository all the CVs comprising a CVcollection and bundle those together to make the collection easy to obtain (by CMOR or ESGF, for example).
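Bundling could be as simple as the following sketch, which uses only the Python standard library. The file layout is an assumption made for illustration: each CV is taken to live in the master repository as `<name>_<version>.json`, and `bundle_collection` is a hypothetical helper, not an existing CMOR or ESGF function.

```python
import json
import zipfile
from pathlib import Path


def bundle_collection(collection, cv_root, archive_path):
    """Write a zip archive holding the CVcollection description plus every
    CV file it lists.

    `collection` is one CVcollection entry as a dict with a "contents"
    mapping of CV name to CV version. Assumes (hypothetically) that each
    CV is stored under `cv_root` as '<name>_<version>.json'.
    """
    cv_root = Path(cv_root)
    with zipfile.ZipFile(archive_path, "w") as archive:
        # Store the collection description itself alongside the CVs.
        archive.writestr("CVcollection.json", json.dumps(collection, indent=2))
        for name, version in collection["contents"].items():
            source = cv_root / f"{name}_{version}.json"
            archive.write(source, arcname=source.name)
    return archive_path
```

A consumer would then fetch one archive per CVcollection version instead of resolving each CV individually.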