WCRP-CMIP / CMIP6_CVs

Controlled Vocabularies (CVs) for use in CMIP6
Creative Commons Attribution 4.0 International

Models listed in the CVs without published data #1028

Closed matthew-mizielinski closed 2 years ago

matthew-mizielinski commented 3 years ago

Following discussion on #512 I've scraped together data from the ESGF search pages (list of source ids) and the source id list within the CVs to pull out the following table of models where no data appears to be available at the time of writing (July 2021).

This includes a number of institutions for which no data has been published for any of their models, and one institution without any registered models.

There are a total of 28 models in the table below with a further 4 registered in the last 12-18 months.

I'm not currently advocating purging all of these, but I think it is worth discussing how to handle them.

| Institution ID | Source ID | Release Year | Activity Participation | Notes |
|---|---|---|---|---|
| AWI | AWI-ESM-2-1-LR | 2019 | CMIP PMIP | |
| BNU | BNU-ESM-1-1 | 2016 | C4MIP CDRMIP CFMIP CMIP GMMIP GeoMIP OMIP RFMIP ScenarioMIP | No data for institution |
| CNRM-CERFACS | CNRM-ESM2-1-HR | 2017 | CMIP OMIP ScenarioMIP | |
| CSIR-Wits-CSIRO | VRESM-1-0 | 2016 | CMIP DAMIP HighResMIP PMIP ScenarioMIP | No data for institution |
| EC-Earth-Consortium | EC-Earth3-GrIS | 2019 | CMIP ISMIP6 PMIP | |
| | EC-Earth3-HR | 2019 | CMIP DCPP HighResMIP | |
| GFDL | GFDL-GLOBAL-LBL | 2019 | RFMIP | |
| INPE | BESM-2-9 | 2019 | CMIP DCPP ScenarioMIP | |
| IPSL | IPSL-CM7A-ATM-HR | 2019 | HighResMIP | |
| | IPSL-CM7A-ATM-LR | 2019 | HighResMIP | |
| MESSy-Consortium | EMAC-2-53-Vol | 2017 | CMIP VolMIP | No data for institution |
| | EMAC-2-54-AerChem | 2018 | AerChemMIP CMIP | |
| MIROC | MIROC-ES2H-NB | 2019 | AerChemMIP CMIP | |
| | NICAM16-9D-L78 | 2017 | CFMIP CMIP | |
| MOHC NERC | UKESM1-0-MMh | 2018 | AerChemMIP C4MIP CMIP ScenarioMIP | Data not expected |
| | UKESM1-ice-LL | 2019 | ISMIP6 | Processing in progress |
| MPI-M | ICON-ESM-LR | 2017 | CMIP OMIP SIMIP | |
| NASA-GISS | GISS-E2-2-H | 2021 | CMIP SIMIP ScenarioMIP | |
| NASA-GSFC | | | | No models registered (there is input4MIPs data) |
| NCAR | CESM2-SE | 2019 | CMIP HighResMIP | |
| NCC | NorESM2-HH | 2018 | CMIP HighResMIP | |
| | NorESM2-LME | 2017 | C4MIP CMIP GeoMIP LUMIP OMIP | |
| | NorESM2-LMEC | 2017 | AerChemMIP CMIP | |
| | NorESM2-MH | 2017 | AerChemMIP CFMIP CMIP DAMIP OMIP RFMIP ScenarioMIP | |
| PCMDI | PCMDI-test-1-0 | | | Testing record |
| PNNL-WACCEM | CAM-MPAS-HR | 2018 | HighResMIP | No data for institution |
| | CAM-MPAS-LR | 2018 | HighResMIP | |
| UofT | UofT-CCSM4 | 2014 | CMIP PMIP | No data for institution |
| UTAS | CSIRO-Mk3L-1-3 | 2006 | CMIP PMIP | No data for institution |
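
A comparison of this kind can be sketched as a set difference between the source IDs registered in the CV and those for which an ESGF index reports datasets. This is a minimal sketch under stated assumptions, not the script used to build the table: it assumes the CV file keeps its entries under a top-level `source_id` key, and that the ESGF facet counts have already been fetched as a Solr-style flat list alternating name and count. The function name `unpublished_source_ids` and the sample data are hypothetical.

```python
import json


def unpublished_source_ids(cv, facet_list):
    """Return registered source IDs with no datasets in the facet counts.

    cv         -- parsed CV JSON; assumed layout {"source_id": {name: {...}}}
    facet_list -- assumed Solr-style flat list [name, count, name, count, ...]
    """
    registered = set(cv["source_id"])
    # Pair up names (even positions) with counts (odd positions).
    counts = dict(zip(facet_list[0::2], facet_list[1::2]))
    published = {name for name, n in counts.items() if n > 0}
    return sorted(registered - published)


# Illustrative data only: these model names are placeholders, not CV entries.
cv_text = '{"source_id": {"MODEL-A": {}, "MODEL-B": {}, "MODEL-C": {}}}'
facets = ["MODEL-A", 1200, "MODEL-C", 0]

missing = unpublished_source_ids(json.loads(cv_text), facets)
print(missing)
```

A model that appears in the facet list with a count of zero is treated the same as one that is absent, since in both cases no data is available to users.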

There are also the following recent additions (2020 and 2021 release years):

| Institution ID | Source ID | Release Year | Activity Participation | Notes |
|---|---|---|---|---|
| CSIRO-COSIMA | ACCESS-OM2 | 2020 | OMIP | No data for institution |
| | ACCESS-OM2-025 | 2020 | OMIP | |
| IPSL | IPSL-CM6A-MR025 | 2021 | CMIP | |
| | IPSL-CM6A-MR1 | 2021 | CMIP | |

durack1 commented 3 years ago

@matthew-mizielinski @taylor13 let's centralize discussions here. As I noted, I already have code that pulls info from the CMIP6 (or 5, 3) indexes and will return information such as that found in durack1/CMIPOcean/CMIP_ESGF.json

taylor13 commented 3 years ago

I suggest not purging any registered source_ids or institution_ids at this time, but I think that, if practical, we should:

  1. Update the "cohort" classification for each model. I would only allow three options for CMIP6 models at this time: "registered", "DECK", or "CMIP, DECK". Models would be designated "DECK" if they have contributed results from the following 4 experiments: amip, abrupt-4xCO2, 1pctCO2, and piControl (or esm-piControl). They would be designated "CMIP, DECK" if in addition they have contributed results from historical (or esm-hist or historical-cmip5). Otherwise they would continue to be designated "registered".
  2. Encourage users to consult the "ESGF CMIP6 Data Holdings" summary at https://pcmdi.llnl.gov/CMIP6/ArchiveStatistics/esgf_data_holdings/ , and advise them that models classified only as "registered" may not have completed all the simulations needed for a baseline assessment of their suitability for use in climate research.
  3. Eliminate "cohort" from the current ESGF search interface.
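
The three-way classification proposed in item 1 can be sketched as a small function. This is only an illustration of the rule as stated above: the function and constant names are hypothetical, and the experiment IDs are assumed to be spelled as in the CMIP6 experiment_id CV (for example `abrupt-4xCO2`).

```python
# Experiments that must all be present for the "DECK" designation.
DECK_REQUIRED = ("amip", "abrupt-4xCO2", "1pctCO2")
# Any one of these satisfies the control-run requirement.
CONTROL_OPTIONS = ("piControl", "esm-piControl")
# Any one of these satisfies the historical requirement.
HISTORICAL_OPTIONS = ("historical", "esm-hist", "historical-cmip5")


def classify_cohort(experiments):
    """Return "registered", "DECK" or "CMIP, DECK" for a model.

    experiments -- iterable of experiment IDs the model has contributed
    """
    exps = set(experiments)
    has_deck = (
        all(e in exps for e in DECK_REQUIRED)
        and any(c in exps for c in CONTROL_OPTIONS)
    )
    if not has_deck:
        return "registered"
    if any(h in exps for h in HISTORICAL_OPTIONS):
        return "CMIP, DECK"
    return "DECK"


print(classify_cohort(["amip", "abrupt-4xCO2", "1pctCO2", "piControl"]))
```

Note that under this rule a model with only a historical run, and no complete DECK set, stays "registered".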

Regarding the update of the "cohort" classification, we can be guided by the initially suggested policy on cohort designations (https://goo.gl/zDHUk7; 9 January 2018):


The only choices permitted under the “Model Cohort” category are the following: DECK, CMIP6, CMIP5, CMIP3, CMIP2, CMIP1, “CMIP6-fringe”, and “Registered”. A “Model Cohort” limits a search to models that meet certain MIP criteria (for example, completion of 4 DECK experiments plus the historical simulation is usually required to be included in the “CMIP6” cohort). The CMIP panel will record and update the “Model Cohorts” that each source_id (i.e., model) belongs to in the reference source_id CV found at WCRP-CMIP/CMIP6_CVs/CMIP6_source_id.json. Only models that qualify for at least one “Model Cohort” shall be considered for inclusion in a search result. The following define the cohorts:

  1. A model that has registered its intention to participate in CMIP6 (at WCRP-CMIP/CMIP6_CVs) but has not qualified for any other cohort belongs to the “Registered” cohort.
  2. A model that completes all the DECK simulations belongs to the "DECK" cohort.
  3. A model that completes the DECK and CMIP6 historical simulations belongs to the "CMIP6" cohort. The CMIP Panel may choose to relax this requirement on a case-by-case basis and designate some models as belonging to the CMIP6 cohort even if only a subset of the “required” simulations has been completed.
  4. A model that fails to qualify for the “DECK” or “CMIP6” cohorts but performs at least one of the CMIP6 experiments and does meet the criteria of at least one of the endorsed MIPs in CMIP6 belongs to the "CMIP6-fringe" cohort.
  5. A model that participated in CMIP5, CMIP3, CMIP2, and/or CMIP1, belongs to the correspondingly named cohort.
  6. A model may belong to multiple cohorts (e.g., “CMIP6, DECK”).

MartinaSt commented 3 years ago

I agree with Karl for 1. and 2.

  1. ESGF search facet: I would keep it. The updated "Model Cohort" information is a quality criterion for a model contribution in terms of completeness and compliance with the CMIP6 guidelines. ESGF users benefit from the possibility to restrict the search without having to consult an external resource. And more importantly, we can encourage modeling centers to comply with CMIP guidelines by displaying this kind of quality flag for model contributions in the ESGF portals.
    • We might consider renaming the ESGF search facet. All of course only if practical.

taylor13 commented 3 years ago

Yes, I agree the Model Cohort could provide information of value to users. The reasons for possibly removing it as a search facet are:

  1. Software would need to be written to automatically update the ESGF database (Solr?) to reflect the changes to the source_id CV (json file) made when a model's cohort status changed.
  2. All index nodes would have to implement the updates.
  3. If 1 and 2 cannot be accomplished practically, then the information contained in the current "Model Cohort" list on ESGF will be incorrect (since all models are currently only designated as being "registered"). If the information is wrong, perhaps we should hide it from the users by removing the facet. (Of course that will require changes that would affect all tier 1 nodes at least and also may be impractical.)

Perhaps Sasha might say if any of the above is based on my misunderstanding ESGF.

durack1 commented 3 years ago

@sashakames there is a query above directed your way

sashakames commented 3 years ago

We can achieve it if it's worth the effort. (1) is easier to do than (2). It could take weeks at LLNL for scripts to complete for all our 5M replica records. I'm open to dropping the facet.

sashakames commented 3 years ago

Sorry, I must have too much else on my mind... There is a simple command to update all records that match a query. So each site just needs to re-run a query/update operation periodically. If we have new records published in the correct cohort, we can drop the need to make the corrections.

taylor13 commented 3 years ago

So I take it 2 is easier than 1?

sashakames commented 3 years ago

Other way around: (2) involves herding cats. I should also mention that we need to check the performance implications of doing updates in bulk, which complicates things.

taylor13 commented 3 years ago

Got it. Executing (2) is technically trivial; getting folks to execute it could be difficult. On the other hand (1) requires some effort by PCMDI to write scripts: 1) to periodically check the ESG database and update the source_id CV so that it reflects the true "cohort" status for each model, and then 2) to transfer the updates from the CV to ESG and correct the ESG archive's database index.

(Again, @sashakames, I've probably not understood, so please correct, as needed, the above.)

sashakames commented 3 years ago

I was thinking that 1.2 (the ESGF index update phase) would not be too challenging for me to implement. As for the query part, 1.1: doesn't ChrisM's "Big Table" already have this, i.e. the experiments for each model? If so we could leverage that, but I wouldn't consider performing the queries too challenging if need be.

To clarify the concern: the bulk updates might time out if each one has hundreds of thousands of records to process. If this is problematic we would need to play with the granularity of the updates (e.g., do one experiment at a time).

Ideally once a model has changed cohort, we ask them to update their publisher config to have the cohort value set correctly, then we don't need to correct them again until the next change. And same goes for replica publishing.
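
The idea of controlling update granularity can be sketched independently of any ESGF tooling. The sketch below only plans the batches: it groups records by experiment and caps each batch at a chosen size. It does not call the index, and the function name, the record layout (ID and experiment pairs), and the `max_batch` parameter are all assumptions made for illustration.

```python
from collections import defaultdict


def plan_cohort_updates(records, max_batch):
    """Group record IDs by experiment, then split each group into batches.

    records   -- iterable of (record_id, experiment_id) pairs
    max_batch -- largest number of records to send in one update call
    """
    if max_batch < 1:
        raise ValueError("max_batch must be at least 1")
    by_experiment = defaultdict(list)
    for record_id, experiment_id in records:
        by_experiment[experiment_id].append(record_id)
    batches = []
    for experiment_id in sorted(by_experiment):
        ids = by_experiment[experiment_id]
        # Slice each experiment's records into chunks of at most max_batch.
        for start in range(0, len(ids), max_batch):
            batches.append((experiment_id, ids[start:start + max_batch]))
    return batches


# Placeholder record IDs; a real run would take them from an index query.
records = [("r1", "amip"), ("r2", "amip"), ("r3", "amip"), ("r4", "piControl")]
for experiment_id, ids in plan_cohort_updates(records, max_batch=2):
    # A real script would issue one index update per batch here.
    print(experiment_id, ids)
```

Keeping the planning step separate from the update call makes it easy to tune the batch size if timeouts are observed.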

durack1 commented 3 years ago

A specific case that needs to be accounted for is https://github.com/WCRP-CMIP/CMIP6_CVs/issues/512

taylor13 commented 3 years ago

From WIP meeting discussion:

durack1 commented 2 years ago

As part of #1066 models that have no published data on ESGF have been left as "cohort" = ["Registered"], whereas models that have data have been updated to "cohort" = ["Published"].

It would be possible to contact the modeling groups of the non-published models, but I'm not sure we'd want to deregister any specific model.

durack1 commented 2 years ago

All models that do not currently have data available anywhere on ESGF now have the entry "cohort": ["Registered"].

All models that have data have an updated entry as "Published" - as per #1066
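
The resulting two-state scheme can be illustrated with a short sketch that sets each model's `cohort` entry from a set of source IDs known to have data. This is not the code used in #1066: the function name is hypothetical, the top-level `source_id` key is an assumed layout for the CV file, and the model names are placeholders.

```python
import copy


def set_cohorts(cv, published_ids):
    """Return a copy of the CV with each model's cohort set from published_ids.

    cv            -- parsed CV JSON; assumed layout {"source_id": {name: {...}}}
    published_ids -- set of source IDs that have data available on ESGF
    """
    updated = copy.deepcopy(cv)  # leave the caller's data untouched
    for name, entry in updated["source_id"].items():
        entry["cohort"] = ["Published"] if name in published_ids else ["Registered"]
    return updated


# Placeholder entries, not real CV content.
cv = {
    "source_id": {
        "MODEL-A": {"cohort": ["Registered"]},
        "MODEL-B": {"cohort": ["Registered"]},
    }
}
new_cv = set_cohorts(cv, {"MODEL-A"})
print(new_cv["source_id"]["MODEL-A"]["cohort"])
print(new_cv["source_id"]["MODEL-B"]["cohort"])
```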

durack1 commented 2 years ago

An email was sent out today requesting an update for the 28 models that currently have no data published on ESGF. The request was for data to be published, or for deregistration to occur. Once we have intel from these contacts, we can amend as required and close out this issue.

durack1 commented 2 years ago

@matthew-mizielinski I am closing this as a (partial) duplicate of #1050, which includes the table of 28 models that are registered but have no data published on ESGF; that list is now down to 14 in the updated table below. The process of identifying these models, and either deregistering them or awaiting an update on imminent publication, is already underway and noted in #1076, #1078, #1079, #1083, and #1086, along with the NorESM2* deregistrations (see #1079/#1084).

Updated 220701 - last merged PR #1126

| issue/PR | status | source_id | MIPs | LLNL files | ESGF datasets | contact status |
|---|---|---|---|---|---|---|
| #1076 | awaiting publication | EC-Earth3-GrIS | CMIP ISMIP6 PMIP | - | none | |
| #1076 | awaiting publication | EC-Earth3-HR | CMIP DCPP HighResMIP | - | none | |
| #1083 | awaiting update | GFDL-GLOBAL-LBL | RFMIP | - | none | |
| #1078 | awaiting publication | IPSL-CM6A-ATM-ICO-HR | HighResMIP | - | none | |
| #1078 | awaiting publication | IPSL-CM6A-ATM-ICO-LR | HighResMIP | - | none | |
| #1078 | awaiting publication | IPSL-CM6A-ATM-ICO-MR | HighResMIP | - | none | |
| #1078 | awaiting publication | IPSL-CM6A-ATM-ICO-VHR | HighResMIP | - | none | |
| #1078 | awaiting publication | IPSL-CM6A-ATM-LR-REPROBUS | AerChemMIP | - | none | |
| #1078 | awaiting publication | IPSL-CM6A-MR025 | CMIP | - | none | |
| #1078 | awaiting publication | IPSL-CM6A-MR1 | CMIP | - | none | |
| #1105 | awaiting publication | CAM-MPAS-HR | HighResMIP | - | none | Ruby, Bryce, Koichi emailed |
| #1105 | awaiting publication | CAM-MPAS-LR | HighResMIP | - | none | Ruby, Bryce, Koichi emailed |
| #1116 #1117 | awaiting publication | AWI-ESM-2-1-LR | CMIP PMIP | - | none | Tido, Christian, Gerrit, Martin, Paul and Christopher emailed |
| #1093 | deregistered | NICAM16-9D-L78 | CFMIP CMIP | - | none | |
| #1087 | deregistered | NorESM2-HH | CMIP HighResMIP | - | none | |
| #1100 #1103 | deregistered | BNU-ESM-1-1 | C4MIP CDRMIP CFMIP CMIP GMMIP GeoMIP OMIP RFMIP ScenarioMIP | - | none | Duoying emailed |
| #1102 #1106 | deregistered | CESM2-SE | CMIP HighResMIP | - | none | Gokhan & Gary emailed |
| #1101 #1104 | deregistered | CNRM-ESM2-1-HR | CMIP OMIP ScenarioMIP | - | none | David, Gaelle, Laurent and Marie-Pierre emailed |
| #1111 #1112 | deregistered | EMAC-2-53-Vol | CMIP VolMIP | - | none | 5 addresses emailed |
| #1111 #1112 | deregistered | EMAC-2-54-AerChem | AerChemMIP CMIP | - | none | 5 addresses emailed |
| #1122 #1123 | deregistered | VRESM-1-0 | CMIP DAMIP HighResMIP PMIP ScenarioMIP | - | none | Francois/Pedro emailed - deregister 220630 |
| #1086 #1124 | deregistered | UofT-CCSM4 | CMIP PMIP | - | none | dchandan pinged - deregister 220630 |
| #1120 #1125 | deregistered | BESM-2-9 | CMIP DCPP ScenarioMIP | - | none | Andre & Paulo emailed - deregister 220630 |
| #1121 #1126 | deregistered | CSIRO-Mk3L-1-3 | CMIP PMIP | - | none | Steve emailed - deregister 220630 |

durack1 commented 2 years ago

@matthew-mizielinski I realised that closing this wasn't the best idea, as we need somewhere to keep track of the remaining unresolved/deregistrations, so will reopen and update the table above as required. 12 remaining questions to answer.

durack1 commented 2 years ago

@matthew-mizielinski et al, all models with no data and no intention to publish data imminently have now been deregistered, so I can close out this issue, with the remaining license updates to be dealt with by #1113.