WCRP-CMIP / CMIP6Plus_CVs

Controlled Vocabularies (CVs) for use in CMIP6Plus
Creative Commons Attribution 4.0 International

Source_id augmentation to serve needs of ES-Doc #8

Open durack1 opened 1 year ago

durack1 commented 1 year ago

@bnlawrence @davidhassell @matthew-mizielinski @taylor13 thanks for the chat today gents.

It would be useful to centralize discussions about augmenting the CVs into this repo, you will see a number of issues already exist, and @matthew-mizielinski has a prototype/placeholder branch to kick things off which we can iterate over.

@bnlawrence I had thought to drop your *.pdf attachment here, but might let you organize your content the way you like - feel free to submit a PR, we could start collecting stuff into a src or equivalent subdir if that makes sense.

Adding a link to Karl's 2019 Google doc that outlines a suggested framework for moving forward.

Just dropping a second link to a document CMIP6 Infrastructure Component Dependencies and Version Management Strategy that summarizes a number of the connections between infrastructure in CMIP6 - a good place to start from when thinking about what content is contained in the CVs and how this content serves other downstream services.

bnlawrence commented 1 year ago

This is the document we discussed, ES-DOC and CMIP source vocabularies.pdf

durack1 commented 1 year ago

Also adding the very useful citation-relevant spreadsheet link here from @MartinaSt!

MartinaSt commented 1 year ago

And adding, as a solution for the main pain point from the citation service:

An unchanging identifier should be added to each entry for those CVs where short_names might change over time, e.g. source_id, institution_id.

This identifier enables automated updates of these contents in other services like citation. Currently, a manual check is required to determine whether a "new" model short_name really is a new model or an existing one with an updated short_name.
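A minimal sketch of how such an unchanging identifier would enable automated update checks - the field names (`uid`, `short_name`) and values here are hypothetical, not agreed CV schema:

```python
# Hypothetical CV snapshots: "uid" is the proposed unchanging identifier,
# while "short_name" (e.g. a source_id) may be revised over time.
old_cv = {
    "abc123": {"uid": "abc123", "short_name": "MODEL-1-0"},
}
new_cv = {
    "abc123": {"uid": "abc123", "short_name": "MODEL-1-0-LL"},  # renamed entry
    "def456": {"uid": "def456", "short_name": "MODEL-2-0"},     # genuinely new entry
}

def diff_cv(old, new):
    """Classify entries by uid: renamed existing entries vs. newly added ones."""
    renamed = [u for u in new
               if u in old and new[u]["short_name"] != old[u]["short_name"]]
    added = [u for u in new if u not in old]
    return renamed, added

renamed, added = diff_cv(old_cv, new_cv)
print(renamed)  # ['abc123']
print(added)    # ['def456']
```

Without the `uid`, the rename in `abc123` would be indistinguishable from a deletion plus a new entry, which is exactly the manual check described above.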

durack1 commented 1 year ago

It would also be useful for the institution_id entries to have a URL and ROR (if these exist), and potentially for source_id entries to have a field for additional resources, such as the info for the HadGEM3-GC31-LL configs at https://ukesm.ac.uk/cmip-es-documentation/ - this might be a way of capturing info that already exists early on, and as other model documentation libraries are built, these entries could be augmented.

taylor13 commented 1 year ago

As part of the source documentation effort, I envisioned the "source_id CV" becoming a "source registry", a name that better reflects that it should contain much more than a "controlled vocabulary" and that much of the content is "registered" by contributors rather than by the registry manager(s)/administrator(s).

I also envisioned that each model would register information critical to making sense of its results. This would include to what extent and how it conformed to the experiment requirements. It would also need to state what each "i", "p", and "f" value in the "ripf" variant labels means.

For each “collection” in the data archive a source registry could be defined, but more commonly multiple collections that are part of a compound collection will all choose to adopt the same registry. For CMIP7, it is likely that at least 3 different source registries will be defined, one each for: models, input4MIPs sources, and obs4MIPs sources.

In order to sensibly make use of the datasets in the archive, one needs to know for each source, its name and origin, its general characteristics, and for climate models, what major component models are included, as well as for each simulation it performs what specialized initialization procedures, physics representations and forcing have been applied. All of these could appear in a source registry:

  1. Family name (usually omitting a version indicator)
  2. Version (modifier of family name usually used in constructing the label)
  3. label
  4. source_id (same as label, but with forbidden characters replaced by “-“; e.g., “UKESM1-0” instead of “UKESM1.0”)
  5. extended label
  6. full name
  7. release year
  8. license info, including exceptions contact, history, license label (id), license text, license URL, and data-provider-specific info
  9. reference (usually a URL that includes a doi or a link to a reasonably permanent webpage)
  10. predecessors
  11. institutions
  12. contact info, including name, email, address, phone
  13. Product “type” produced by the source. Current options: a. Forcing datasets b. Model experiment results c. Observations
  14. Components (for models, the component atmosphere, ocean, land, … models), providing for each a family name, version, grid/resolution info, and a brief description.
  15. Definitions of the “i” and “p” indices used in distinguishing among various simulations by the model.
  16. Simulation conformance information (for each activity/experiment/sub-experiment contributed to):
     a. Active model components (or possibly source type?)
     b. Variant status, categorizing each "ripf" ensemble member as: a "preferred" run (in the eyes of the data provider), "compliant" with the experiment specifications, or "noncompliant". Any "preferred" runs must also be "compliant".
     c. Calendar (with a value drawn from the CF convention CV)
     d. Start time
     e. End time
     f. Parent collection
     g. Parent experiment
     h. Parent variant
     i. Time spawned by parent
     j. Variant specialization (needed only if different values of 16a-i above apply to different ensemble members; for each variant needing it, the same keys listed in a-i could be given values different from the default)
     k. Association of forcing indexes with defined forcing packages and registered forcing datasets. For each forcing label (e.g., "f1", "f2") appearing in a source variant, specify:
        • forcing status (compliant or noncompliant with experiment specs)
        • the registered forcing suite defining all forcing datasets needed for the experiment
        • a list of input4MIPs datasets added to the registered suite
        • a list of input4MIPs datasets removed from the registered suite
        • a list of input4MIPs datasets substituted for one of the datasets in the suite
        • notes providing additional information about forcing that might be useful to those analyzing the simulations relying on it

Consider also including:

  1. Activity_participation
  2. Cohort (not sure this is needed; could point to a cohort registry indicating criteria for belonging to a cohort).

One rendering of this information can be found in the source registry example given in https://docs.google.com/document/d/17cYDEBCXUxcOq8pqqmBNHqKUFvPIQqKhWaBKz01DI4E/edit . Note that in defining the forcing labels (see 16k above), a “forcing suites registry” is referenced and in defining model components, a grid registry is referenced. Examples of these are also included in the referenced document.
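As a rough illustration only (the field names are hypothetical, not an agreed schema), the first few registry fields above, and the rule in point 4 that source_id is the label with forbidden characters replaced by "-", might look like:

```python
import json

# Hypothetical source-registry entry covering points 1-7 above;
# none of these field names or values are agreed CV content.
entry = {
    "family_name": "UKESM",
    "version": "1.0",
    "label": "UKESM1.0",
    "source_id": "UKESM1-0",   # label with forbidden characters replaced by "-"
    "release_year": 2019,
    "institutions": ["MOHC", "NERC"],
}

def label_to_source_id(label):
    """Replace characters forbidden in a source_id (e.g. '.') with '-'."""
    return "".join(c if c.isalnum() else "-" for c in label)

# The source_id should be mechanically derivable from the label.
assert label_to_source_id(entry["label"]) == entry["source_id"]
print(json.dumps(entry, indent=2))
```

Keeping the label-to-source_id transformation mechanical, as sketched here, would let registry tooling validate that the two fields never drift apart.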

bnlawrence commented 1 year ago

This is a full amalgamation of most of es-doc into the source registry, which is certainly one way of upping the ante on getting the information. However, one of the issues buried in this proposal is the separation of "who knows what when?", which was a key tenet of es-doc thinking (and of our experience working with the modelling groups).

The underlying assumption here is that the registration of the source will happen when all the simulations are known - which is the only way the i/p and conformance information can be known when first registered. Of course one could have a process for updating the source description for new simulations/experiments as they are run/supported, but I think this breaks the principle of maintainability of information. I'd rather these were separate documents (i.e. one for the list of r/i/p and one for conformance information) - then the source model description doesn't get updated all the time (I am uneasy about that changing over time when it shouldn't, and not changing when it should).

I'd also rather we didn't duplicate information (e.g. the parent variant and simulation times are in the data files and easily extracted from there).

(Could you please extract the "rendering of this information" example from the long google document, I'm not sure which one to look at).

taylor13 commented 1 year ago

You make some good points. I had also thought about providing some of the conformance info. in a separate registry, which, as you say, would need to be updated much more frequently than a less comprehensive source registry. It does break up the model information across multiple registries, which is somewhat less convenient to a human attempting to find out information about a model simulation from the file, but your "maintainability" argument perhaps trumps that.

I also had thought about the point about "duplicate" information. The motivation is that certain metadata found in the files seemed to be difficult for some modelers to get right. Branch times and "parents", for example, were often reported incorrectly. And some information that users seem to want regarding a simulation (e.g., the first and last time slice reported) is difficult to extract without downloading the files (or parsing all the filenames in a dataset).

I was hoping that the registry could provide the reference values for some of this information, overriding the values stored in the files (if those were wrong). A modeling group could easily update this information in the registry; it would be much more difficult (and many groups will simply fail) to rewrite the files themselves correcting the attributes.
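The override idea above can be sketched as a simple precedence rule (the attribute names are real CMIP6 global attributes, but the values and merge logic here are purely illustrative):

```python
# Sketch: registry reference values take precedence over possibly-incorrect
# file attributes, so groups can correct metadata without rewriting files.
file_attrs = {
    "branch_time_in_parent": 0.0,          # wrong value written into the files
    "parent_variant_label": "r1i1p1f1",
}
registry = {
    "branch_time_in_parent": 54786.0,      # corrected reference value
}

def effective_metadata(file_attrs, registry):
    """Merge file attributes with registry entries; registry wins on conflict."""
    return {**file_attrs, **registry}

print(effective_metadata(file_attrs, registry))
# {'branch_time_in_parent': 54786.0, 'parent_variant_label': 'r1i1p1f1'}
```

The design point is that the registry stores only the overrides, so an empty registry entry leaves the file attributes authoritative.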

Happy to discuss and iterate further. Will also extract my attempt to capture the source information. [after posting, edited to replace "less inconvenient" with "less convenient" ... sorry about that.]

bnlawrence commented 1 year ago

I agree that stuff in multiple places is not how we want to communicate it. The issue as usual is that what we want to do for the producer is not the same as what we want to do for the consumer. I think we have to up our game in terms of easy-to-use tools; for example, it should be trivial to pull any stuff we want for the source registries (plural important) out of the data files, so no one types the stuff in more than once. More on this anon.

MartinaSt commented 1 year ago

I'd like to add some comments on "maintainability", "responsibility", and the purpose or function of the CV. So I'll step back a little from the source_id content discussion and share some general remarks:

  1. Information reused across phases, MIPs, infrastructure components, or even within an individual component - such as persons, institutions, or references: we should have a central database maintaining this information, with the required content for all partners. With persons there might be some GDPR concerns.
  2. Dependencies between CVs: to make the CVs useful for data citation, the combinations of CVs in the DRS are important. The DRS only specifies the order of the CVs, but the CVs also need to contain the allowed combinations. Currently we have: institution-source-activity and activity-experiment. Relying on information not contained in the CVs needs to be avoided - e.g., publishing an experiment co-listed in two MIPs under one specific MIP even if the model does not participate in it.
  3. Add unique non-changing ids to every source_id entry: as source_ids change with changing model versions, they are not reliable as keys for accessing the CV. An additional id solves that problem. With the growing complexity of the CVs, we should consider defining a metadata schema and using a relational database.
  4. ES-DOC vs. source_id CV: I think of the CV as the provider of basic information on activities and on the participants with their contributions. Details should remain in ES-DOC. Therefore I suggest that the source_id contain a reference to an ES-DOC DOI.
  5. Responsibilities / Functionalities of different central components (in my view):
    • Central and heavily reused information should be maintained by the IPO together with contact information. I think of persons, institutions, and references as the central information to serve to the infrastructure partners.
    • The CVs contain public basic content on activities and on the participants with their contributions maintained by IPO.
    • An index (could be the ESGF index) connects data and related information and makes them accessible in a GUI and through an API. Apart from the dataset information, these are ES-DOC, data citation, and the CV. Currently, furtherinfo connected to the files provides a GUI for information, as does the citation landing page. The citation URL in the ESGF index provides machine access. But in the end, we need to reduce interfaces/exchange of information and define where the "master copy" is.
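The "allowed combinations" idea in point 2 above can be sketched as a lookup keyed on the combination rather than on individual CVs - the institution/source/activity values below are illustrative only, not real CV content:

```python
# Sketch: a CV recording allowed institution-source-activity combinations,
# not just flat lists of valid values. Entries here are illustrative.
allowed = {
    ("INST-A", "MODEL-1-0"): {"CMIP", "ScenarioMIP"},
    ("INST-B", "MODEL-2-0"): {"CMIP"},
}

def combination_ok(institution_id, source_id, activity_id):
    """Check an institution-source-activity triple against the allowed set."""
    return activity_id in allowed.get((institution_id, source_id), set())

print(combination_ok("INST-A", "MODEL-1-0", "ScenarioMIP"))  # True
print(combination_ok("INST-B", "MODEL-2-0", "ScenarioMIP"))  # False
```

With a structure like this, a publication of an experiment under an activity the model does not participate in would fail validation instead of slipping through.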

One more specific comment: