Open durack1 opened 1 year ago
This is the document we discussed, ES-DOC and CMIP source vocabularies.pdf
Also adding the very useful citation-relevant spreadsheet link here from @MartinaSt!
And adding a proposed solution to the main pain point for the citation service:
An unchanging identifier should be added to each entry in those CVs where short_names might change over time, e.g. source_id, institution_id.
This identifier would enable automated updates of these contents in other services such as citation. Currently, a manual check is required to determine whether a "new" model short_name is really new or an existing one with an updated short_name.
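To illustrate how such an identifier could remove the manual check, here is a minimal sketch in Python. The `uid` field, the `CVEntry` class, and the example identifiers and short names are all hypothetical; nothing here reflects an existing CV schema.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CVEntry:
    uid: str         # hypothetical unchanging identifier, never reused
    short_name: str  # e.g. a source_id or institution_id; may change


def classify_changes(old, new):
    """Compare two CV snapshots by uid and separate renames from new entries."""
    old_by_uid = {entry.uid: entry for entry in old}
    new_by_uid = {entry.uid: entry for entry in new}
    added = sorted(uid for uid in new_by_uid if uid not in old_by_uid)
    removed = sorted(uid for uid in old_by_uid if uid not in new_by_uid)
    renamed = {
        uid: (old_by_uid[uid].short_name, new_by_uid[uid].short_name)
        for uid in old_by_uid
        if uid in new_by_uid
        and old_by_uid[uid].short_name != new_by_uid[uid].short_name
    }
    return {"added": added, "removed": removed, "renamed": renamed}


# Made-up snapshots: one entry is renamed, one entry is genuinely new.
old_cv = [CVEntry("src-0001", "MODEL-A"), CVEntry("src-0002", "MODEL-B")]
new_cv = [
    CVEntry("src-0001", "MODEL-A-v2"),
    CVEntry("src-0002", "MODEL-B"),
    CVEntry("src-0003", "MODEL-C"),
]
changes = classify_changes(old_cv, new_cv)
```

With only short_names available, "MODEL-A-v2" and "MODEL-C" would both look like new models. Keyed on the identifier, a downstream service can tell them apart without human intervention.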
It would be useful for the institution_id to have a URL and ROR (if these exist), and potentially for the source_id to have an entry for additional resources, such as the info for the HadGEM3-GC31-LL configs at https://ukesm.ac.uk/cmip-es-documentation/ - this might be a way of catching info that exists early, and as other model doc libraries are being built, these entries could be augmented.
As part of the source documentation effort, I envisioned “source_id CV” becoming a "source registry", a name that better reflects that it should contain much more than “controlled vocabulary” and that much of the content is "registered" by contributors rather than the registry manager(s)/administrator(s).
I also envisioned that each model would register information critical to making sense of its results. This would include to what extent and how it conformed to the experiment requirements. It would need to state what each “i”, “p”, and “f” value in the “ripf” variant labels represents.
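As a rough sketch of what registering the meaning of each index might enable, the Python below parses a "ripf" variant label and looks up descriptions for its “i”, “p”, and “f” values. The layout of `index_meanings` and the description text are assumptions made for illustration, not a proposed registry format.

```python
import re
from typing import NamedTuple

# Variant labels have the form r<N>i<N>p<N>f<N>, e.g. "r1i1p1f2".
VARIANT_RE = re.compile(r"r(\d+)i(\d+)p(\d+)f(\d+)")


class Variant(NamedTuple):
    realization: int
    initialization: int
    physics: int
    forcing: int


def parse_variant_label(label):
    """Split a ripf variant label into its four integer indices."""
    match = VARIANT_RE.fullmatch(label)
    if match is None:
        raise ValueError(f"not a valid ripf variant label: {label!r}")
    return Variant(*(int(group) for group in match.groups()))


def describe_variant(label, index_meanings):
    """Look up registered descriptions for the i, p, and f indices.

    index_meanings is a hypothetical mapping such as
    {"f": {2: "some description"}}; missing entries are flagged.
    """
    variant = parse_variant_label(label)
    indices = {
        "i": variant.initialization,
        "p": variant.physics,
        "f": variant.forcing,
    }
    return {
        key: index_meanings.get(key, {}).get(value, "undocumented")
        for key, value in indices.items()
    }


# Made-up registry content for a single model.
meanings = {"f": {2: "hypothetical description of forcing variant 2"}}
summary = describe_variant("r1i1p1f2", meanings)
```

Here any index the model has not described comes back as "undocumented", which is one way a registry could surface gaps in what a group has registered.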
For each “collection” in the data archive a source registry could be defined, but more commonly multiple collections that are part of a compound collection will all choose to adopt the same registry. For CMIP7, it is likely that at least 3 different source registries will be defined, one each for: models, input4MIPs sources, and obs4MIPs sources.
In order to sensibly make use of the datasets in the archive, one needs to know for each source, its name and origin, its general characteristics, and for climate models, what major component models are included, as well as for each simulation it performs what specialized initialization procedures, physics representations and forcing have been applied. All of these could appear in a source registry:
Consider also including:
One rendering of this information can be found in the source registry example given in https://docs.google.com/document/d/17cYDEBCXUxcOq8pqqmBNHqKUFvPIQqKhWaBKz01DI4E/edit . Note that in defining the forcing labels (see 16k above), a “forcing suites registry” is referenced and in defining model components, a grid registry is referenced. Examples of these are also included in the referenced document.
This is a full amalgamation of most of es-doc into the source registry, which is certainly one way of upping the ante on getting the information. However, one of the issues which is buried in this proposal is a separation of "who knows what when?" which was a key tenet of es-doc thinking (and our experience working with the modelling groups).
The underlying assumption here is that the registration of the source will happen when all the simulations are known - which is the only way the i/p and conformance information can be known when first registered. Of course one could have a process for updating the source description for new simulations/experiments as they are run/supported, but I think this breaks the principle of maintainability of information. I'd rather these were separate documents (i.e. one for the list of r/i/p and one for conformance information) - then the source model description doesn't get updated all the time (I am uneasy about that changing over time when it shouldn't, and not changing when it should).
I'd also rather we didn't duplicate information (e.g. the parent variant and simulation times are in the data files and easily extracted from there).
(Could you please extract the "rendering of this information" example from the long Google document? I'm not sure which one to look at.)
You make some good points. I had also thought about providing some of the conformance info. in a separate registry, which, as you say, would need to be updated much more frequently than a less comprehensive source registry. It does break up the model information across multiple registries, which is somewhat less convenient to a human attempting to find out information about a model simulation from the file, but your "maintainability" argument perhaps trumps that.
I had also thought about the point about "duplicate" information. The motivation is that some modelers seemed to find certain metadata in the files difficult to get right. Branch times and "parents", for example, were often reported incorrectly. And some information that users seem to want to extract regarding a simulation (e.g., first and last time slice reported) is difficult to extract without downloading the files (or parsing all the filenames in a dataset).
I was hoping that the registry could provide the reference values for some of this information, overriding the values stored in the files (if those were wrong). A modeling group could easily update this information in the registry; it would be much more difficult (and many groups will simply fail) to rewrite the files themselves correcting the attributes.
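A minimal sketch of that override idea, assuming the registry can be read as a simple mapping of attribute names to reference values. The function name, the mapping layout, and the example values are hypothetical; the attribute names are used only as illustrations of the kind of metadata discussed above.

```python
def resolve_metadata(file_attrs, registry_overrides):
    """Merge file attributes with registry reference values.

    Registry values win wherever they are present. The second return
    value records each attribute whose file value was overridden,
    as (value_in_file, reference_value).
    """
    resolved = dict(file_attrs)
    corrected = {}
    for key, reference in registry_overrides.items():
        if key in file_attrs and file_attrs[key] != reference:
            corrected[key] = (file_attrs[key], reference)
        resolved[key] = reference
    return resolved, corrected


# Made-up example: the file carries a wrong branch time, and the
# modeling group has registered the corrected value.
attrs_in_file = {
    "experiment_id": "historical",
    "parent_variant_label": "r1i1p1f1",
    "branch_time_in_parent": 0.0,
}
registry_values = {"branch_time_in_parent": 36500.0}
resolved, corrected = resolve_metadata(attrs_in_file, registry_values)
```

Keeping the `corrected` record means a consumer can see that the file and the registry disagree, rather than having the discrepancy silently hidden.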
Happy to discuss and iterate further. Will also extract my attempt to capture the source information. [after posting, edited to replace "less inconvenient" with "less convenient" ... sorry about that.]
I agree that stuff in multiple places is not how we want to communicate it. The issue, as usual, is that what we want to do for the producer is not the same as what we want to do for the consumer. I think we have to up our game in terms of easy-to-use tools; for example, it should be trivial to pull anything we want for the source registries (plural important) that is already in the data files out of those files, so no one types the stuff in more than once. More on this anon.
I'd like to add some comments on "maintainability", "responsibility" and the purpose or function of the CV. So I will step back a little from the source_id content discussion and share some general remarks:
One more specific comment:
@bnlawrence @davidhassell @matthew-mizielinski @taylor13 thanks for the chat today gents.
It would be useful to centralize discussions about augmenting the CVs into this repo; you will see a number of issues already exist, and @matthew-mizielinski has a prototype/placeholder branch to kick things off which we can iterate over.
@bnlawrence I had thought to drop your *.pdf attachment here, but might let you organize your content the way you like - feel free to submit a PR, we could start collecting stuff into a src or equivalent subdir if that makes sense.
Adding a link to Karl's 2019 google doc that outlines a framework suggestion moving forward
Just dropping a second link to a document CMIP6 Infrastructure Component Dependencies and Version Management Strategy that summarizes a number of the connections between infrastructure in CMIP6 - a good place to start from when thinking about what content is contained in the CVs and how this content serves other downstream services.