Rationale for new CMOR tables

@durack1 @matthew-mizielinski - Before Dan gets into trying to clean up and implement a new set of CMOR tables, I think it must be made clear what the rationale for and consequences of doing this are. What was wrong with the CMIP6 tables and why will the new tables be better? The following questions come to mind, which I think should be addressed before proceeding:

What is the purpose of grouping variables in tables? And what are the advantages of the new tables compared with the old tables?
a. Would table names continue to be used to uniquely label variables (root name + table name)?
b. Would table names be used as search facets? In file names? In directory structures?
If the new tables are used to uniquely label variables, won’t that mean modeling groups having to do quite a bit of work (in their processing procedures) to translate from the old CMIP6 labels to the new labels? When MIPs leaders already familiar with the CMIP6 tables develop new lists of requested variables, won’t they have to consult a look-up table to see what the new table name is?
How will it be decided which “realm” a variable belongs to? Some variables, for example, play a critical role in both atmospheric chemistry and aerosols. (Does SO2 concentration belong in the chemistry or aerosol realm? What about radiation, which is important for dynamics, chemistry, and aerosols?) What realm do variables representing transfer processes from one realm to another belong to? (Does surface evaporation belong to the atmosphere, the ocean, the land, or what?) How do we know what table to put a variable in if there are no objective rules for defining its realm? (I have spent many hours trying to come up with a manageable set of rules for objectively determining a variable’s realm without success.) I don’t think we should continue to base our future handling of variables for MIPs on realms that are somewhat arbitrarily defined.
Is the new approach for defining CMOR tables scalable? Currently there are about 75 tables proposed, but I think that over time the number might easily double and potentially grow much larger. If I can count accurately, we have 9 “realms” (AC, AE, AP, OB, OP, LP, GI, Li, SI), 9 “frequencies” (subhr, hr, 3hr, 6hr, day, mon, yr, dec, fx), 4 “temporal sampling” options (avg, point, monClim, monDiurnal), and 3 “spatial” options (normal, zonal-mean, model-level). That means potentially 9x9x4x3=972 tables.
When a variable appears in more than one table (e.g., air temperature, sampled at different frequencies), how will we ensure that its metadata (besides “frequency”) is identical across all the tables? The checking process would seem to be labor intensive. Moreover, if someone proposes the addition of a new variable (say, an atmospheric monthly mean field), they could have to check more than a dozen tables to make sure the variable doesn’t already appear in one of the other tables. This is necessary because if the same quantity already appears in one of the other tables, then certain metadata should be defined consistently with that of the existing variable. Again, checking this would be labor intensive.
New tables would prevent publication on ESGF of new data with CMIP6 data. Is that o.k.?

@taylor13, a very quick reply. The goal of the new tables is to weed out the logical inconsistencies in the existing CMIP6 tables (E*), in addition to folding all the other tables in use (input4MIPs, obs4MIPs, ...) into a single, centralized entry to enable use across projects. This means that the evolution of the quantities and their organization can be centrally managed across projects, ensuring that a tas across projects is the same quantity, with the same associated information, but customized for the project (à la project CVs).

I understood generally that purpose, but I think we need to explore the consequences by answering the specific questions posed.

First up, I'm answering mostly in the context of CMIP6Plus here rather than CMIP7. File naming and DRS structure should be a project-specific decision (consider CORDEX, with its different naming requirements and directory structures). CMIP6Plus is going to have to use CMOR 3.7 compatible tables in order for us to be able to do anything in the near future, so we can develop the underlying tables separately provided that there are tools to export "legacy" table sets from MIP tables plus CVs.

What is the purpose of grouping variables in tables? And what are the advantages of the new tables compared with the old tables?

We are very used to tables and have infrastructure built up to work with them. The main purpose of the new tables is to re-arrange variables such that we can logically add new variables to serve the needs of other projects.

a. Would table names continue to be used to uniquely label variables (root name + table name)?

Yes. When coming up with a variable list this is the simplest way to specify a variable.

b. Would table names be used as search facets? In file names? In directory structures?

Yes x 3 provided this is useful to the user, but again it depends what the project involved wants.

If the new tables are used to uniquely label variables, won’t that mean modeling groups having to do quite a bit of work (in their processing procedures) to translate from the old CMIP6 labels to the new labels? When MIPs leaders already familiar with the CMIP6 tables develop new lists of requested variables, won’t they have to consult a look-up table to see what the new table name is?

Yes. For me this isn't a dramatically difficult task, but the same may not be true of other groups. We refer to lookup tables anyway when constructing variable lists.

How will it be decided which “realm” a variable belongs to? Some variables, for example, play a critical role in both atmospheric chemistry and aerosols. (Does SO2 concentration belong in the chemistry or aerosol realm? What about radiation, which is important for dynamics, chemistry, and aerosols?) What realm do variables representing transfer processes from one realm to another belong to? (Does surface evaporation belong to the atmosphere, the ocean, the land, or what?) How do we know what table to put a variable in if there are no objective rules for defining its realm? (I have spent many hours trying to come up with a manageable set of rules for objectively determining a variable’s realm without success.) I don’t think we should continue to base our future handling of variables for MIPs on realms that are somewhat arbitrarily defined.

To be honest, I don't see users of the data struggling with this much. I would expect them to first search for the variable name and see what variables are defined (i.e. look for the appropriate spatial shape / frequency). When it comes to assigning a realm, my temptation would be to maintain consistency with other similar variables where possible. The realm is a search facet on ESGF, but I don't think it is one that I have used.

Is the new approach for defining CMOR tables scalable? Currently there are about 75 tables proposed, but I think that over time the number might easily double and potentially grow much larger. If I can count accurately, we have 9 “realms” (AC, AE, AP, OB, OP, LP, GI, Li, SI), 9 “frequencies” (subhr, hr, 3hr, 6hr, day, mon, yr, dec, fx), 4 “temporal sampling” options (avg, point, monClim, monDiurnal), and 3 “spatial” options (normal, zonal-mean, model-level). That means potentially 9x9x4x3=972 tables.

Yes, this could mean lots of tables. While I would prefer a smaller set, I don't necessarily see a problem with having a lot of tables as long as the logic is reasonably clear. I don't dig into the JSON tables themselves other than when debugging -- I tend to work with large searchable tables such as this when I want to investigate particular records.
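As a quick check of the upper bound quoted in the question, the combinations can simply be enumerated. This is only a sketch: the four lists are copied from the question above, and it assumes, purely for illustration, that every combination would get its own table, which is the worst case rather than anything that has been proposed.

```python
from itertools import product

# Option lists copied from the question above.
realms = ["AC", "AE", "AP", "OB", "OP", "LP", "GI", "Li", "SI"]
frequencies = ["subhr", "hr", "3hr", "6hr", "day", "mon", "yr", "dec", "fx"]
temporal_sampling = ["avg", "point", "monClim", "monDiurnal"]
spatial = ["normal", "zonal-mean", "model-level"]

# Worst case: one table per (realm, frequency, sampling, spatial) combination.
combinations = list(product(realms, frequencies, temporal_sampling, spatial))
print(len(combinations))  # 972
```

In practice many combinations would never be populated (for example, fixed fields have no temporal sampling), so the real count should sit well below this ceiling.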

When a variable appears in more than one table (e.g., air temperature, sampled at different frequencies), how will we ensure that its metadata (besides “frequency”) is identical across all the tables? The checking process would seem to be labor intensive. Moreover, if someone proposes the addition of a new variable (say, an atmospheric monthly mean field), they could have to check more than a dozen tables to make sure the variable doesn’t already appear in one of the other tables. This is necessary because if the same quantity already appears in one of the other tables, then certain metadata should be defined consistently with that of the existing variable. Again, checking this would be labor intensive.

This is something that needs looking at, but which can be done in slower time than initiating a CMIP6Plus MIP era. In the first instance we could use a unit test framework to confirm consistency -- this is relatively simple to set up. Alternatively we could migrate data that is common to a separate document, which would look a lot like the MIPVariable entries in the CMIP6 DR, but again, until we start on a CMOR4 version that works off different table structures, I don't see that this needs finalising.
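A minimal sketch of what such a consistency check might look like is below. Everything in it is an assumption made for illustration: the table names, the in-memory layout (table name mapped to variable root name mapped to attributes), and in particular the `MAY_DIFFER` set of attributes that are allowed to vary between tables, which would need to be agreed.

```python
from collections import defaultdict

# Attributes allowed to differ between tables (hypothetical choice).
MAY_DIFFER = {"frequency", "cell_methods", "dimensions"}


def find_inconsistencies(tables):
    """Report attributes whose values disagree between tables.

    `tables` maps table name -> {variable root name -> {attribute: value}}.
    Returns {variable: {attribute: {value: [table names]}}}.
    """
    seen = defaultdict(lambda: defaultdict(lambda: defaultdict(list)))
    for table_name, variables in tables.items():
        for var_name, attrs in variables.items():
            for attr, value in attrs.items():
                if attr not in MAY_DIFFER:
                    seen[var_name][attr][value].append(table_name)

    problems = {}
    for var_name, attrs in seen.items():
        # More than one distinct value for an attribute means a mismatch.
        bad = {attr: dict(values) for attr, values in attrs.items() if len(values) > 1}
        if bad:
            problems[var_name] = bad
    return problems


# Hypothetical two-table example in which "tas" has mismatched units.
tables = {
    "AP_mon": {"tas": {"units": "K", "standard_name": "air_temperature", "frequency": "mon"}},
    "AP_day": {"tas": {"units": "degC", "standard_name": "air_temperature", "frequency": "day"}},
}
problems = find_inconsistencies(tables)
print(problems)
```

Wrapped in a test function that asserts `find_inconsistencies(tables) == {}` for the real tables, this could run under any unit test framework on each change. Note that this sketch only compares attributes that are present; an attribute missing from one table is not flagged.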

New tables would prevent publication on ESGF of new data with CMIP6 data. Is that o.k.?

We could go back to CMIP6 tables, but then we are constrained in how we add new variables, and what do we do with other projects such as obs4MIPs/input4MIPs?

This is good information. Thanks, Matt, for your usual careful thinking on this. Given what you’ve said, I propose the following. (Sorry that the indentation isn't preserved from my original document.)

My understanding is that the requirements are:

To continue to organize variables in tables (and make the tables more consistent than in CMIP6)
Any modifications to the CMOR tables must be accompanied by a modified version of CMOR that can interpret them (and this must be implemented on a timescale of a month, or so).
When the same variable is used by multiple projects, the root name and accompanying metadata should all be identical. (The so-called harmonization of variables across projects.)
When the same quantity is sampled/reported in different ways, a certain subset of metadata must be consistent for all its variants.
It should not be too difficult to propose and add new variables. The information needed to create the required metadata should not be obscure. It should be easy to determine what table a variable belongs in.
Filenames within a project should be unique, so the names must include a way of distinguishing among closely related variables.

My understanding is that it is not important that the file names remain consistent with CMIP6.

I don’t think we need to harmonize tables across projects, just variables. I do think that the “root” name for a quantity should not be used to distinguish among variables of:

Different frequencies
Different temporal sampling requirements (e.g., “point” vs. “mean” vs. “climatology”)
Different spatial reporting requirements (e.g., the same variable reported as zonal means vs. full latxlon distribution, or reported on model vs. pressure levels, or reported on multiple different single levels (e.g., p100, p300, p850, etc.), or reported for a particular realm (e.g., where sea ice vs. where sea)).

If my understanding is correct, I suggest we proceed in two steps as follows:

Near term:

We should place all variables in a single master list, organized as in pages 16-21 of this document. I have already made it possible to automatically create this list using the 2062 variables in the CMIP6 requested output. We will need to add the variables requested by other projects (input4MIPs, obs4MIPs, CORDEX, etc.).
For CMIP6plus we should:

Create a set of tables (organized in some sensible way), but structure those tables not as CMOR tables but as “unformatted MIP tables” with just the variable branded names listed. Two options are to organize:
[ ] as you’ve proposed in the GitHub repository, or
[ ] in the same groupings as CMIP6 so the table names would be unchanged.
Decide how to group variables needed for input4MIPs, obs4MIPs, CORDEX, CMIP6plus, etc. Consider
[ ] Organizing the variables as in the previous step above and simply extending those tables.
[ ] Organizing variables differently in whatever groups seem appropriate for a particular project. The key is all the variables in a MIP’s group of branded variables would be drawn from the master list. This will mean that the variables are defined consistently across all projects.
For each project, create the CVs that CMOR and PrePARE rely on in their checks of metadata.
For each project, create the CMOR tables (JSON files readable by CMOR 3.7) by extracting for each branded variable in the unformatted MIP table the attributes recorded in the master list of MIP variables and recasting them in the CMOR format.
Decide how to define unique labels for each variable. Consider two options:
[ ] A compound construct of variable root plus table name.
[ ] A branded variable name.
Cease using tables as a search facet on ESGF. Instead, rely on the frequency, realm, temporal sampling (new facet; point, mean, or climatology), and vertical sampling (new facet; requested levels, model levels, or single level/no level).

Longer term:

The longer-term changes will depend somewhat on which of the near-term options are adopted.

If the unique branded variable labels are adopted, then future CMIP infrastructure can be built without consideration of how variables are grouped into tables. This will free constraints on CMOR tables while ensuring consistency across tables (because all variables will be drawn from the same master list). A MIP might group variables however they please (drawing exclusively from the master MIP table of variables). Modeling groups may also, if they wish, group the variables differently from the MIP groupings. A simple code could be constructed that, given a list of branded variable labels, would simply produce a CMOR-readable table with all the appropriate attributes.
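To make that last idea concrete, here is a rough sketch of such a code. The branded label format (`tas_mon-avg` and so on), the attribute names, and the layout of the master list are all placeholders rather than agreed conventions. The `Header` / `variable_entry` top-level keys mirror the layout of the existing CMIP6 CMOR tables, but a real table would need many more header fields and per-variable attributes than are shown here, and keying `variable_entry` by branded label is itself an assumption.

```python
import json

# Hypothetical master list keyed by branded variable label.
master_list = {
    "tas_mon-avg": {"out_name": "tas", "units": "K", "frequency": "mon",
                    "standard_name": "air_temperature"},
    "tas_day-avg": {"out_name": "tas", "units": "K", "frequency": "day",
                    "standard_name": "air_temperature"},
    "pr_mon-avg": {"out_name": "pr", "units": "kg m-2 s-1", "frequency": "mon",
                   "standard_name": "precipitation_flux"},
}


def build_table(table_id, labels, master):
    """Assemble a table-like dict for the requested branded labels."""
    missing = [label for label in labels if label not in master]
    if missing:
        # Every variable must be drawn from the master list.
        raise KeyError(f"not in master list: {missing}")
    return {
        "Header": {"table_id": table_id},
        # Copy each entry so later edits to the table do not alter the master list.
        "variable_entry": {label: dict(master[label]) for label in labels},
    }


table = build_table("my_group", ["tas_mon-avg", "pr_mon-avg"], master_list)
print(json.dumps(table, indent=2))
```

Because every entry is copied from the one master list, two groupings that include the same branded variable carry identical attributes by construction.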

It will be easy to implement the master list of branded variables because I have already done this, but not yet in the correct JSON file dictionary format. I estimate that someone familiar with Python could easily take my Excel spreadsheet, which has all the needed attributes defined, and construct the master list of MIP variables (which would only include on first pass the CMIP6 variables).
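A minimal sketch of that conversion, assuming the spreadsheet is first exported to CSV with one row per branded variable. The column names (`branded_name`, `out_name`, `units`, `frequency`) and the label format are hypothetical, since the actual layout of the spreadsheet is not described here.

```python
import csv
import io
import json


def master_list_from_csv(handle, key_column="branded_name"):
    """Build {branded name: {attribute: value}} from CSV rows."""
    master = {}
    for row in csv.DictReader(handle):
        label = row.pop(key_column)
        if label in master:
            # Branded names are meant to be unique labels.
            raise ValueError(f"duplicate branded name: {label}")
        master[label] = row
    return master


# Inline stand-in for the exported spreadsheet (hypothetical columns).
sample = io.StringIO(
    "branded_name,out_name,units,frequency\n"
    "tas_mon-avg,tas,K,mon\n"
    "pr_mon-avg,pr,kg m-2 s-1,mon\n"
)
master = master_list_from_csv(sample)
print(json.dumps(master, indent=2))
```

For a real file, `open(path, newline="")` would replace the `io.StringIO` stand-in, and the result could be written out with `json.dump`.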

Thanks @taylor13. As per usual, your deep thinking on this gives pause for thought.

Regarding the per project search facets, this is what we have per project in the old COG configuration.

MIP	facets
CMIP3	Variable, Model, Experiment, Realm, Institute, Time Frequency, Ensemble
CMIP5	Project (provides alternate projects, e.g. EUCLIPSE, GeoMIP, LUCID, PMIP3, TAMIP), Product, Institute, Model, Experiment, Experiment Family, Time Frequency, Realm, CMIP Table, Ensemble, Variable, Variable Long Name, CF Standard Name, Datanode
CMIP6	MIP Era, Activity, Product, Source ID, Institution ID, Source Type, Nominal Resolution, Experiment ID, Sub-Experiment, Variant Label, Grid Label, Table ID, Frequency, Realm, Variable, CF Standard Name, Data Node
input4MIPs	MIP Era, Target MIP, Institution ID, Source ID, Source Version, Dataset Category, Variable, Grid Label, Nominal Resolution, Frequency, Realm, Data Node, Status
obs4MIPs	Source ID, Product, Realm, Variable, Variable Long Name, CF Standard Name, Data Node, CMIP5-era: Institute, Time Frequency; CMIP6-era: Institution ID, Frequency, Grid Label, Nominal Resolution, Region, Source Type, Variant Label

Moving forward, the MetaGrid interface provides overview facets, which for the CMIP6 configuration lump identifiers together (e.g. Labels includes variant_label and grid_label):

Facet	Entries
General	Activity ID, Data Node
Identifiers	Source ID, Institution ID, Source Type, Experiment ID, Sub Experiment ID
Resolutions	Nominal Resolution
Labels	Variant Label, Grid Label
Classifications	Table ID, Frequency, Realm, Variable ID, CF Standard Name
Additional Properties	Version Type, Result Type, Version Date Range
Filename	filter by filename option

I am not familiar with how flexible this interface is. It would be useful to add the configuration information to the discussion.

Dropping some links down here, as they are relevant for the discussion:
MetOffice ARISE CMOR Tables
A Nomenclature Suitable for Uniquely Identifying CMIP Variables: Taylor et al. - page 16 defines the "json" format for a "master list"
Guidelines for Defining MIP Variables: Taylor et al.

I believe it would be useful to merge this discussion with that in #26.

PCMDI / mip-cmor-tables

Rationale for new CMOR tables #13