PCMDI / cmor

Climate Model Output Rewriter
BSD 3-Clause "New" or "Revised" License
53 stars 32 forks source link

Remove unused attributes when processing CMIP6Plus datasets #723

Closed mauzey1 closed 7 months ago

mauzey1 commented 10 months ago

As discussed in https://github.com/PCMDI/cmor/issues/718#issuecomment-1889697416

Some of these might be best as their own issues, but the small tweaks we've discussed are;

  • remove product field from output - currently defaults to model-output with the value at the top of a table overwriting this

  • avoid further_info_url automatic creation as this isn't going to be used beyond CMIP6 for now

  • (possibly) remove short_description from output -- it relies on mip_era from tables (now removed) and this field will do something strange when the CORDEX tables are picked up.

EDIT: The above attributes are to be removed when processing datasets for CMIP6Plus. CMIP6 datasets should retain these attributes.

To determine whether an attribute should be removed or not, we can check the CV file's required_global_attributes section. If attributes such as product and further_info_url are not present in required_global_attributes, then CMOR shouldn't add those attributes to the NetCDF.

mauzey1 commented 10 months ago

Through a combination of setting the default value of product in CMOR from model_output to an empty string and removing the product attribute from the headers of mip-cmor-tables I was able to remove the attribute from the netCDF file.

Is product an attribute that we plan to keep in the headers of mip-cmor-tables. If not, then removing it will make this part of the task easy. Otherwise, CMOR will need to not set the product value from the table header if the attribute is not in required_global_attributes. Do we want attributes that are not required by the CV to not be set even if the tables contain values for them?

taylor13 commented 10 months ago

product is meant to distinguish model_output from, for example, observations and forcing_dataset, so I think it should be included as a global attribute in the file and should be harvested by ESGF into its catalog. What would be the rationale for removing it?

mauzey1 commented 10 months ago

@taylor13 That would be a question for @matthew-mizielinski to answer.

mauzey1 commented 10 months ago

Are we all in agreement that further_info_url is an attribute should not be used beyond CMIP6?

taylor13 commented 10 months ago

o.k. by me.

durack1 commented 10 months ago

@mauzey1 yes, at least for CMIP6Plus, input4MIPs and obs4MIPs this is not supported by the ES-DOC/IS-ENES folks, so we need it removed as a required global attribute. For CMIP7, in a couple of years, it may reappear, but not for the immediate future

durack1 commented 10 months ago

product is meant to distinguish model_output from, for example, observations and forcing_dataset, so I think it should be included as a global attribute in the file and should be harvested by ESGF into its catalog. What would be the rationale for removing it?

Thanks for chiming in @taylor13. You make a very good point about having this to delineate between projects (or is it mip_era, or activity_id). We do need to figure out how to manage all this, if we had CVs that captured these options in the mip-cmor-tables (highest level "multi-verse") repo it would make more sense to me - so model-output, observations, forcing which was never defined in the CMIP6_CVs, in input4MIPs and obs4MIPs we have derived, observations, reanalysis

taylor13 commented 10 months ago

Yes, projects/activity_id/mip_era have sometimes been used to indicate something about "project", and under each project we can have multiple output types. (In CMIP5 we distinguished between output1 and output2, but folks found that confusing.). The terms should be clearly defined in CVs, but the allowed values would be project-specific.

mauzey1 commented 9 months ago

@durack1 @taylor13 Can somebody explain the reason behind this section of cmor_setGblAttr? https://github.com/PCMDI/cmor/blob/c2f7450cbbc20e9736fe26603e6b26dfbf66fd00/Src/cmor.c#L3049-L3056 The source_id and further_info_url check is done here if the further_info_url value of the dataset isn't an empty string. By default, CMOR sets the the value of further_info_url to a built-in template value. Unless the user explicitly sets further_info_url to be an empty string, these checks will happen. Shouldn't the source_id check happen regardless of what further_info_url is set to?

taylor13 commented 9 months ago

Yes, I think the check on source_id should always be done. Perhaps it is done elsewhere and then repeated here (for no reason I can think of) when there is a further_info_url defined. There's no reason to check it twice.

durack1 commented 9 months ago

I agree with @taylor13. We might have a further_info_url in the CMIP7 project, it depends if the ES-Doc folks manage to get supported in time for the specs to be defined. As it is, this would be a relatively trivial additional to the project CVs, but it would be useful to think about what structure we would need to put in place so that such attributes could be validated by CMOR in a generic way - i.e. a template that requires certain info to be defined in the CVs to check against

taylor13 commented 9 months ago

Yes, each new project (i.e., each CMIP phase, obs4MIPs phase, and input4MIPs phase, where phase implies a new set of required attributes) could "tell" CMOR which of its global attributes are required (and which of those are constrained by CVs or by templates), and then CMOR could check that the attribute had been defined (and that it conformed to the CV or template).

durack1 commented 9 months ago

@matthew-mizielinski @wolfiex can you both take a quick pass over the above, just to make sure no obvious issues are not being considered? @mauzey1 is nearing finalizing a CMOR 3.8.0 release, so wanted to double check whether we're good to start testing with the current main branches in CMIP6Plus_CVs and mip-cmor-tables.

mauzey1 commented 7 months ago

Marking this complete since the automatic creation of the further_info_url attribute has been removed, and the other attributes listed can be handled by the CV's required_global_attributes.