WCRP-CORDEX / cordex-cmip6-cv

Controlled Vocabulary (CV) for use in CORDEX
BSD 3-Clause "New" or "Revised" License

how to handle `required_global_attributes` depending on `activity_id` #19

Open gnikulin opened 11 months ago

gnikulin commented 11 months ago

Should we rename cordex to cordex-cmip6, both for the repository name and for all CVs ? Many CVs are defined only for CORDEX-CMIP6 and will be different in CORDEX-CMIP7.

sethmcg commented 11 months ago

I think it makes sense to rename it. CORDEX-CMIP6 is cumbersome, especially when said aloud. How about CORDEX6, in alignment with AR6 & CMIP6?

larsbuntemeyer commented 11 months ago

Actually, we discussed this in #5. I think there is a more-or-less established convention that the activity_id is the filename prefix of the tables (e.g., compare obs4MIPs, input4MIPs)...

gnikulin commented 11 months ago

> I think it makes sense to rename it. CORDEX-CMIP6 is cumbersome, especially when said aloud. How about CORDEX6, in alignment with AR6 & CMIP6?

It has been decided to use "CORDEX-CMIP6" for this activity. It is indeed a bit cumbersome, but it provides a good, clear description. From the CORDEX experiment design for dynamical downscaling of CMIP6 (https://cordex.org/wp-content/uploads/2021/05/CORDEX-CMIP6_exp_design_RCM.pdf):

> In addition to the continental-scale downscaling, addressed in this document, CORDEX includes many other components. For example, the Flagship Pilot Studies (FPS) and regional workshops for climate and VIA communities. CORDEX is a continuous activity that is not divided into phases (1st, 2nd, etc.) and not necessarily related to the CMIP cycles. The framework described in this document is simply referred to as CORDEX-CMIP6.

gnikulin commented 11 months ago

> actually, we discussed this in #5 . I think, there is some more or less convention to have the activity_id be the filename prefix of the tables (e.g., compare to obs4MIPs, input4MIPs)...

I'm not sure that there are build rules for the file name of the tables. Will the input4MIPs tables have the same names (without mip_era) as now for CMIP7? Other activities don't have their own table at all, e.g. ScenarioMIP etc.

CORDEX is not a CMIP6 project or an activity that contributes to CMIP6, so here we have more freedom to define what's best for CORDEX. Regarding activity_id, it was suggested that in CORDEX-CMIP6 activity_id is "an identifier of different CORDEX activities as dynamical downscaling, empirical-statistical downscaling, Flagship Pilot Studies and bias adjustment (e.g. “RCM”, “ESD”, “FPS”, “Adjust”)".

Currently we have CORDEX_source_id.json, which assumes only RCMs as a source. However, once we register ESD methods we will need to distinguish the ESD source table from the RCM one, which adds another level of complexity :-). Perhaps we may even need to add the CORDEX-CMIP6 activity_id to some CV file names, something like CORDEX-CMIP6_RCM_source_id.json, CORDEX-CMIP6_ESD_source_id.json, etc.?

sethmcg commented 11 months ago

@gnikulin - That makes sense. CORDEX-CMIP6 it is, then.

With regard to activity_id, I think we need to allow for additional cases. For example, one project I'm involved with aims to include some variable-resolution simulations in the mix for comparison with RCM downscaling. There are also efforts to train Machine Learning models to emulate RCMs. Those both will require expanding the CORDEX_source_id.json file, and in the case of ML methods, I think you have two sources: both the ML setup and the RCM it was trained to emulate. (Or possibly even multiple RCMs, if that proves feasible.)

gnikulin commented 11 months ago

I would include both limited area RCMs and VR-GCMs in the same "RCM" source_id file. There is the global attribute source_type which provides a short description of model configuration (e.g. “RCM”, “AGCM”, “RESM”, "VR-GCM", etc., all acronyms should be defined). This information can also be requested during the registration.

Regarding ML methods, I consider them a kind of ESD and suggest including them in the "ESD" source_id file. Information about the datasets (e.g. RCMs) used for training ML methods should be reflected in the metadata (global attributes), since it can differ for the same ML method (e.g. https://cordex.org/wp-content/uploads/2017/06/CORDEX_ESD_Experiment1.pdf). Here, it is necessary to get input from the ESD community.

Creating many CVs for specific cases may make the CORDEX data infrastructure too complex. I would vote for the simplest solution.

jesusff commented 11 months ago

Promoting specific values (RCM, ESD, ...) of the controlled vocabulary to the filenames seems to break the general build rules for these files. I see no problem in merging all source_id's under a single file, provided that we add the source_type to each source. Also, having different model_components or required_global_attributes depending on the source_type should not be a problem. It would be a matter of having a new CORDEX[-CMIP6?]_model_component.json listing the expected components (or global attributes, for the existing CORDEX_required_global_attributes.json) for each source_type. These two files (CORDEX_model_component.json (maybe in plural, components?) and CORDEX_required_global_attributes.json) would kind of define the source_type. Each new source_type created should add its defining components and attributes to those files.
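To make the idea of attributes depending on source_type concrete, here is a minimal Python sketch. It assumes a layout with a common list of required global attributes plus extra ones per source_type. The attribute lists, the dictionary layout and the function name `missing_attributes` are hypothetical and are not the content of CORDEX_required_global_attributes.json.

```python
# Hypothetical layout: a common list of required global attributes plus
# extra ones per source_type. All names and values here are illustrative
# assumptions, not the actual CV content.
REQUIRED_GLOBAL_ATTRIBUTES = {
    "common": ["activity_id", "domain_id", "source_id", "source_type"],
    "by_source_type": {
        "RCM": ["driving_source_id"],
        "ESD": ["training_dataset"],
    },
}


def missing_attributes(global_attrs, cv=REQUIRED_GLOBAL_ATTRIBUTES):
    """Return the required attributes absent from global_attrs, sorted."""
    required = set(cv["common"])
    source_type = global_attrs.get("source_type")
    # Unknown or missing source_type: only the common attributes apply.
    required.update(cv["by_source_type"].get(source_type, []))
    return sorted(required - set(global_attrs))


attrs = {
    "activity_id": "RCM",
    "domain_id": "EUR-12",
    "source_id": "SomeModel",
    "source_type": "RCM",
}
print(missing_attributes(attrs))
```

With this layout, registering a new source_type would only mean adding one entry under `by_source_type`; the common list stays untouched.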

(this thread has gone a bit off-topic from the original post)

larsbuntemeyer commented 11 months ago

Yes, I agree that it is sufficient to use source_type to distinguish the different types of downscaling methods (dynamical, statistical, ML), so that all types of downscaling sources can go into one source_id table.

For the attributes required to register a source_id (https://github.com/WCRP-CORDEX/cordex-cv/issues/4), I wouldn't make the model components a requirement but only the most basic ones, e.g., source_id, source, release_year, institution_id.

required global attributes

I am unsure about the required_global_attributes. Having different required_global_attributes depending on activity_id would require something like a new CV table for each activity (in the end, everything ends up in one CV table). In the past, the distinction was made through product and project_id (aka activity_id). I could imagine, e.g., ML and bias-adjustment models producing bias-adjusted or ml-adjusted output if they are based on the output of RCM (dynamical) models. Bias adjustment also had different variable names, e.g., tasAdjust instead of tas. There can still be additional attributes, of course (e.g., in ESD there was bias_adjustment), but I would not make them required global attributes.

A second option would be to have another set of tables for bias adjustment and bias adjust based on the common CV in this repo. For example, for bias adjustment there would be another repo of tables with the same filenames (CORDEX-CMIP6_CV.json, CORDEX-CMIP6_mon.json, etc...) but tailored for bias adjustment and, if necessary, with adjusted output variable names. I wonder how that was done in the past since, at least, the CORDEX-CMIP5 CMOR tables contain no hint of adjusted output. I guess it was done by adjusting those tables?

jesusff commented 11 months ago

I would say different activities (see also #20) would need to define their own CV and tables, based on these general ones. It should be a matter of degenerating the CORDEX_activity_id.json to a single value, reducing other elements (e.g. CORDEX_domain_id.json) and generally adapting the rest of the elements (including the required_global_attributes) and tables to the protocols of the particular initiatives. The question is also if here we want to encompass all activities or just focus on providing the example for the dynamical downscaling on continental domains (activity_id = RCM). Even for the continental-scale domains, different domains are managing different output variable lists; mainly based on the general CORDEX one, but removing (no problem) or adding some variables. So, even for the same activity_id a separate set of tables might be needed.
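A rough Python sketch of what "based on these general ones" could look like: an activity-specific CV derived from a general one by collapsing activity_id to a single value, reducing domain_id, and extending the required global attributes. The dictionary contents, the function name `derive_activity_cv` and the attribute `fps_name` are hypothetical placeholders, not the actual CV files.

```python
import copy

# Hypothetical general CV; keys mirror the CV elements named in the thread,
# values are illustrative placeholders only.
GENERAL_CV = {
    "activity_id": {
        "RCM": "dynamical downscaling",
        "ESD": "empirical-statistical downscaling",
        "FPS": "Flagship Pilot Studies",
    },
    "domain_id": {"EUR-12": "Europe", "AFR-22": "Africa"},
    "required_global_attributes": ["activity_id", "domain_id", "source_id"],
}


def derive_activity_cv(general_cv, activity_id, domain_ids, extra_required=()):
    """Return a copy of general_cv restricted to one activity and some domains."""
    cv = copy.deepcopy(general_cv)
    # Collapse activity_id to the single value used by this initiative.
    cv["activity_id"] = {activity_id: general_cv["activity_id"][activity_id]}
    # Keep only the domains the initiative works on.
    cv["domain_id"] = {d: general_cv["domain_id"][d] for d in domain_ids}
    # Append initiative-specific required attributes without duplicates.
    for attr in extra_required:
        if attr not in cv["required_global_attributes"]:
            cv["required_global_attributes"].append(attr)
    return cv


fps_cv = derive_activity_cv(GENERAL_CV, "FPS", ["EUR-12"], ["fps_name"])
print(fps_cv["activity_id"])
print(fps_cv["required_global_attributes"])
```

The deep copy keeps the general CV unchanged, so several initiatives could derive their own vocabularies from the same source.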

larsbuntemeyer commented 11 months ago

Yes, I agree with @jesusff: other activities might have different requirements for their vocabulary that we don't even know about yet. And from my experience, users will mess with the tables anyway. The important thing is to have a vocabulary that can be used for QA for ESGF publication, although we don't even have a checker yet :persevere: (PrePARE only works for CMIP6).

Thanks for opening #20 !

gnikulin commented 11 months ago

If we are back to the original post :-). Should we use CORDEX-CMIP6 for all tables and CVs instead of simply CORDEX ? My concern is that when we come to CORDEX-CMIP7 it's a bad practice to use the same file names for files with different content.

gnikulin commented 11 months ago

> Promoting specific values (RCM, ESD, ...) of the controlled vocabulary to the filenames seems to break the general build rules for these files. I see no problem in merging all source_id's under a single file, given that we add the source_type to each source. Also, having different model_components or required_global_attributes depending of the source_type should not be a problem. It would be a matter of having a new CORDEX[-CMIP6?]_model_component.json listing the expected components (or global attributes, for the existing CORDEX_required_global_attributes.json) for each source_type. These two files (CORDEX_model_component.json (maybe in plural, components?) and CORDEX_required_global_attributes.json) would kind of define the source_type. Each new source_type created should add its defining components and attributes to those files.
>
> (this thread has gone a bit off-topic from the original post)

OK, we can try to merge all source_id's under a single file and distinguish them by source_type.

sethmcg commented 11 months ago

> For the required attributes to register a source_id (#4) i wouldn't make the model components a requirement but only the most basic ones, e.g., source_id, source, release_year, institution_id.

Does the source_id identify the model / method used to perform the downscaling? If so, I'm not sure that release_year and institution_id are well-defined for methods that aren't RCMs. For example, what would they be for the (simplistic but still widely-used) ESD method of interpolation + bias-correction?

> Should we use CORDEX-CMIP6 for all tables and CVs instead of simply CORDEX?

CORDEX-CMIP6 makes sense for exactly the reason you give, that we don't want ambiguity if/when we do this again later on.

larsbuntemeyer commented 11 months ago

> If we are back to the original post :-). Should we use CORDEX-CMIP6 for all tables and CVs instead of simply CORDEX ? My concern is that when we come to CORDEX-CMIP7 it's a bad practice to use the same file names for files with different content.

OK, agreed, I'll rename them!

gnikulin commented 10 months ago

> Second option would be to have another set of tables for bias adjustment and bias adjust based on the common CV in this repo. For example, for bias adjustment there would another repo of tables with the same filenames (CORDEX-CMIP6_CV.json, CORDEX-CMIP6_mon.json, etc...) but tailored for bias adjustment and if necessary adjusted output variable names. I wonder how that was done in the past since, at least, the cordex cmip5 cmor tables contain no hint on adjusted output. I guess, it was done by adjusting those tables?

Regarding bias-adjusted variables, all modifications of their acronyms and long names are very simple and are described in the DRS for bias-adjusted CORDEX simulations (http://is-enes-data.github.io/CORDEX_adjust_drs.pdf):

Variable names are modified by appending Adjust to the variable name DRS element in file names and in NetCDF files: pr -> prAdjust, tas -> tasAdjust.

Long names (the long_name NetCDF attribute) also have to be modified by adding Bias-Adjusted in front of the long name: Near-Surface Air Temperature -> Bias-Adjusted Near-Surface Air Temperature.
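The two renaming rules are simple enough to express in a few lines of Python. This is only a sketch: the function name `bias_adjusted_entry` is hypothetical, and the entry layout with `out_name` and `long_name` keys is an assumption modelled on CMOR-style variable tables.

```python
def bias_adjusted_entry(name, entry):
    """Return (new_name, new_entry) for the bias-adjusted version of a variable.

    Appends "Adjust" to the variable name and prefixes the long name with
    "Bias-Adjusted ", following the two rules quoted above. The entry is
    assumed to be a dict with "out_name" and "long_name" keys.
    """
    new_entry = dict(entry)  # shallow copy; the original entry is untouched
    new_entry["out_name"] = f"{entry['out_name']}Adjust"
    new_entry["long_name"] = f"Bias-Adjusted {entry['long_name']}"
    return f"{name}Adjust", new_entry


tas = {"out_name": "tas", "long_name": "Near-Surface Air Temperature"}
new_name, new_entry = bias_adjusted_entry("tas", tas)
print(new_name, "|", new_entry["long_name"])
```

Applied over an existing table, such a helper would generate the adjusted entries without maintaining a separate set of hand-edited tables.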

There were no specific CORDEX-CMIP5 CMOR tables for bias-adjustment.

gnikulin commented 10 months ago

> I would say different activities (see also #20) would need to define their own CV and tables, based on these general ones. It should be a matter of degenerating the CORDEX_activity_id.json to a single value, reducing other elements (e.g. CORDEX_domain_id.json) and generally adapting the rest of the elements (including the required_global_attributes) and tables to the protocols of the particular initiatives. The question is also if here we want to encompass all activities or just focus on providing the example for the dynamical downscaling on continental domains (activity_id = RCM). Even for the continental-scale domains, different domains are managing different output variable lists; mainly based on the general CORDEX one, but removing (no problem) or adding some variables. So, even for the same activity_id a separate set of tables might be needed.

Actually, there is no need to create new CMOR tables for different domains if their output variable lists differ. All variables should be included in the CORDEX-CMIP6 CMOR tables, and each domain post-processes only a subset of them.

larsbuntemeyer commented 5 months ago

I agree, so maybe we can also set up a simple registration process for the data-request table (just give the variable, frequency and some basic details) from which the tables are updated. It would be much nicer, since converting from Google spreadsheets is a pain.

gnikulin commented 5 months ago

Yes, it's a good idea. The atmospheric variable spreadsheet was the first human-readable step to discuss which variables should be archived. Adding new variables can indeed be done more efficiently with a registration process for the data request table. There is an ocean variable list (https://doi.org/10.5281/zenodo.8207553), again a spreadsheet :-), that should be included.

larsbuntemeyer commented 5 months ago

OK, I can add the ocean variables. I still have the script that can handle the spreadsheets, and the format seems to be similar...

gnikulin commented 5 months ago

The format should be the same, since the atmospheric variable spreadsheet was used as a template. The ocean list published on Zenodo is a PDF, but there is a Google spreadsheet as well.

gnikulin commented 5 months ago

There are also lists of aerosol variables (https://doi.org/10.5281/zenodo.7112860) and river variables (https://doi.org/10.5281/zenodo.7112673) that should be checked to avoid duplication.