PCMDI / cmor

Climate Model Output Rewriter
BSD 3-Clause "New" or "Revised" License
51 stars 33 forks source link

Cleaning up CMOR input file vs registered content sources #628

Closed durack1 closed 2 years ago

durack1 commented 3 years ago

@mauzey1 I have been trying to simplify my CMOR_input.json file along with the details (that are currently duplicated) in the registered content, found in the input4MIPs_CV.json file.

My aspiration, was that all information that is contained in the registered content (input4MIPs_CV.json) would be the primary source, and only additional/runtime info e.g.

Screen Shot 2021-09-02 at 1 28 28 PM link

Would be included in the CMOR_input.json file, whereas all registered content would be used by default, and if there was an additional definition of the variable with the same name in the CMOR_input.json then this would overwrite the input4MIPs_CV.json sourced entry.

Can you comment on this?

@taylor13 ping

durack1 commented 3 years ago

@mauzey1 specfically I have attempted to remove the variables: calendar, contact, dataset_category, further_info_url, grid_label, nominal_resolution, region and source from the CMOR_input.json file and CMOR 3.6.0 complains, even though these are read from the PCMDI-AMIP-1-2-0 source_id registered in input4MIPs_CV.json

durack1 commented 3 years ago

I vaguely remember the code dealt with this in a fairly clumsy way, but for a user it would seem that the information being fed to CMOR is somewhat arbitrary, so some variables can existing in the registered content (input4MIPs_CV.json) alone, whereas other variables MUST exist in the input file (CMOR_input.json) even if these are duplicated

durack1 commented 3 years ago

I've just stumbled across an interesting situation where I have removed my frequency from the CMOR_input.json, added this to the registered content (input4MIPs_CV.json) and now CMOR ignores the value of "frequency":"fx", in the table file that is being called to write data, this seems like an incorrect behavior - in the case of a variable defined in a table (in this case input4MIPs_Ofx.json) the priority should always be for the table value.

So in order, information (variables defined) should be overwritten from highest to lowest priority (so highest priority data sources are on the left, to the lowest on right [input4MIPs_CV.json]): input4MIPs_Ofx.json/table input <-| CMOR_input.json/user input <-| input4MIPs_CV.json/registered content

added: It seems neither of these commands ensure that the right frequency information is loaded

    cmor.set_cur_dataset_attribute('frequency', 'fx')  # <-- not effective
    table = 'input4MIPs_Ofx.json'  # <-- doesn't overwrite source_id:frequency value
mauzey1 commented 3 years ago

@durack1

So the CV file will be the default for input values unless set differently in the input file, but the variable's table entry will always precedence?

Should this behavior also apply to the CMIP6 CV and variable tables?

Where else is the calendar parameter defined aside CMOR_input.json? I don't see it anywhere else. Does it default to gregorian?

durack1 commented 3 years ago

@mauzey1 yeah sorry, I haven't pushed all my latest versions to github, (have now done and https://github.com/PCMDI/cmor/issues/628#issue-987134993 links now point to the branch) - the latest version on disk of the CV file does contain calendar. Regarding defaults, I presume that gregorian would be a reasonable default, but to be honest, considering CMOR is most often used to write model data, it may be something that we really don't want to assume and define default values

durack1 commented 3 years ago

@mauzey1 if you wanted me to hand over a test env (including data, table files, input json etc), happy to provide you a directory(ies) to point to

mauzey1 commented 3 years ago

@durack1 Yes, the test environment would be helpful to have. Thank you.

durack1 commented 3 years ago

@mauzey1 do you have access to detect, or which machine would you prefer? As we now have all our home co-located, this should be as easy as divulging a path and we're off - can you see /home/[me]? and /work/[me]

mauzey1 commented 3 years ago

I don't have access to detect. Could you put the env in a zip file and and email it to me?

taylor13 commented 3 years ago

Regarding https://github.com/PCMDI/cmor/issues/628#issuecomment-912127523, I favor not providing a default value for calendar for CMIP. We would like to force the modelers to tell us what their calendar actually is (and many do not use gregorian. [I guess an error should be raised if they fail to supply calendar.

matthew-mizielinski commented 3 years ago

Regarding #628 (comment), I favor not providing a default value for calendar for CMIP. We would like to force the modelers to tell us what their calendar actually is (and many do not use gregorian. [I guess an error should be raised if they fail to supply calendar.

Absolutely -- I'm still getting queries from users who don't quite understand non-gregorian calendars when we have got this correct in our data. Providing a default seems a little risky to me.

durack1 commented 3 years ago

@mauzey1 attached is a standalone demo, this should get us cooking with code that can elucidate the input data source priorities.

The input4MIPs-cmor-table symlink, will need to be cloned from https://github.com/PCMDI/input4MIPs-cmor-tables (my current working branch, that I used to test the archive below is issue87_durack1_PCMDI-AMIP-1-2-0AndPy3Update)

Let me know if this is working for you, I've been testing against CMOR 3.6.0. python sanitizeTest.py should get you running, the env will need cdms2, cmor, and pytz installed in addition to standard python libraries

210909_durack1_CMORTesting-issue628.zip

durack1 commented 3 years ago

Depends on https://github.com/PCMDI/cmor/issues/629

durack1 commented 2 years ago

Hi @mauzey1 let me know when you have time to start working on this, and I'll commit to pulling down a pre-release for testing

mauzey1 commented 2 years ago

@durack1 I will try to get this feature ready for testing sometime this week. I will keep you informed.

durack1 commented 2 years ago

@mauzey1 perfect!

durack1 commented 2 years ago

@mauzey1 I am starting to get close to having final data for writing, is there a demo/nightly 3.7.0 ready to roll yet?

mauzey1 commented 2 years ago

@durack1 I have so far made changes to cmor_load_table that will load the following the attributes with default values from the CV file.

I have a question about the region attribute. The region value is in a JSON list rather than just a string. Does this mean there could be multiple regions for a source_id?

I have so far only tested this change using the input4MIPS test but I could make a CMIP6 test for CircleCI testing.

I plan to integrate these changes before the end of this week.

durack1 commented 2 years ago

@mauzey1 I am wondering whether we need to talk about this, as I am a little confused by the questions above.

The issue that I was trying to highlight is that the source (or the information source) of the information is what needs prioritizing, such that we load information in order, and overwrite existing variables as we go.

So for example, if contact = bob@thebuilder.com was found in the input4MIPs_CV.json/registered content file this would be loaded into memory. If the CMOR_input.json/user input file then contained the same variable with a different value contact = bobby@thebuilder.com, then a warning would be thrown and the existing instance of contact would be updated by the higher priority user input.

In the case that a variable was defined in the table input4MIPs_Ofx.json/table input, then this would overwrite the existing variable (in the example above contact (which is not included in the table definitions, but another variable, e.g. frequency would be for example).

So the intention, is to gather as much information from the registered content/CVs, so that a user can provide only limited information (not registered in the CVs) in the user input, BUT also importantly allow them to overwrite information at runtime.

Is this clear? I hope so

durack1 commented 2 years ago

@mauzey1 regarding the specific region question, for input4MIPs we have no cases of multiple region values (see here), but this is not true for obs4MIPs for e.g. (see here)

mauzey1 commented 2 years ago

@durack1

Okay. So we initialize the attributes of a variable using the CV/registered content, but if the user input has that attribute we overwrite and throw a warning? I think I have the part that sets the attributes with the CV/registered content already done. I would just need to make something that throws the warning if the value is overwritten by the user input. Is the list of attributes I posted that apply for this correct?

So the tables would overwrite user input? Should that also throw a warning about overwriting?

durack1 commented 2 years ago

@mauzey1 do we need to define a list or variables/attributes? The information will be slightly different depending on the project, so CMIP6/ vs input vs obs4MIPs etc. If it is possible, a general if variable exists, is overwritten, throw warning

The tables should always be the definitive source of information, so if the table file defines it, then it overwrites all else regardless of the source (and throws a warning to alert the user they are doing bad things). At this point, it may be worth throwing an error and aborting, rather than a warning and continuing - @taylor13 @matthew-mizielinski any perspectives to share here?

mauzey1 commented 2 years ago

@durack1

Okay. I think what we could is go through the list of required_global_attributes to see what is required that is listed in the source_id in the CV. This would eliminate the need to list the attributes since they are already listed in the CV.

Regarding the multiple region values in https://github.com/PCMDI/cmor/issues/628#issuecomment-960127709, would we store a list of regions in this attribute or just pick one? Is there an example of multiple regions being stored in a file? The rest of the attributes are easy since they are just one string but I wanted to know how to handle the list-of-strings case.

durack1 commented 2 years ago

@mauzey1 the above 1 (go through the required_global_attributes) is exactly the right approach, however, if there are additional entries that are "optional" in the source_id then these should be loaded too, and written to the file in the case that no other instances exist in the higher priority input sources (input json, and table json).

Regarding region in obs4MIPs we do have the regions identified in PCMDI/obs4MIPs-cmor-tables/obs4MIPs_region.json, and if this was included in other table/CV repos then the same listing of valid entries would be necessary (and then propagate into the **PROJECT**_CV.json file

mauzey1 commented 2 years ago

In that case, should we just go through entire list of attributes for a source_id entry to include both required and optional attributes? I'm guessing we can skip the source_variables attribute since it's just there to list which variables are part of the source, right?

I don't think that really answers my question. I was asking if we should write a list of regions into a file's region global attribute if we encounter multiple regions listed for a region attribute in the source_id entry of a CV file.

durack1 commented 2 years ago

In that case, should we just go through entire list of attributes for a source_id entry to include both required and optional attributes? I'm guessing we can skip the source_variables attribute since it's just there to list which variables are part of the source, right?

Exactly, I would recommend that we include all the registered/CV entries, including source_variables because they are in the CV. As an aside, is there a way to overwrite/remove such a variable in the cmor_input.json - so a way to clobber the CV/registered values at run time? That would be useful. Would setting these to None be a way to do it?

I don't think that really answers my question. I was asking if we should write a list of regions into a file's region global attribute if we encounter multiple regions listed for a region attribute in the source_id entry of a CV file.

That is a good question. In this case, I would write whatever is provided, so in the case that the CV has region specified, but the cmor_input.json doesn't, then I would write the CV attribute. For most users, they could then control this attribute by overwriting the value in the cmor_input.json file, clobbering the CV sourced value

Does that answer the remaining questions?

taylor13 commented 2 years ago

If I understand it, you plan to write as global attributes everything that is in the source_id section of the input4MIPs_CV.json file. That might be o.k. for input4MIPs, but for CMIP6, we don't want CMOR to write most of the information in the source_id section of the cmip6_CV.json file. Isn't that right @durack1?

durack1 commented 2 years ago

@taylor13 exactly. For CMIP6, the information that is contained in the PCMDI/cmip6-cmor-tables/CMIP6_CV.json/source_id entry is curated, so that it is a subset of the info in the CMIP6_CVs, so for e.g. ACCESS-CM2:

https://github.com/PCMDI/cmip6-cmor-tables/blob/f35e08541ed75201e0063ddec58daa361b9dfa5e/Tables/CMIP6_CV.json#L140-L158

vs the original/registered info (in the WCRP-CMIP/CMIP6_CVs/CMIP6_source_id.json):

https://github.com/WCRP-CMIP/CMIP6_CVs/blob/7fb62c418c8323bbda6cc327bbd1523e8fd810f5/CMIP6_source_id.json#L52-L106

taylor13 commented 2 years ago

Got it. But I think we should then filter out "cohort". We don't want CMOR to write that as a global attribute because its status will change over time (e.g., from "registered" to something else (can't remember) once all the DECK experiments have been completed.

durack1 commented 2 years ago

Got it. But I think we should then filter out "cohort". We don't want CMOR to write that as a global attribute because its status will change over time (e.g., from "registered" to something else (can't remember) once all the DECK experiments have been completed.

Agreed, I wonder if we need to create a new excluded_global_attributes which means we can control such things in the CMIP6_CV.json file, we do have the required_global_attributes

To be honest, we may just want to remove cohort from the entry, but having said that, it's already part of most of the files that have used the default CMOR3 configuration - too late to change?

taylor13 commented 2 years ago

Is it really true that CMOR currently writes all global attributes defined in CMIP6_CV.json? I thought it only wrote the required ones. In any case I think we definitely do not want to include "cohort" in future files written by CMOR. The fact that current files might have flaws doesn't mean we don't want to eliminate those in future files.

durack1 commented 2 years ago

I'll take a look at the ACCESS-* output later today, I'm fairly sure they're using the CMOR3 pipeline with minimum edits, my guess is yes, unless an exclusion was made in CMOR many moons ago

durack1 commented 2 years ago

Looks like you did manage to get it excluded, at least in this example:

(base) -bash-4.2$ ncdump -h ~CMIP6/CMIP/CSIRO/ACCESS-ESM1-5/historical/r1i1p1f1/Omon/tos/gn/v20191115/tos_Omon_ACCESS-ESM1-5_historical_r1i1p1f1_gn_185001-201412.nc 
netcdf tos_Omon_ACCESS-ESM1-5_historical_r1i1p1f1_gn_185001-201412 {
dimensions:
    time = UNLIMITED ; // (1980 currently)
    j = 300 ;
    i = 360 ;
    bnds = 2 ;
    vertices = 4 ;
...

// global attributes:
        :Conventions = "CF-1.7 CMIP-6.2" ;
        :activity_id = "CMIP" ;
        :branch_method = "standard" ;
        :branch_time_in_child = 0. ;
        :branch_time_in_parent = 21915. ;
        :creation_date = "2019-11-15T15:21:42Z" ;
        :data_specs_version = "01.00.30" ;
        :experiment = "all-forcing simulation of the recent past" ;
        :experiment_id = "historical" ;
        :external_variables = "areacello" ;
        :forcing_index = 1 ;
        :frequency = "mon" ;
        :further_info_url = "https://furtherinfo.es-doc.org/CMIP6.CSIRO.ACCESS-ESM1-5.historical.none.r1i1p1f1" ;
        :grid = "native atmosphere N96 grid (145x192 latxlon)" ;
        :grid_label = "gn" ;
        :history = "2019-11-15T15:21:42Z ; CMOR rewrote data to be consistent with CMIP6, CF-1.7 CMIP-6.2 and CF standards." ;
        :initialization_index = 1 ;
        :institution = "Commonwealth Scientific and Industrial Research Organisation, Aspendale, Victoria 3195, Australia" ;
        :institution_id = "CSIRO" ;
        :mip_era = "CMIP6" ;
        :nominal_resolution = "250 km" ;
        :notes = "Exp: ESM-historical; Local ID: HI-05; Variable: tos ([\'sst\'])" ;
        :parent_activity_id = "CMIP" ;
        :parent_experiment_id = "piControl" ;
        :parent_mip_era = "CMIP6" ;
        :parent_source_id = "ACCESS-ESM1-5" ;
        :parent_time_units = "days since 0101-1-1" ;
        :parent_variant_label = "r1i1p1f1" ;
        :physics_index = 1 ;
        :product = "model-output" ;
        :realization_index = 1 ;
        :realm = "ocean" ;
        :run_variant = "forcing: GHG, Oz, SA, Sl, Vl, BC, OC, (GHG = CO2, N2O, CH4, CFC11, CFC12, CFC113, HCFC22, HFC125, HFC134a)" ;
        :source = "ACCESS-ESM1.5 (2019): \n",
            "aerosol: CLASSIC (v1.0)\n",
            "atmos: HadGAM2 (r1.1, N96; 192 x 145 longitude/latitude; 38 levels; top level 39255 m)\n",
            "atmosChem: none\n",
            "land: CABLE2.4\n",
            "landIce: none\n",
            "ocean: ACCESS-OM2 (MOM5, tripolar primarily 1deg; 360 x 300 longitude/latitude; 50 levels; top grid cell 0-10 m)\n",
            "ocnBgchem: WOMBAT (same grid as ocean)\n",
            "seaIce: CICE4.1 (same grid as ocean)" ;
        :source_id = "ACCESS-ESM1-5" ;
        :source_type = "AOGCM" ;
        :sub_experiment = "none" ;
        :sub_experiment_id = "none" ;
        :table_id = "Omon" ;
        :table_info = "Creation Date:(30 April 2019) MD5:e14f55f257cceafb2523e41244962371" ;
        :title = "ACCESS-ESM1-5 output prepared for CMIP6" ;
        :variable_id = "tos" ;
        :variant_label = "r1i1p1f1" ;
        :version = "v20191115" ;
        :cmor_version = "3.4.0" ;
        :_NCProperties = "version=2,netcdf=4.6.2,hdf5=1.10.5" ;
        :tracking_id = "hdl:21.14100/02850fcc-be64-40de-b7ca-9b8aa6e688a0" ;
        :license = "CMIP6 model data produced by CSIRO is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License (https://creativecommons.org/licenses/).  Consult https://pcmdi.llnl.gov/CMIP6/TermsOfUse for terms of use governing CMIP6 output, including citation requirements and proper acknowledgment.  Further information about this data, including some limitations, can be found via the further_info_url (recorded as a global attribute in this file).  The data producers and data providers make no warranty, either express or implied, including, but not limited to, warranties of merchantability and fitness for a particular purpose. All liabilities arising from the supply of the information (including any liability arising in negligence) are excluded to the fullest extent permitted by law." ;
}

We'll have to figure out where the exclusion is hard-coded, might be something else to clean up.

mauzey1 commented 2 years ago

@durack1

So far, I have made it so that the attributes in the source_id entry of the CV are used in the global attributes of the dataset unless the attributes are defined in the user input file. However, this lead to some issues with the attributes in the entries that were lists of strings such as region and source_variables. A part of CMOR will check if the required attributes' values are listed in the CV like the region values in input4MIPs_CV.json.

When I tried to create a string value that just all the region values listed in the source_id entry, CMOR treated it like as an error. I could just use the first value since the region attribute in input4MIPs variables just have one listed. However, there could be multiple values listed in other CVs. The source_variables attribute would be ignored since it is not considered a required attribute in the CV. I could make a special case for attributes like region where it checks if the value is either listed in the list, or is a list as defined in the source_id entry. However, that might run into an issue if the string exceeds the current string maximum length of 1024.

As for using the variable table values to overwrite the user input, would that be just for attributes listed in the table header or variable entry that are also listed in the required_global_attributes of the CV?

durack1 commented 2 years ago

@mauzey1 this sounds like progress, I am doubtful that region would exceed 1024 characters, but you never know!

Regarding the table values, we don't want any of this information to be overwritten, so in the case of Omon | tos:

{
    "Header":{
        "Conventions":"CF-1.7 CMIP-6.2",
        "approx_interval":"30.00000",
        "cmor_version":"3.5",
        "data_specs_version":"01.00.32",
        "generic_levels":"olevel olevhalf",
        "int_missing_value":"-999",
        "mip_era":"CMIP6",
        "missing_value":"1e20",
        "product":"input4MIPs",
        "realm":"ocean",
        "table_date":"14 September 2020",
        "table_id":"Table input4MIPs_Omon"
    },
    "variable_entry":{
        "tos":{
            "cell_measures":"area: areacello",
            "cell_methods":"time: mean",
            "comment":"",
            "dimensions":"longitude latitude time",
            "frequency":"mon",
            "long_name":"Sea Surface Temperature",
            "modeling_realm":"ocean",
            "ok_max_mean_abs":"",
            "ok_min_mean_abs":"",
            "out_name":"tos",
            "positive":"",
            "standard_name":"sea_surface_temperature",
            "type":"real",
            "units":"degC",
            "valid_max":"",
            "valid_min":""
        },

We want to ensure that all data using these tables conforms to this information verbatim, and no user-provided input can overwrite any of these details, throwing an error (and exiting) if this is attempted

mauzey1 commented 2 years ago

@durack1 Do you have any examples of the attribute values from the variable table being changed by the user input or via the API? I tried to change the Conventions attribute in the user input file for the input4MIPs example program but it doesn't change the global attribute value in the NetCDF file.

cmor_input.json

    "institution_id":               "PCMDI",
    "source_id":                    "PCMDI-AMIP-1-2-0",
    "Conventions":                  "CF-XXXXXXXXXX",

osbcs_input4MIPs_SSTsAndSeaIce_CMIP_PCMDI-AMIP-1-2-0_gn_187001-187512.nc

// global attributes:
                :Conventions = "CF-1.7 CMIP-6.2" ;
durack1 commented 2 years ago

@mauzey1 apologies for the tardy reply.

The code in https://github.com/PCMDI/amipbcs/blob/issue23_durack1_cleanupCMORinput/sanitize.py#L442 was how I was testing, and the last time I tested it overwrote the table loaded value (I think)

So the change noted above was changing a variable that had already been loaded from a json input file

P.S. We have just managed to finalize the data that I was hoping to write with CMOR3.7.0, so it'd be great to push on this issue and get it solved in a nightly CMOR version for testing

mauzey1 commented 2 years ago

@durack1

I'm still having issues with changing the attributes with either the user input file or the API. I tried inserting the line cmor.set_cur_dataset_attribute('frequency', 'fx') in the sanitizeTest.py file you provided previously.

# Force local file attribute as history
cmor.set_cur_dataset_attribute('history', history)
cmor.set_cur_dataset_attribute('frequency', 'fx')
table = 'input4MIPs_Omon.json'
cmor.load_table(table)

It doesn't seem to change anything since the frequency global attribute is still set to mon in the output file.

durack1 commented 2 years ago

@mauzey1 is the order of the inputs set in the code somewhere? That might be a way to understand what is defined and redefined/overwritten etc

mauzey1 commented 2 years ago

@durack1 So I think I have figured out where and how attributes get set/overwritten. Although you can set attributes in the user input or inside the code with the CMOR API, the function cmor_CV_checkGblAttributes, that is called in cmor_setGblAttr from cmor_write, will set the required global attributes to those found in the source_id entry of the CV.

When cmor_CV_checkGblAttributes is called, it will loop through all attributes listed in required_global_attributes of the CV file to check if the attributes are set. If they are not set, then it will throw an error. If set, then it will run the function cmor_CV_ValidateAttribute. cmor_CV_ValidateAttribute will either check if an attributes value is in a list in the CV, or populate the other attributes if the checked attribute is the key to another set of attributes in the CV file i.e. source_id. However, it won't set values for attributes that are list such as region.

I think the process for adding registered content should go as follows. At the beginning when the CV file is loaded, CMOR will loop through all of the attributes listed in the source_id entry. Afterwards, it will allow for the attributes to be modified by either the user input file or CMOR API calls. When CMOR reaches the cmor_CV_ValidateAttribute function, it will check whether there is a value already set for an attribute before attempting to set it. If the value differs from that of the CV's registered content, then a warning will be thrown indicating that the user has overwritten the registered content of the attribute.

As for changing the attributes that are listed in the header of the variable's table file, they seem to be unable to change no matter where I call set_cur_dataset_attribute in the sanitizeTest.py program. I tried to change the value of realm in a few places in the code but it still stays the same value as listed in the table header when written to file. It already seems pretty hard for a user to alter the table attributes via CMOR.

durack1 commented 2 years ago

@mauzey1 apologies for the tardy reply. What you outline above sounds perfect, I'll ping @taylor13 here just for a sanity check.

I believe my test was attempting to overwrite the frequency attribute after loading the table file, and before writing the data, but you've not been able to recreate my experience, so I should double check once we have a nightly version to test

taylor13 commented 2 years ago

@mauzey1 @durack1 : Re

I think the process for adding registered content should go as follows. At the beginning when 
the CV file is loaded, CMOR will loop through all of the attributes listed in the source_id entry. 
Afterwards, it will allow for the attributes to be modified by either the user input file or CMOR 
API calls. When CMOR reaches the cmor_CV_ValidateAttribute function, it will check whether 
there is a value already set for an attribute before attempting to set it. If the value differs from 
that of the CV's registered content, then a warning will be thrown indicating that the user has 
overwritten the registered content of the attribute.

I think this is about right, but not quite. Here is how I see it: 1) I think CMOR might only care about a subset of the information under each source_id entry in the CV file. (For example, CMOR might not care about "cohort".) 2) For most, if not all, attributes in the source_id CV, the user should not be permitted to make any changes when running CMOR. The CV is supposed to provide the reference information for the source. If something is incorrect in the CV, then it should be changed in the source_id CV itself prior to running CMOR. 3) For most, if not all, attributes in the CMOR tables, the user should not be able to make any changes when running CMOR. Again these tables represent the definitive source of information pertaining to a variable. 4) If a user's cmor input file or a CMOR API call attempts to change a value obtained from the source_id CV or from a cmor table, a fatal error should be raised.

Do either of you agree with me?

I also have a somewhat unrelated question: Does CMOR record activity_id based on the experiment_id CV table, which contains the activities associated with an experiment? If so, I don't think the user should be able to override the value in the CMIP6_input.json input table to cmor.

taylor13 commented 2 years ago

point 3 above is not quite right. The global attributes and most of the variable attributes set by the CMOR tables should not be overridden. There is some variable-specific information that we should allow to be overridden (e.g., the value of a singleton dimension, like the height where a surface variable is measured).

mauzey1 commented 2 years ago

@durack1 @taylor13

Okay. So instead of just displaying a warning for having a value for an attribute differing from that found in the source_id entry of the CV file, we should be throwing an error?

For the values in the source_id entry that are in arrays (values in brackets such as region or source_variables), do we just use those to validate the input? Or do we attempt to set attributes with the values in those arrays, either using the first value in the array or the entire array as a list?

wachsylon commented 2 years ago

One additional comment here for further_info_url (if not already treated). We will use CMOR3 for other projects and do not want to create further_info_urls. However, I think that it is not possible to remove it at all as it is created and checked on the same level as e.g. output_file_template here with this. I do not understand why this attribute is of such importance that it must be created when CMOR is applied so I would prefer to remove that from both the header and source.

mauzey1 commented 2 years ago

So far, I have made CMOR display warnings when it detects that a global attribute deviates from what is listed in the source_id entry of the CV file. This is assuming that the value in the entry is a single string and not an array. Below is what happens when the attribute product is set to derived in the user input file of the input4MIPs test.

varName: tosbcs units: degC axis_ids: [0, 1, 2]

C Traceback:
In function: _CV_ValidateAttribute
! called from: _CV_checkGblAttributes
! called from: cmor_setGblAttr
! called from: cmor_write
! 

!!!!!!!!!!!!!!!!!!!!!!!!!
!
! Warning: The registered CV attribute "product" as defined as "observations" will be replaced with 
! "derived" as defined in your user input file
! 
!
!!!!!!!!!!!!!!!!!!!!!!!!!

! ------
! CMOR is now closed.
! ------
! During execution we encountered:
!   1 Warning(s)
!   0 Error(s)
! ------
! Please review them.
! ------
! 

This results in a NetCDF file where the global attribute product is set to derived. If I remove product from the user input file, then it would default to the value observations. However, the value listed in the table's header for product is input4MIPs. https://github.com/PCMDI/input4MIPs-cmor-tables/blob/17163f642582a4794551703158c03e9f2d478472/Tables/input4MIPs_Omon.json#L11 Wouldn't this cause an issue with the part where we throw an error if the attribute's value deviates from that listed in the table?

taylor13 commented 2 years ago

I'm not that familiar with the input4MIPs CVs, but I think if the CV says this source provides "observations", then the user should not be able to change that to "derived". Instead the user should request that the source reference documentation be changed to indicate that the product is "derived", not "observations".

mauzey1 commented 2 years ago

Okay, so the warning should be elevated to an error indicating the correct value that an attribute should be?

durack1 commented 2 years ago

This maybe an inconsistency that requires a cleanup in the input4mips-CMOR-tables. I've not thought much about this, but will review. Not all forcing datasets are "observations" either

taylor13 commented 2 years ago

yes, let's give this some thought before making changes.