IS-ENES-Data / QA-DKRZ

Quality and CF checker of meta-data in climate related data sets (NetCDF files)

8_8b: Checksum of layout --- unclear message #18

Closed zklaus closed 5 years ago

zklaus commented 5 years ago

It is difficult to understand what the error message

Variable <plev>: Checksum of layout or data has changed across experiments, now <905492458>, previously  <1719207667>.

actually means.

I understand that, as a general error message, it may be difficult to include much information about the underlying change that caused the checksum to change.

However, the term "experiment" here is a bit unclear and seems too loaded with a whole bunch of meanings. Maybe this can be explained better? Or substituted with a clearer term?

h-dh commented 5 years ago

A checksum is calculated from the values of a coordinate axis (as long as the axis is not unlimited) and stored in $QARESULTS/tables, in files whose names begin with 'pt' (unless a specific prefix was selected in the configuration). By default, the name of the 'pt_' file is determined by components of the path, selected by two configuration statements. E.g. for CMIP6, where a path component CMIP6 is expected:

DRS_PATH_BASE=CMIP6
PT_PATH_INDEX=2,3,4

The index counts from DRS_PATH_BASE (zero-based).

The value stored in these files is used, for instance, to ensure that the same coordinate values are found in, say, historical and a related scenario.

However, if PT_PATH_INDEX is specified such that unrelated experiments with different coordinates or model layouts end up sharing the same 'pt_' file, then this produces a false annotation. Such annotations can be switched off in the table PROJECT_check-list.conf.
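The naming rule described above can be sketched in a few lines of Python. This is only an illustration, not QA-DKRZ's actual implementation: the function `pt_file_name`, the `/data` prefix, the member label `r1i1p1f1` and the version directory are hypothetical, and any file extension is left out.

```python
from pathlib import PurePosixPath


def pt_file_name(path, drs_path_base="CMIP6", pt_path_index=(2, 3, 4), prefix="pt"):
    """Join the selected path components into a pt file name (illustrative only)."""
    parts = PurePosixPath(path).parts
    # DRS_PATH_BASE itself is index 0; the indices count from there.
    base = parts.index(drs_path_base)
    selected = [parts[base + i] for i in pt_path_index]
    return "_".join([prefix, *selected])


# Hypothetical CMIP6-style path; only the ordering of the components matters here.
example = ("/data/CMIP6/CMIP/EC-Earth-Consortium/EC-Earth3-Veg/"
           "piControl/r1i1p1f1/day/hus/gr/v20190619")

print(pt_file_name(example))
print(pt_file_name(example, pt_path_index=(2, 3, 6, 8)))
```

With the default indices the components for institution, model and experiment are picked; adding further indices narrows the set of files that share one table.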

cheers, hdh


-- Dr. Heinz-Dieter Hollweg Abteilung Datenmanagement Deutsches Klimarechenzentrum GmbH (DKRZ) Bundesstraße 45a • D-20146 Hamburg • Germany

Phone: +49 40 460094-212 FAX: +49 40 460094-270 Email: hollweg@dkrz.de URL: www.dkrz.de

Geschäftsführer: Prof. Dr. Thomas Ludwig Sitz der Gesellschaft: Hamburg Amtsgericht Hamburg HRB 39784

zklaus commented 5 years ago

Thanks for the clarification! I am not sure I follow completely, but I think we are running into problems here because in CMIP6 the same variable name is sometimes used for different coordinate values. In particular, for atmospheric variables on pressure levels, a multitude of different "plev" coordinates is defined.

oloapinivad commented 5 years ago

Thanks @h-dh for the clarification. The pt file in question is attached here: pt_EC-Earth-Consortium_EC-Earth3_historical.txt

This is the result of a 5-year analysis of a single historical CMIP6 experiment. The annotation involves only Eday variables and is shown below:

        {
            "DRS_6": [ "Eday" ],
            "DRS_7": [ "wap", "ua", "va", "zg", "ta", "hus" ],
            "DRS_8": [ "gr" ],
            "annotation": "Variable <plev>: Checksum of layout or data has changed across experiments, now <905492458>, previously  <1719207667>.",
            "tag": "8_8b",
            "severity": "Warning"
        },

        {
            "DRS_6": [ "Eday" ],
            "DRS_7": [ "hus", "wap", "ua", "zg", "ta", "va" ],
            "DRS_8": [ "gr" ],
            "annotation": "Variable <plev>: Checksum of layout or data has changed across sub-temporal files, now <905492458>, previously  <1719207667>.",
            "tag": "8_8a",
            "severity": "Warning"
        },

Looking at the pt file, it appears quite clear that the distinction between coordinates is based on the frequency and the variable name, not on the table. This would mean that in some specific instances (for example the day and Eday tables, which share the same frequency and variables but have a different number of pressure levels) we may get false annotations. Is this interpretation correct?
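The suspected collision can be sketched as follows. This is a toy model, not QA-DKRZ code: the checksum algorithm is not stated in this thread, so CRC32 is used as a stand-in, the function names are hypothetical, and the pressure levels are made-up values rather than the real CMIP6 axes.

```python
import zlib


def layout_checksum(values):
    """Stand-in checksum over fixed coordinate values (CRC32 is an assumption)."""
    return zlib.crc32(",".join(f"{v:g}" for v in values).encode("ascii"))


def layout_changed(table, key, values):
    """Store the checksum on first sight; afterwards report whether it differs."""
    new = layout_checksum(values)
    return table.setdefault(key, new) != new


# Made-up pressure axes of different length, standing in for day vs. Eday.
plev_short = [100000, 85000, 70000, 50000]
plev_long = plev_short + [25000, 10000]

# Keyed on (variable, frequency): both tables map to the same entry.
by_freq = {}
print(layout_changed(by_freq, ("zg", "day"), plev_short))  # file from table day
print(layout_changed(by_freq, ("zg", "day"), plev_long))   # file from table Eday

# Keyed on (variable, table): the two entries are kept apart.
by_table = {}
print(layout_changed(by_table, ("zg", "day"), plev_short))
print(layout_changed(by_table, ("zg", "Eday"), plev_long))
```

In this toy model only the second call on `by_freq` reports a change, which corresponds to the kind of annotation shown above; with the table in the key, no change is reported.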

h-dh commented 5 years ago

The entries in the table used for consistency checks are listed (but not sorted) by the pair 'var, freq'. The attributes of coordinates are then associated via 'aux=' sub-entries.

If you think that 'var, freq' isn't unique, you should add the index of the path component for the table_ID (or something similar) to PT_PATH_INDEX.

cheers, hdh

zklaus commented 5 years ago

One thing that is confusing me is that the documentation (in the form of comments in the CMIP6_qa.conf file) and your comments here seem to suggest that the indices in PT_PATH_INDEX are counted from the leaf directory going up. If this were true the filename resulting from the default PT_PATH_INDEX should be pt_gr_hus_day. However, the actual resulting filename is pt_EC-Earth-Consortium_EC-Earth3-Veg_piControl, which is consistent with a counting from the top.

For reference, here is the section from CMIP6_qa.conf:

# Purpose: automatic determination of the project table name.
# Usage is the same as for LOG_PATH_INDEX.
# Note that meta-data of each file will be checked against the project table.
# So, components representing a different layout explicitely,
# e.g. different driving-models (index 7) could have different calendars.
# Note that the example provides index in [], which are not part of the name.
# /path[10]/AFR-44[9]/SMHI[8]/CCCma-CanESM2[7]/historical[6]/r1i1p1[5]/SMHI-RCA4[4]/v1[3]/dayr[2]/clh[1]
  PT_PATH_INDEX=2,3,4

What is the intended behavior? Should we clarify the documentation, perhaps with a CMIP6 example drs here?

h-dh commented 5 years ago

Indeed, the documentation is obsolete, a remnant of a former setting. Starting at DRS_PATH_BASE, index counting goes to the right (I usually avoid wording like leaf or root).

Thanks for reporting this bug, which will be fixed with the next update.
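To make the two readings concrete, here is a small Python comparison of the obsolete documented scheme (index 1 is the last path component, counting to the left) and the actual one (index 0 is DRS_PATH_BASE, counting to the right). The helper `select_components`, the member label and the version directory are hypothetical; this is a sketch, not the tool's code.

```python
def select_components(path, indices, base=None):
    """Pick path components by index under one of the two counting schemes."""
    parts = [p for p in path.split("/") if p]
    if base is None:
        # Obsolete documentation: the last component is index 1, counting leftwards.
        return [parts[-i] for i in indices]
    # Actual behaviour per this thread: base is index 0, counting rightwards.
    start = parts.index(base)
    return [parts[start + i] for i in indices]


# Hypothetical CMIP6-style path.
path = "CMIP6/CMIP/EC-Earth-Consortium/EC-Earth3-Veg/piControl/r1i1p1f1/day/hus/gr/v20190619"

print("pt_" + "_".join(select_components(path, (2, 3, 4))))
print("pt_" + "_".join(select_components(path, (2, 3, 4), base="CMIP6")))
```

Under these assumptions the first line gives the name expected from the old comments in CMIP6_qa.conf, and the second gives the name that was actually observed.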

cheers, hdh


zklaus commented 5 years ago

Great, thanks!

In your view, does this mean that the value should be changed to

PT_PATH_INDEX=8,7,6

as well, or does it already have the value you intended?

h-dh commented 5 years ago

I would like to close this thread with some final remarks.

The expression 'layout' denotes coordinates with fixed values. The checksum representing the layout is not calculated for variables or coordinates depending on any UNLIMITED dimension.

a) The expression 'experiment' indeed has a different meaning in CMOR and in QA-DKRZ. The latter uses it with the meaning 'project_activity_institution_model_CMOR-experiment', e.g. CMIP6_CMIP_IPSL_IPSL..._piControl. The idea is that, for instance, everything belonging to the same 'historical' should have identical coordinate values. It should be clear that this applies only to ordinary simulations, and not in general to simulated or genuine observations. Such data, however, would usually have a Discrete Geometry.

Whenever a layout change is annotated but the case is not suited for such a test, one should disable the test for the given variable.

b) The intention behind the automatic naming of the text file that is consulted for consistency checks, by means of the qa-dkrz options

DRS_PATH_BASE=CMIP6
PT_PATH_INDEX=2,3,4

is that entities as described in a) should be distinct. Often a further distinction is needed for (sub-)model specific requirements (e.g. atmos vs. ocean). The latter in particular could only be solved by qa-dkrz by elaborate means, so this is left to the user. By the way, specific names can be given with the option PT_EXPLICIT_NAME.

c) @zklaus: I don't see problems with e.g. 'plev' for different variables, because I expect unique pairs of varname:plev.

d) The good point that frequency is not suited for CMIP6 has been addressed. table_id now replaces frequency for CMIP5/6, but not for CORDEX.

e) It's always a good idea to test a qa-dkrz run for new cases with the option --next.

oloapinivad commented 5 years ago

Thanks @h-dh. I just wanted to mention that, for some reason, the replacement of the frequency by the table mentioned in d) does not work for all variables.

Indeed, we have experiments where both a day table and an Eday table exist. They are not correctly distinguished: if I use the default PT_PATH_INDEX=2,3,4, I still get the checksum error and the pt file has only variable+frequency (e.g. zg,day,dims=time plev lat lon,standard_name=geopotential_height [...])

As you mentioned above, this can easily be tuned to avoid the error (for instance, we are now running with PT_PATH_INDEX=2,3,6,8).