cedadev / checksit

File-checking made simple
BSD 3-Clause "New" or "Revised" License
1 stars 1 forks source link

Specs approach for AMOF files #17

Open joshua-hampton opened 1 year ago

joshua-hampton commented 1 year ago

As with issue #16, I have also made some progress on the idea of using specs to check AMOF files. In this approach, I have created a number of spec files, one that looks at land variables, and one that does common global attributes.

To accommodate this, I made some changes to generic.py, where the functions run by the spec checker live

This approach doesn't (currently) check the values of variable attributes, just that they exist. This is better suited for checking valid_min (compare with discussion in issue #16), but not for checking units, for example.

I think this approach will also work well when all variables have the same attributes; however, that's not the case (e.g. quality control variables have flag_values and flag_meanings, which don't exist in other variables). I guess that the specs can be written to do checks on QC variables separately to other variables.

Example of running check against these spec files:

checksit check -t off --specs=amof-land-common,amof-global-attrs-common /gws/nopw/j04/ncas_obs/public/cdao/ncas-radar-wind-profiler-1/ncas-radar-wind-profiler-1_cdao_20221020_snr-winds_low-mode_15min_v1.0.nc

Running with:
    Template: OFF
    Datafile: /gws/nopw/j04/ncas_obs/public/cdao/ncas-radar-wind-profiler-1/ncas-radar-wind-profiler-1_cdao_20221020_snr-winds_low-mode_15min_v1.0.nc

---------------- Running checks ------------------

[WARNING] Template checks switched off!
[FAILED] with 3 errors:

    01. [variable**************:cows]: Does not exist in file.
    02. [dimension**************:fields]: Does not exist in file.
    03. [global-attributes:******:instrument_serial_number]: '' does not match regex pattern '.{3,}'.
joshua-hampton commented 1 year ago

Update through commit 65cdf5c:

In general, these product specific spec checks work, e.g.

$ checksit check -t off --specs=amof-snr-winds ncas-radar-wind-profiler-1_cdao_20221017_snr-winds_high-mode_15min_v1.0.nc 

Running with:
    Template: OFF
    Datafile: ncas-radar-wind-profiler-1_cdao_20221017_snr-winds_high-mode_15min_v1.0.nc

---------------- Running checks ------------------

[WARNING] Template checks switched off!
[INFO] File is compliant!

and

$ checksit check -t off --specs=amof-surface-met ncas-aws-10_iao_20220301_surface-met_v1.0.nc 

Running with:
    Template: OFF
    Datafile: ncas-aws-10_iao_20220301_surface-met_v1.0.nc

---------------- Running checks ------------------

[WARNING] Template checks switched off!
[FAILED] with 8 errors:

    01. [variable**************:downwelling_longwave_flux_in_air]: Does not exist in file.
    02. [variable**************:downwelling_shortwave_flux_in_air]: Does not exist in file.
    03. [variable**************:downwelling_total_irradiance]: Does not exist in file.
    04. [variable**************:net_total_irradiance]: Does not exist in file.
    05. [variable**************:qc_flag_radiation]: Does not exist in file.
    06. [variable**************:qc_flag_downwelling_total_irradiance]: Does not exist in file.
    07. [variable**************:qc_flag_net_total_irradiance]: Does not exist in file.
    08. [dimension**************:index]: Does not exist in file.

In the second case, the variables that checksit reports as not existing do not exist in the file but should not necessarily be in the file - while they are valid variables for that data product, the ncas-aws-10 instrument does not measure radiation, and so the product-specific variables relating to radiation are omitted from the file, rather than being left in empty.

The index dimension should not appear in this spec - while in the middle of writing this I have found the error in make_specs.py and will correct this shortly.

joshua-hampton commented 1 year ago

make_specs.py fixed in commit d527361, with updated spec file for the surface-met product.

joshua-hampton commented 1 year ago

Commit c72106f adds making spec files for deployment modes, e.g. for land

joshua-hampton commented 1 year ago

Added section to check.py which, if template is auto (default), checks to see if the file given is a netCDF file, if so checks Conventions global attribute against known AMOF standard Conventions vocab. If so, gets spec files for that file based on data product and deployment mode, sets template to off, and runs checksit using those specs. For example:

$ checksit check ncas-radar-wind-profiler-1_cdao_20221017_snr-winds_high-mode_15min_v1.0.nc 

AMOF file detected, finding correct spec files

Running with:
    Template: OFF
    Datafile: ncas-radar-wind-profiler-1_cdao_20221017_snr-winds_high-mode_15min_v1.0.nc

---------------- Running checks ------------------

[WARNING] Template checks switched off!
[INFO] File is compliant!

Doesn't currently add in the global attributes spec file because that has not been created correctly/in a useful manner. Also still need to work on product-specific attributes.

I've not tested this against a non-netCDF file, but I have tested against a non-AMOF netCDF file, which is correctly ignored in the AMOF-checking process.

joshua-hampton commented 1 year ago

One issue with global attributes spec was the licence global attribute, which includes a URL. The check code is splitting the string based on a colon, which meant half of the attribute was not included. This is fixed in commit defa3b1, joining the list with a colon, and only applies to the regex checks.

joshua-hampton commented 1 year ago

In the general global attributes, there are two that have vocab checks - source and platform.

To help toward this, commit 6c5b8a0 introduces the idea of wildcard to cvs.py. The wildcard can be used in specifying the vocab check, however at the moment it will only work if the wildcard is the last argument in the check. This means that platform will work, but source will not. The reason for this is that the vocab check is going through dictionaries until it gets to the required value(s) at the end, either a single value or a list of values. The wildcard returns a list of values. If the wildcard is used before the end, the vocab check will get confused at the next stage trying to take a key of a list. I think I would too.

joshua-hampton commented 1 year ago

Commit 1d75c78 allows for wildcard to appear anywhere in specs vocab check.

A (hopefully not too confusing) generic example to help (mostly me) understand, considering the very generic dictionary

{'a' : {'b' : 'c', 'd' : {'e' : 'f', 'g' : 'h'} },
 'i' : {'b' : 'j', 'd' : {'e' : 'k', 'g' : 'l'} },
 'm' : {'b' : 'n', 'd' : {'e' : 'o', 'g' : 'p'} } }

The following checks get the (also) following results:

"__all__" => ['a', 'i', 'm']
"a:__all__" => ['b', 'd']
"a:d:__all__" => ['e', 'g']
"__all__:b" => ['c', 'j', 'n']
"__all__:d" => [{'e' : 'f', 'g' : 'h'}, {'e': 'k', 'g' : 'l'}, {'e' : 'o', 'g' : 'p'}]
"__all__:d:e => ['f', 'k', 'o']
joshua-hampton commented 1 year ago

Final issue that needs dealing with for AMOF spec checks (I think) - how to deal with product-specific variables that are not in the netCDF file? The AMOF standard allows for these variables to not be there, but at the moment checksit expects all possible variables to be present. One option - create a function similar to checksit.generic.check_var that returns info or warning rather than error if variable is not present?

agstephens commented 1 year ago

Great work @joshua-hampton,

I wonder if we should have a way of specifying that any component (i.e. a dimension, variable or an attribute) could be optional. If so, we just need a nice clean way of identifying them.

joshua-hampton commented 1 year ago

Couple of thoughts on how we could do that:

  1. Have the option to append names with something like __optional__, then within the checksit.generic.check_XXX function check to see if that string is present before running check on that value.
  2. Add an extra parameter to the checksit.generic.check_XXX functions called required which takes a boolean value for each variable/dimension/attribute name being passed to the function (or the inverse parameter called optional)
joshua-hampton commented 1 year ago

Another thought - global attributes sampling_interval and averaging_interval have a Compliance Checking Rule in the spreadsheets that they should be a string with at least two characters - do we want to add an extra check that they should have units of time in them?

joshua-hampton commented 1 year ago

Idea on applying optional status to things in spec checks (commit https://github.com/cedadev/checksit/commit/c4fe779ae940517768963cf542d63477b9653549). When making spec files for data products, variables and dimensions being checked have :__OPTIONAL__ appended to them (e.g. see amof-surface-met.yml). The functions in generic.py then check the variables and dimensions to see if :__OPTIONAL__ is in each one. If it is, the check still happens, but (currently) no error is reported if the variable/dimension is not present. If :__OPTIONAL__ is not in the variable/dimension name, then the check happens as it previously did.

agstephens commented 1 year ago

@joshua-hampton: sounds good 👍

joshua-hampton commented 1 year ago

Commit https://github.com/cedadev/checksit/commit/31a02090abad1a61dc25d3f5c4b5e5702d87bb4e - Added "warnings" as a returned parameter from all functions in generic.py. When optional product-specific dimensions or variables are missing, a warning is added, which passes back up the chain through specs.py to check.py, where they are printed out along with errors/file compliance, e.g.

(checksit_env) (base) [earjham@rsg1-202206011551 checksit]$ checksit check ncas-aws-10_iao_20220301_surface-met_v1.0.nc

AMOF file detected, finding correct spec files

Running with:
    Template: OFF
    Spec Files: ['amof-common-land', 'amof-surface-met', 'amof-global-attrs']
    Datafile: ncas-aws-10_iao_20220301_surface-met_v1.0.nc

---------------- Running checks ------------------

[WARNING] Template checks switched off!
[FAILED] with 5 errors:

    01. [global-attributes:******:creator_email]*** Value 'CHANGE: E-mail address of the person who made the file. Valid email' does not match regex rule: 'valid-email'.
    02. [global-attributes:******:creator_url]*** Value 'CHANGE: ORCID URL of the person who made the file. Valid URL' does not match regex rule: 'valid-url'.
    03. [global-attributes:******:calibration_certification_date]*** Value 'CHANGE: when bought' does not match regex rule: 'datetime-or-na'.
    04. [global-attributes:******:processing_level]*** Value '1' is not of required type: 'integer'.
    05. [global-attributes:******:platform_altitude]*** Value 'CHANGE: Altitude of the deployment position above MSL. Enter Not Applicable for air. Exact match: <number> m' does not match regular expression: '^\d+\.?\d*\sm$'.

[WARNING] 7 warnings about file:

    01. [variable**************:downwelling_longwave_flux_in_air]: Optional variable does not exist in file.
    02. [variable**************:downwelling_shortwave_flux_in_air]: Optional variable does not exist in file.
    03. [variable**************:downwelling_total_irradiance]: Optional variable does not exist in file.
    04. [variable**************:net_total_irradiance]: Optional variable does not exist in file.
    05. [variable**************:qc_flag_radiation]: Optional variable does not exist in file.
    06. [variable**************:qc_flag_downwelling_total_irradiance]: Optional variable does not exist in file.
    07. [variable**************:qc_flag_net_total_irradiance]: Optional variable does not exist in file.

I've also added a flag to the check command in cli.py - -w/--ignore-warnings which means the warnings don't print out, e.g.

(checksit_env) (base) [earjham@rsg1-202206011551 checksit]$ checksit check -w ncas-aws-10_iao_20220301_surface-met_v1.0.nc

AMOF file detected, finding correct spec files

Running with:
    Template: OFF
    Spec Files: ['amof-common-land', 'amof-surface-met', 'amof-global-attrs']
    Datafile: ncas-aws-10_iao_20220301_surface-met_v1.0.nc

---------------- Running checks ------------------

[WARNING] Template checks switched off!
[FAILED] with 5 errors:

    01. [global-attributes:******:creator_email]*** Value 'CHANGE: E-mail address of the person who made the file. Valid email' does not match regex rule: 'valid-email'.
    02. [global-attributes:******:creator_url]*** Value 'CHANGE: ORCID URL of the person who made the file. Valid URL' does not match regex rule: 'valid-url'.
    03. [global-attributes:******:calibration_certification_date]*** Value 'CHANGE: when bought' does not match regex rule: 'datetime-or-na'.
    04. [global-attributes:******:processing_level]*** Value '1' is not of required type: 'integer'.
    05. [global-attributes:******:platform_altitude]*** Value 'CHANGE: Altitude of the deployment position above MSL. Enter Not Applicable for air. Exact match: <number> m' does not match regular expression: '^\d+\.?\d*\sm$'.
agstephens commented 1 year ago

Another thought - global attributes sampling_interval and averaging_interval have a Compliance Checking Rule in the spreadsheets that they should be a string with at least two characters - do we want to add an extra check that they should have units of time in them?

@joshua-hampton: I think all structural rules like this should be referred to Barb.

joshua-hampton commented 1 year ago

Testing this with a file that isn't mine has produced some results of note:

  1. This particular file uses the global attribute feature_type instead of featureType. In this case, checksit has correctly identified an error with the file, however I wonder if it would be helpful/possible to write something that identifies a "close mistake" and offers a suggestion on how to fix.
  2. Similarly, this file calls a dimension altitudes rather than altitude. As altitude is an optional dimension, checksit raises a warning about the file, but would still call it compliant, which in this case is probably wrong. Again, a "did you mean this" or "I wasn't expecting to see this" would provide the user with more information when they might be thinking "I'm sure I put altitude in".
  3. The platform global attribute requires an exact match to one of the platform_ids here. For an instrument not at any of those locations (e.g. anything on fieldwork I'd guess), none of these options might make sense.
  4. The spec checks do not check the value of a variable attribute, just for it's existence. There are a number of variable attributes that have defined values (e.g. long_name) that might be worth checking. Some errors in the file I spotted that checksit didn't is with qc_flag:flag_values and qc_flag:flag_meanings, and these are common errors that I've seen from many files (and I made myself once) and fail a CF checker:
    • the flag_values are a single string looking like "0b,1b,2b...", which looks right when looking at the spreadsheet, however the CF conventions state they should be the same type as the variable to which it is attached, and contains a list of the possible flag values, i.e. a list of byte values 0b 1b 2b....
    • flag_meanings also appear to have been copied from the spreadsheet, and are given as a single string separated by \n, CF conventions state they should be space separated.
agstephens commented 1 year ago

@joshua-hampton: I really like your ideas for giving people hints on what they might have meant.

joshua-hampton commented 1 year ago

I've found another issue, when doing a rule check on a global attribute. Specifically, this one from specs/groups/amof-global-attrs.yml

geospatial_bounds: rule-func:string-of-length:8+

In one netCDF file, the attribute was spelt incorrectly, and dct['global_attributes'].get(attr, UNDEFINED) (line 61 in generic.py) returned UNDEFINED. However, UNDEFINED matches rule-func:string-of-length:8+, and so passed the checker.
Suggested fix - the check for attribute existence is done separately. This may need to be done in other places too.

joshua-hampton commented 1 year ago

Played around a bit with the idea of giving hints on potentially misspelt items (https://github.com/cedadev/checksit/commit/da93da709d73099f347d648a0b8824cdc1c5ccee). If a variable/attribute/dimension specified in the spec file is not found in the netCDF file, a function is called that takes the name of what should be there, produces a set of possible "close matches" defined as up to two errors, and then looks through these close matches to see if any of them match anything in the file. For example, if a file had the variable yr instead of year, and the global attribute sauce rather than source, then checksit would return

1. [variable**************:year]: Does not exist in file. 'yr' was found in this file, should this be 'year'?
2. [global-attributes:**************:source]: Attribute 'source' does not exist. 'sauce' was found in this file, should this be 'source'?

There are a few limitations on this method. Firstly, I've assumed that the first character will be correct. This is mostly because you could otherwise get

23. [dimension**************:altitude]: Optional dimension does not exist in file. 'latitude' was found in this file, should this be 'altitude'?

Secondly, if there are similar variable/attribute/dimension names intended to be in the file, they can come up as suggestions. For example, the snr-winds data product has variables skew_of_beam_1, skew_of_beam_2, and skew_of_beam_3. If one of those was spelt incorrectly, e.g. skew_of_bem_2, you might end up with the following return

1. [variable**************:skew_of_beam_2]: Optional variable does not exist in file. 'skew_of_beam_1' was found in this file, should this be 'skew_of_beam_2'?

The functions that are doing this are currently in checksit/generic.py, but I guess if they are useful they could be moved somewhere else (checksit/utils.py?) to be available elsewhere.

joshua-hampton commented 1 year ago

Commit https://github.com/cedadev/checksit/commit/843444cec16dbf1ab633c83b4dd9ffea71e787a0 - changed specs to check values of variable attributes. Errors such as

[variable**************:time]: Attribute 'long_name' must have definition Time (seconds since 1970-01-01 00:00:00), not Tim (seconds since 1970-01-01 00:00:00).

are now caught.

joshua-hampton commented 1 year ago

Also found a bug when logging in "compact" mode, checksit would attempt to run template check regardless of whether one was required or not (e.g. AMOF netCDFs only using specs). This is fixed in https://github.com/cedadev/checksit/commit/15183b0a88f8c8ec65a4d53117120612cb423cdc

joshua-hampton commented 1 year ago

CLI option to skip spellchecking added in https://github.com/cedadev/checksit/commit/51f7f8e5af094bde15737c4b994ca782db36fe31. Creating a pull request to main with latest changes

joshua-hampton commented 1 year ago

Something that still needs doing, how to work with newer releases of NCAS-GENERAL standard. This will require: