Open joshua-hampton opened 2 years ago
Update through commit 65cdf5c:
check_var
to generic.py and added rules_attrs
keyword argument to check_global_attrs
function to help with global attribute spec checkIn general, these product specific spec checks work, e.g.
$ checksit check -t off --specs=amof-snr-winds ncas-radar-wind-profiler-1_cdao_20221017_snr-winds_high-mode_15min_v1.0.nc
Running with:
Template: OFF
Datafile: ncas-radar-wind-profiler-1_cdao_20221017_snr-winds_high-mode_15min_v1.0.nc
---------------- Running checks ------------------
[WARNING] Template checks switched off!
[INFO] File is compliant!
and
$ checksit check -t off --specs=amof-surface-met ncas-aws-10_iao_20220301_surface-met_v1.0.nc
Running with:
Template: OFF
Datafile: ncas-aws-10_iao_20220301_surface-met_v1.0.nc
---------------- Running checks ------------------
[WARNING] Template checks switched off!
[FAILED] with 8 errors:
01. [variable**************:downwelling_longwave_flux_in_air]: Does not exist in file.
02. [variable**************:downwelling_shortwave_flux_in_air]: Does not exist in file.
03. [variable**************:downwelling_total_irradiance]: Does not exist in file.
04. [variable**************:net_total_irradiance]: Does not exist in file.
05. [variable**************:qc_flag_radiation]: Does not exist in file.
06. [variable**************:qc_flag_downwelling_total_irradiance]: Does not exist in file.
07. [variable**************:qc_flag_net_total_irradiance]: Does not exist in file.
08. [dimension**************:index]: Does not exist in file.
In the second case, the variables that checksit reports as not existing do not exist in the file but should not necessarily be in the file - while they are valid variables for that data product, the ncas-aws-10
instrument does not measure radiation, and so the product-specific variables relating to radiation are omitted from the file, rather than being left in empty.
The index
dimension should not appear in this spec - while in the middle of writing this I have found the error in make_specs.py and will correct this shortly.
make_specs.py fixed in commit d527361, with updated spec file for the surface-met product.
Added section to check.py which, if template
is auto
(default), checks to see if the file given is a netCDF file, if so checks Conventions
global attribute against known AMOF standard Conventions vocab. If so, gets spec files for that file based on data product and deployment mode, sets template
to off
, and runs checksit
using those specs. For example:
$ checksit check ncas-radar-wind-profiler-1_cdao_20221017_snr-winds_high-mode_15min_v1.0.nc
AMOF file detected, finding correct spec files
Running with:
Template: OFF
Datafile: ncas-radar-wind-profiler-1_cdao_20221017_snr-winds_high-mode_15min_v1.0.nc
---------------- Running checks ------------------
[WARNING] Template checks switched off!
[INFO] File is compliant!
Doesn't currently add in the global attributes spec file because that has not been created correctly/in a useful manner. Also still need to work on product-specific attributes.
I've not tested this against a non-netCDF file, but I have tested against a non-AMOF netCDF file, which is correctly ignored in the AMOF-checking process.
One issue with global attributes spec was the licence
global attribute, which includes a URL. The check code is splitting the string based on a colon, which meant half of the attribute was not included. This is fixed in commit defa3b1, joining the list with a colon, and only applies to the regex
checks.
In the general global attributes, there are two that have vocab checks - source
and platform
.
platform
is taken from the AMF_platfrom CV file, with any of the locations being acceptable. In vocab speak - __vocabs__:AMF_CVs/AMF_platform:platform:*
where *
represents anysource
is from the AMF_ncas_instrument (or AMF_community_instrument) CV file, however this then depends on instrument name, i.e. __vocabs__:AMF_CVs/AMF_ncas_instrument:ncas_instrument:<instrument_name>:description
To help toward this, commit 6c5b8a0 introduces the idea of wildcard to cvs.py. The wildcard can be used in specifying the vocab check, however at the moment it will only work if the wildcard is the last argument in the check. This means that platform
will work, but source
will not. The reason for this is that the vocab check is going through dictionaries until it gets to the required value(s) at the end, either a single value or a list of values. The wildcard returns a list of values. If the wildcard is used before the end, the vocab check will get confused at the next stage trying to take a key of a list. I think I would too.
Commit 1d75c78 allows for wildcard to appear anywhere in specs vocab check.
A (hopefully not too confusing) generic example to help (mostly me) understand, considering the very generic dictionary
{'a' : {'b' : 'c', 'd' : {'e' : 'f', 'g' : 'h'} },
'i' : {'b' : 'j', 'd' : {'e' : 'k', 'g' : 'l'} },
'm' : {'b' : 'n', 'd' : {'e' : 'o', 'g' : 'p'} } }
The following checks get the (also) following results:
"__all__" => ['a', 'i', 'm']
"a:__all__" => ['b', 'd']
"a:d:__all__" => ['e', 'g']
"__all__:b" => ['c', 'j', 'n']
"__all__:d" => [{'e' : 'f', 'g' : 'h'}, {'e': 'k', 'g' : 'l'}, {'e' : 'o', 'g' : 'p'}]
"__all__:d:e => ['f', 'k', 'o']
Final issue that needs dealing with for AMOF spec checks (I think) - how to deal with product-specific variables that are not in the netCDF file? The AMOF standard allows for these variables to not be there, but at the moment checksit expects all possible variables to be present. One option - create a function similar to checksit.generic.check_var that returns info or warning rather than error if variable is not present?
Great work @joshua-hampton,
I wonder if we should have a way of specifying that any component (i.e. a dimension, variable or an attribute) could be optional. If so, we just need a nice clean way of identifying them.
Couple of thoughts on how we could do that:
__optional__
, then within the checksit.generic.check_XXX function check to see if that string is present before running check on that value.required
which takes a boolean value for each variable/dimension/attribute name being passed to the function (or the inverse parameter called optional
)Another thought - global attributes sampling_interval
and averaging_interval
have a Compliance Checking Rule in the spreadsheets that they should be a string with at least two characters - do we want to add an extra check that they should have units of time in them?
Idea on applying optional status to things in spec checks (commit https://github.com/cedadev/checksit/commit/c4fe779ae940517768963cf542d63477b9653549). When making spec files for data products, variables and dimensions being checked have :__OPTIONAL__
appended to them (e.g. see amof-surface-met.yml). The functions in generic.py then check the variables and dimensions to see if :__OPTIONAL__
is in each one. If it is, the check still happens, but (currently) no error is reported if the variable/dimension is not present. If :__OPTIONAL__
is not in the variable/dimension name, then the check happens as it previously did.
@joshua-hampton: sounds good 👍
Commit https://github.com/cedadev/checksit/commit/31a02090abad1a61dc25d3f5c4b5e5702d87bb4e - Added "warnings" as a returned parameter from all functions in generic.py. When optional product-specific dimensions or variables are missing, a warning is added, which passes back up the chain through specs.py to check.py, where they are printed out along with errors/file compliance, e.g.
(checksit_env) (base) [earjham@rsg1-202206011551 checksit]$ checksit check ncas-aws-10_iao_20220301_surface-met_v1.0.nc
AMOF file detected, finding correct spec files
Running with:
Template: OFF
Spec Files: ['amof-common-land', 'amof-surface-met', 'amof-global-attrs']
Datafile: ncas-aws-10_iao_20220301_surface-met_v1.0.nc
---------------- Running checks ------------------
[WARNING] Template checks switched off!
[FAILED] with 5 errors:
01. [global-attributes:******:creator_email]*** Value 'CHANGE: E-mail address of the person who made the file. Valid email' does not match regex rule: 'valid-email'.
02. [global-attributes:******:creator_url]*** Value 'CHANGE: ORCID URL of the person who made the file. Valid URL' does not match regex rule: 'valid-url'.
03. [global-attributes:******:calibration_certification_date]*** Value 'CHANGE: when bought' does not match regex rule: 'datetime-or-na'.
04. [global-attributes:******:processing_level]*** Value '1' is not of required type: 'integer'.
05. [global-attributes:******:platform_altitude]*** Value 'CHANGE: Altitude of the deployment position above MSL. Enter Not Applicable for air. Exact match: <number> m' does not match regular expression: '^\d+\.?\d*\sm$'.
[WARNING] 7 warnings about file:
01. [variable**************:downwelling_longwave_flux_in_air]: Optional variable does not exist in file.
02. [variable**************:downwelling_shortwave_flux_in_air]: Optional variable does not exist in file.
03. [variable**************:downwelling_total_irradiance]: Optional variable does not exist in file.
04. [variable**************:net_total_irradiance]: Optional variable does not exist in file.
05. [variable**************:qc_flag_radiation]: Optional variable does not exist in file.
06. [variable**************:qc_flag_downwelling_total_irradiance]: Optional variable does not exist in file.
07. [variable**************:qc_flag_net_total_irradiance]: Optional variable does not exist in file.
I've also added a flag to the check
command in cli.py - -w/--ignore-warnings
which means the warnings don't print out, e.g.
(checksit_env) (base) [earjham@rsg1-202206011551 checksit]$ checksit check -w ncas-aws-10_iao_20220301_surface-met_v1.0.nc
AMOF file detected, finding correct spec files
Running with:
Template: OFF
Spec Files: ['amof-common-land', 'amof-surface-met', 'amof-global-attrs']
Datafile: ncas-aws-10_iao_20220301_surface-met_v1.0.nc
---------------- Running checks ------------------
[WARNING] Template checks switched off!
[FAILED] with 5 errors:
01. [global-attributes:******:creator_email]*** Value 'CHANGE: E-mail address of the person who made the file. Valid email' does not match regex rule: 'valid-email'.
02. [global-attributes:******:creator_url]*** Value 'CHANGE: ORCID URL of the person who made the file. Valid URL' does not match regex rule: 'valid-url'.
03. [global-attributes:******:calibration_certification_date]*** Value 'CHANGE: when bought' does not match regex rule: 'datetime-or-na'.
04. [global-attributes:******:processing_level]*** Value '1' is not of required type: 'integer'.
05. [global-attributes:******:platform_altitude]*** Value 'CHANGE: Altitude of the deployment position above MSL. Enter Not Applicable for air. Exact match: <number> m' does not match regular expression: '^\d+\.?\d*\sm$'.
Another thought - global attributes
sampling_interval
andaveraging_interval
have a Compliance Checking Rule in the spreadsheets that they should be a string with at least two characters - do we want to add an extra check that they should have units of time in them?
@joshua-hampton: I think all structural rules like this should be referred to Barb.
Testing this with a file that isn't mine has produced some results of note:
feature_type
instead of featureType
. In this case, checksit has correctly identified an error with the file, however I wonder if it would be helpful/possible to write something that identifies a "close mistake" and offers a suggestion on how to fix.altitudes
rather than altitude
. As altitude
is an optional dimension, checksit raises a warning about the file, but would still call it compliant, which in this case is probably wrong. Again, a "did you mean this" or "I wasn't expecting to see this" would provide the user with more information when they might be thinking "I'm sure I put altitude in".platform
global attribute requires an exact match to one of the platform_ids here. For an instrument not at any of those locations (e.g. anything on fieldwork I'd guess), none of these options might make sense.long_name
) that might be worth checking. Some errors in the file I spotted that checksit didn't is with qc_flag:flag_values
and qc_flag:flag_meanings
, and these are common errors that I've seen from many files (and I made myself once) and fail a CF checker:
flag_values
are a single string looking like "0b,1b,2b..."
, which looks right when looking at the spreadsheet, however the CF conventions state they should be the same type as the variable to which it is attached, and contains a list of the possible flag values
, i.e. a list of byte values 0b 1b 2b...
.flag_meanings
also appear to have been copied from the spreadsheet, and are given as a single string separated by \n
, CF conventions state they should be space separated.@joshua-hampton: I really like your ideas for giving people hints on what they might have meant.
I've found another issue, when doing a rule check on a global attribute. Specifically, this one from specs/groups/amof-global-attrs.yml
geospatial_bounds: rule-func:string-of-length:8+
In one netCDF file, the attribute was spelt incorrectly, and dct['global_attributes'].get(attr, UNDEFINED)
(line 61 in generic.py) returned UNDEFINED
. However, UNDEFINED
matches rule-func:string-of-length:8+
, and so passed the checker.
Suggested fix - the check for attribute existence is done separately. This may need to be done in other places too.
Played around a bit with the idea of giving hints on potentially misspelt items (https://github.com/cedadev/checksit/commit/da93da709d73099f347d648a0b8824cdc1c5ccee). If a variable/attribute/dimension specified in the spec file is not found in the netCDF file, a function is called that takes the name of what should be there, produces a set of possible "close matches" defined as up to two errors, and then looks through these close matches to see if any of them match anything in the file. For example, if a file had the variable yr
instead of year
, and the global attribute sauce
rather than source
, then checksit would return
1. [variable**************:year]: Does not exist in file. 'yr' was found in this file, should this be 'year'?
2. [global-attributes:**************:source]: Attribute 'source' does not exist. 'sauce' was found in this file, should this be 'source'?
There are a few limitations on this method. Firstly, I've assumed that the first character will be correct. This is mostly because you could otherwise get
23. [dimension**************:altitude]: Optional dimension does not exist in file. 'latitude' was found in this file, should this be 'altitude'?
Secondly, if there are similar variable/attribute/dimension names intended to be in the file, they can come up as suggestions. For example, the snr-winds
data product has variables skew_of_beam_1
, skew_of_beam_2
, and skew_of_beam_3
. If one of those was spelt incorrectly, e.g. skew_of_bem_2
, you might end up with the following return
1. [variable**************:skew_of_beam_2]: Optional variable does not exist in file. 'skew_of_beam_1' was found in this file, should this be 'skew_of_beam_2'?
The functions that are doing this are currently in checksit/generic.py
, but I guess if they are useful they could be moved somewhere else (checksit/utils.py
?) to be available elsewhere.
Commit https://github.com/cedadev/checksit/commit/843444cec16dbf1ab633c83b4dd9ffea71e787a0 - changed specs to check values of variable attributes. Errors such as
[variable**************:time]: Attribute 'long_name' must have definition Time (seconds since 1970-01-01 00:00:00), not Tim (seconds since 1970-01-01 00:00:00).
are now caught.
Also found a bug when logging in "compact" mode, checksit would attempt to run template check regardless of whether one was required or not (e.g. AMOF netCDFs only using specs). This is fixed in https://github.com/cedadev/checksit/commit/15183b0a88f8c8ec65a4d53117120612cb423cdc
CLI option to skip spellchecking added in https://github.com/cedadev/checksit/commit/51f7f8e5af094bde15737c4b994ca782db36fe31. Creating a pull request to main with latest changes
Something that still needs doing, how to work with newer releases of NCAS-GENERAL standard. This will require:
As with issue #16, I have also made some progress on the idea of using specs to check AMOF files. In this approach, I have created a number of spec files, one that looks at land variables, and one that does common global attributes.
To accommodate this, I made some changes to generic.py, where the functions run by the spec checker live
check_global_attrs
: added regex checkscheck_var_exists
andcheck_dim_exists
. Quite self-explanatoryThis approach doesn't (currently) check the values of variable attributes, just that they exist. This is better suited for checking
valid_min
(compare with discussion in issue #16), but not for checkingunits
, for example.I think this approach will also work well when all variables have the same attributes; however, that's not the case (e.g. quality control variables have
flag_values
andflag_meanings
, which don't exist in other variables). I guess that the specs can be written to do checks on QC variables separately to other variables.Example of running check against these spec files: