aodn / imos-toolbox

Graphical tool for QC'ing and NetCDF'ing oceanographic datasets
GNU General Public License v3.0

high-level compliance checker #740

Open ocehugo opened 3 years ago

ocehugo commented 3 years ago

The Toolbox has no dedicated function to evaluate whether a dataset (or a netcdf file) complies with a defined standard. Most compliance currently relies on coding practices, manual inspection, or individual testing of attributes/variables/dimensions.

At the moment, the toolbox creates netcdf fields from templates (done by makeNetCDFCompliant), and some basic validations occur at export time (in exportNetcdf). Most of these validations are ad hoc, and the templates are compliant only to a certain degree, since fields still need to be filled in. Thus, protection against invalid inputs is limited and adherence to the conventions is not strictly guaranteed (see #737 for an example).

The new +IMOS package improves on this by controlling the creation of variables/dimensions through argument inspection and cross-checking, but this is not enough, since datasets are still modified after creation all over the code.

Ideally, an interface for adding things to a dataset would be the best solution (e.g. CRUD-like). Another option would be to explicitly validate data fields before exporting. This has two avenues: a. evaluate the state of the toolbox dataset struct, or b. evaluate the created netcdf file. The latter, however, is already done by python tools (cc-imos-checker/cf-checker), so a further option is to use them from within matlab.

All four options have pros and cons:

  1. Creating a CRUD-like interface that provides validation and always keeps a dataset valid (conventional) is powerful, but this kind of abstraction incurs a deep redesign. The number of code changes is large and, given the lack of test coverage, quite costly. For example, the IMOS package is still barely used because tests for the older parsers/functions are nonexistent.

  2. Implementing a schema validator to verify a dataset before exporting is very close to what cc-imos-checker does, and would warn or block the user early, before files are created. However, the tool should be generic enough to evaluate different validation schemas (e.g. imos, cf, or anything else).

  3. Implementing a schema validator to be run after exporting a dataset to netcdf is the same as writing a new cc-imos-checker in matlab. This is clearly more wasteful than option 2, since it involves creating files, reading them back, and using the matlab netcdf API/interface.

  4. Implementing calls to the cf-checker and cc-imos-checker python code at export time would not require any duplicated coding effort, but would require distributing that software with the toolbox. This involves managing their versions and installation, interfacing the proper calls, and handling all the cross-language requirements.

I believe 4 should be investigated first, followed by 2. The only requirement for 4 is to investigate the cross-language support and how that affects the different distribution avenues used in the toolbox (mostly the binary package).
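To make the cross-language question in option 4 more concrete, here is a minimal sketch of the wrapper side, written in Python only for brevity. It assumes that the checker is available as an external command and that a zero exit status means the file passed; both are assumptions that need to be verified against the actual tools. `run_checker` and `CheckerResult` are hypothetical names, and the command line is deliberately left to the caller, since the real arguments of cf-checker and cc-imos-checker are not spelled out here.

```python
import subprocess
import sys
from dataclasses import dataclass


@dataclass
class CheckerResult:
    passed: bool  # True when the command exited with status 0 (assumed to mean "compliant")
    report: str   # combined stdout/stderr text produced by the command


def run_checker(command, timeout=300):
    """Run an external checker command and capture its outcome.

    `command` is the full argument list, e.g. the checker executable followed
    by the path of the exported file. It is supplied by the caller, so this
    sketch makes no claim about the real command-line options of any checker.
    """
    try:
        proc = subprocess.run(
            command, capture_output=True, text=True, timeout=timeout
        )
    except FileNotFoundError:
        # The checker is not installed or not on the PATH: a distribution problem.
        return CheckerResult(False, f"checker not found: {command[0]}")
    except subprocess.TimeoutExpired:
        return CheckerResult(False, f"checker timed out after {timeout} s")
    return CheckerResult(proc.returncode == 0, proc.stdout + proc.stderr)


if __name__ == "__main__":
    # Stand-in command: the Python interpreter itself plays the role of a checker.
    result = run_checker([sys.executable, "-c", "print('all checks passed')"])
    print(result.passed, result.report.strip())
```

The "checker not found" branch is the case that matters most for the binary package: the export step should degrade gracefully when the python tools are not installed, rather than fail.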

For 2, we already have some related functionality (e.g. Util/Schema). The bulk of the work is selecting the right abstraction with matlab objects and rewriting the rules from the cc-imos-checker/cf code in a declarative way.
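As an illustration of what "declarative" could mean for option 2, here is a minimal sketch in which rules are plain data and one generic function applies them to an in-memory dataset. It is written in Python for brevity and uses a nested dict as a stand-in for the toolbox dataset struct. `Rule`, `lookup`, `validate` and `EXAMPLE_SCHEMA` are hypothetical names, and the example rules are invented for illustration; they are not the actual IMOS or CF rules, and this is not how Util/Schema works.

```python
from dataclasses import dataclass
from typing import Any, Callable, Optional


@dataclass(frozen=True)
class Rule:
    path: tuple                                    # keys to follow inside the nested dataset
    required: bool = True                          # report the field when it is absent
    check: Optional[Callable[[Any], bool]] = None  # predicate applied to the value when present
    message: str = ""                              # text reported when the predicate fails


_MISSING = object()  # sentinel, so that None can still be a legitimate field value


def lookup(dataset, path):
    """Follow `path` through nested dicts; return _MISSING if any key is absent."""
    node = dataset
    for key in path:
        if not isinstance(node, dict) or key not in node:
            return _MISSING
        node = node[key]
    return node


def validate(dataset, schema):
    """Return a list of human-readable violations; an empty list means no rule failed."""
    problems = []
    for rule in schema:
        name = "/".join(rule.path)
        value = lookup(dataset, rule.path)
        if value is _MISSING:
            if rule.required:
                problems.append(f"{name}: missing")
            continue
        if rule.check is not None and not rule.check(value):
            problems.append(f"{name}: {rule.message or 'invalid value'}")
    return problems


# Hypothetical rule set for illustration only; NOT the actual IMOS or CF rules.
EXAMPLE_SCHEMA = [
    Rule(("attributes", "title"),
         check=lambda v: isinstance(v, str) and v.strip() != "",
         message="must be a non-empty string"),
    Rule(("variables", "TEMP", "units"),
         check=lambda v: isinstance(v, str),
         message="must be a string"),
    Rule(("attributes", "comment"), required=False,
         check=lambda v: isinstance(v, str),
         message="must be a string"),
]
```

Because the schema is just a list, supporting another convention (imos, cf, or anything else) would mean supplying a different list rather than writing new validation code. Calling `validate(dataset, EXAMPLE_SCHEMA)` before export would return every violation at once, so the user can be warned or blocked before any file is created.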