aodn / python-aodntools

Repository for templates and code relating to generating standard NetCDF files for the Australia Ocean Data Network
GNU Lesser General Public License v3.0

Appropriate validation at different stages of dataset creation #14

Closed mhidas closed 5 years ago

mhidas commented 6 years ago

There are different levels of validation we can apply at different stages:

I think validation can be applied at the beginning, but checking for completeness and consistency can be left until just before creating the file. The code is doing something close to this already, but we need to make sure it's doing the right amount of validation at each stage.

ocehugo commented 6 years ago

I would rename it to match the usage:

  1. Template validation [a part of a netcdf file: a writable/non-writable structure]
  2. Netcdf validation [a writable structure]

mhidas commented 5 years ago

@ocehugo I'm not sure I understand your suggestion. How do the concepts of schema validation, completeness and consistency (as described above) map to your names?

I think these concepts are independent of whether you're talking about a netcdf file or a template. Actually the only difference is that a DatasetTemplate object is able to represent an incomplete, inconsistent template, but you can't write a netcdf until you've satisfied all three conditions.

(So maybe I've just answered my own question...)

mhidas commented 5 years ago

Just to get it clear in my head before I start messing around in the code, here's what I'm thinking:

At template creation time, just validate the dictionaries (or JSON) provided against the schema we have already specified. Checks include

At netCDF writing time (just before attempting to write the file)

  1. Validate the schema again, as above;
  2. Ensure that all the necessary special keys are present, automatically filling in missing information if possible (e.g. set the "_datatype" attribute from the type of the array assigned to "_data");
  3. Ensure consistency
    • between "_dimensions" of variables and those defined for the dataset (auto update their sizes from the data arrays if not set in template); and
    • between "_datatype" and type of "_data" for each variable.
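
The two stages above could be sketched roughly as follows. This is only an illustration under stated assumptions, not the package's actual code: the function names (`validate_schema`, `prepare_for_write`, `infer_datatype`), the `ValidationError` class, the top-level `"_variables"` key, and the use of plain Python lists with `"int64"`/`"float64"` type names are all hypothetical, and only one-dimensional variables are handled.

```python
# Hypothetical sketch: names and template layout are assumptions for illustration.
SPECIAL_VARIABLE_KEYS = {"_dimensions", "_datatype", "_data"}


class ValidationError(Exception):
    """Raised when a template fails a check."""


def validate_schema(template):
    """Stage 1 (template creation): structural checks only.

    An incomplete or inconsistent template is still acceptable here.
    """
    if not isinstance(template.get("_dimensions", {}), dict):
        raise ValidationError("'_dimensions' must be a dict")
    variables = template.get("_variables", {})
    if not isinstance(variables, dict):
        raise ValidationError("'_variables' must be a dict")
    for name, var in variables.items():
        if not isinstance(var, dict):
            raise ValidationError(f"{name}: variable definition must be a dict")
        unknown = {k for k in var if k.startswith("_")} - SPECIAL_VARIABLE_KEYS
        if unknown:
            raise ValidationError(f"{name}: unknown special keys {sorted(unknown)}")


def infer_datatype(data):
    """Guess a type name from a flat list of Python numbers."""
    kinds = {type(value) for value in data}
    if kinds == {int}:
        return "int64"
    if kinds and kinds <= {int, float}:
        return "float64"
    raise ValidationError("cannot infer a datatype from the data")


def prepare_for_write(template):
    """Stage 2 (just before writing): schema, completeness, consistency."""
    validate_schema(template)  # 1. validate the schema again
    dims = template.setdefault("_dimensions", {})
    for name, var in template.get("_variables", {}).items():
        # 2. completeness: required keys present, fill in what we can
        if "_data" not in var:
            raise ValidationError(f"{name}: no '_data' set")
        if "_dimensions" not in var:
            raise ValidationError(f"{name}: no '_dimensions' set")
        actual = infer_datatype(var["_data"])
        declared = var.setdefault("_datatype", actual)
        # 3. consistency between '_datatype' and the data
        # (strict equality here; a real check might allow safe casts)
        if declared != actual:
            raise ValidationError(f"{name}: '_datatype' is {declared}, data is {actual}")
        # 3. consistency between variable and dataset dimensions
        if len(var["_dimensions"]) != 1:
            raise ValidationError(f"{name}: this sketch only handles 1-D variables")
        dim = var["_dimensions"][0]
        if dim not in dims:
            raise ValidationError(f"{name}: dimension {dim} not defined for the dataset")
        if dims[dim] is None:
            dims[dim] = len(var["_data"])  # auto-update size from the data
        elif dims[dim] != len(var["_data"]):
            raise ValidationError(f"{name}: data length does not match size of {dim}")
    return template
```

With this split, a template such as `{"_dimensions": {"TIME": None}, "_variables": {"TEMP": {"_dimensions": ["TIME"]}}}` passes `validate_schema` even though it has no data yet, while `prepare_for_write` rejects it until `"_data"` is assigned.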