bcdev / nc2zarr

A Python tool that converts NetCDF files to Zarr format
MIT License

Add validation feature #17

Open pont-us opened 3 years ago

pont-us commented 3 years ago

It would be useful to have an output validation option for nc2zarr, along the lines of

nc2zarr --config myconfig.yaml --validate

which would go through the input files specified in the configuration and make sure, for each data value corresponding to a variable specified in the configuration, that a matching data value exists in the configured output Zarr.

This would of course be potentially very time-consuming for large datasets; --structure-only (just check that the structure's as expected) and --sparse (validate some small but representative subset of the data) may also be useful.
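
To make the --sparse idea a little more concrete, here is a minimal sketch of one way a "small but representative subset" could be chosen: the corner points of each array plus a few reproducible random points. The function name, its parameters and the sampling strategy are all assumptions for illustration; nothing like this exists in nc2zarr.

```python
import itertools
import random


def sparse_sample_indices(shape, n_random=10, seed=0):
    """Return a small, sorted list of index tuples for an array of `shape`.

    Hypothetical helper, not part of nc2zarr: it combines the corner
    points of the array with up to `n_random` reproducible random points.
    """
    if any(size <= 0 for size in shape):
        return []  # nothing to sample in an empty array

    # Corner points: first and last index along every dimension.
    corners = set(itertools.product(*[{0, size - 1} for size in shape]))

    # Total number of elements, so we never ask for more points than exist.
    total = 1
    for size in shape:
        total *= size

    rng = random.Random(seed)  # fixed seed keeps the sample reproducible
    samples = set(corners)
    target = min(total, len(corners) + n_random)
    while len(samples) < target:
        samples.add(tuple(rng.randrange(size) for size in shape))
    return sorted(samples)
```

Each returned tuple could then be used to read the same element from the input NetCDF variable and from the output Zarr variable and compare the two. A fixed seed means a failing check can be re-run on exactly the same points.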

forman commented 3 years ago

I agree, this would be very useful, but I disagree with adding another handful of CLI options.

I suggest verifying that the output is readable through xr.open_dataset(path, **open_kwargs). We can configure this as a new top-level entry in the config. As a second step, we could later support assertions on the dataset content:

verify:
  open_kwargs:
    engine: "zarr"
    decode_cf: true
  assertions:
    attrs:
      ...
    data_vars:
      ...
    coord_vars:
      ...
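
A rough sketch of how the first step (just opening the output) might look, assuming the configuration has already been parsed into a dict. The function name verify_output, its return value and the opener parameter are hypothetical and only there for illustration; the only real API used is xr.open_dataset, which accepts engine and decode_cf keyword arguments.

```python
def verify_output(path, config, opener=None):
    """Try to open the converted dataset and return (ok, message).

    Hypothetical helper, not the nc2zarr implementation. `config` is the
    parsed configuration as a dict; `opener` defaults to xr.open_dataset
    and exists only so that the function can be exercised without files.
    """
    if opener is None:
        import xarray as xr  # imported lazily; only needed for real datasets
        opener = xr.open_dataset

    # Assumption: a missing "verify" entry means "open with defaults".
    verify = config.get("verify") or {}
    open_kwargs = dict(verify.get("open_kwargs") or {})

    try:
        dataset = opener(path, **open_kwargs)
    except Exception as error:  # any failure to open counts as "not verified"
        return False, f"cannot open {path!r}: {error}"

    # Release file handles if the returned object supports it.
    close = getattr(dataset, "close", None)
    if callable(close):
        close()
    return True, f"opened {path!r} with {open_kwargs}"
```

With the config above, this would amount to calling xr.open_dataset(path, engine="zarr", decode_cf=True) and reporting whether it succeeded.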

forman commented 3 years ago

Note, I'd call the feature "verify" rather than "validate", because validate would also analyse the dataset's values which I feel is out-of-scope for nc2zarr.

forman commented 3 years ago

Another note:

pont-us commented 3 years ago

a --verify CLI flag could be used to perform a verification with xr.open_dataset(path) with defaults, even without a verify entry in config;

This is what I had in mind, at least initially: verification just using the existing configuration for input, by checking that every specified input value (or some representative sample) is also present in the output. Of course that doesn't rule out adding a more elaborate, configurable verification facility later.

On reflection, I agree about calling it "verify". "Validate" probably implies something a bit deeper than just checking input.var[lat, lon, t] == output.var[lat, lon, t] for all variables and co-ordinates.
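
For illustration, the kind of check I mean could be sketched with xarray roughly as follows. The function name find_mismatches and the toy "sst" dataset are made up for this example, and the sketch assumes that the input and the output cover exactly the same coordinates (in practice the output Zarr would usually have to be subset to the time range of each input file first).

```python
import numpy as np
import xarray as xr


def find_mismatches(source, output, var_names=None):
    """Compare variables of one input dataset with the converted output.

    Hypothetical helper, not part of nc2zarr. Assumes both datasets
    cover the same coordinates; returns a list of problem messages.
    """
    # By default check every variable, including coordinate variables.
    names = list(source.variables) if var_names is None else list(var_names)
    problems = []
    for name in names:
        if name not in output.variables:
            problems.append(f"{name}: missing from output")
        elif source[name].shape != output[name].shape:
            problems.append(
                f"{name}: shape {source[name].shape} != {output[name].shape}"
            )
        # DataArray.equals treats NaNs in the same locations as equal.
        elif not source[name].equals(output[name]):
            problems.append(f"{name}: values differ")
    return problems


# Toy usage: an "input" dataset and an "output" with one corrupted value.
dims = ("time", "lat", "lon")
data = np.arange(24.0).reshape(2, 3, 4)
source = xr.Dataset(
    {"sst": (dims, data)},
    coords={"time": [0, 1], "lat": [10.0, 20.0, 30.0], "lon": [0.0, 1.0, 2.0, 3.0]},
)
corrupted = data.copy()
corrupted[0, 0, 0] = -1.0
output = source.assign(sst=(dims, corrupted))

print(find_mismatches(source, output))
```

For real data, source would come from xr.open_dataset on an input NetCDF file and output from opening the Zarr; comparing whole variables at once lets xarray and NumPy do the element-wise work rather than looping over every [lat, lon, t] index in Python.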

forman commented 3 years ago

Hi @pont-us, I've started an implementation; you may already have a look.

pont-us commented 3 years ago

Branch containing implementation: https://github.com/bcdev/nc2zarr/tree/forman-17-verify_dataset

pont-us commented 3 years ago

Probably not relevant to the main verification / validation implementation, but just for reference: I've committed a small standalone validation script, which I'm using to validate the converted Zarrs against selected source NetCDFs for the next deliverable.