Fix provenance logging - Githubissues

mattiarighi commented 6 years ago

The provenance logging based on the NCL procedure inquire_and_save_fileinfo in interface_scripts/logging.ncl is currently broken. This procedure uses specific global attributes in the preprocessed NetCDF file (fixfile, version, infile_NNNN, tracking_id, and reference) to document the provenance of all input data and write it in the NCL log file.

The problem is that the new preprocessor does not write such attributes in the output file. This functionality should be reimplemented.
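
To make the missing functionality concrete, here is a minimal sketch (not the NCL procedure itself) of the kind of log record built from those global attributes. It assumes the attributes have already been read into a plain dict and that the NNNN in infile_NNNN is a run of digits; the function name format_provenance_log and the example values are hypothetical.

```python
import re

# Attribute names taken from the issue description above.
SINGLE_ATTRIBUTES = ("fixfile", "version", "tracking_id", "reference")


def format_provenance_log(global_attrs):
    """Return log lines documenting the provenance attributes of one file."""
    lines = []
    for name in SINGLE_ATTRIBUTES:
        # A missing attribute is reported explicitly rather than skipped.
        lines.append(f"{name}: {global_attrs.get(name, '(not set)')}")
    # Assumption: infile_NNNN attributes carry a numeric suffix, one per input file.
    infile_keys = sorted(
        key for key in global_attrs if re.fullmatch(r"infile_\d+", key)
    )
    for key in infile_keys:
        lines.append(f"{key}: {global_attrs[key]}")
    return lines


# Hypothetical attributes of a preprocessed file.
attrs = {
    "version": "example-version",
    "infile_0001": "input_b.nc",
    "infile_0000": "input_a.nc",
}
log_lines = format_provenance_log(attrs)
print("\n".join(log_lines))
```

With a file written by the new preprocessor, every entry would come out as "(not set)", which is the breakage described here.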

bouweandela commented 6 years ago

It would probably be good if we could get the main esmvaltool program to take care of writing provenance information as much as possible, so diagnostic script developers need to spend minimal effort on this. That would probably mean that the preprocessor should write its own provenance information. Of course that information will then be made available to any diagnostic scripts subsequently using the preprocessed data.

There is also a standard available for provenance information, and it would probably be good to start using it: https://www.w3.org/TR/prov-overview/

Maybe @nielsdrost also has good ideas on how to tackle this topic?

nielsdrost commented 6 years ago

Yes, writing prov-o from the main esmvaltool program is exactly the approach I envision. Will produce a better write-up for others to comment on.

bouweandela commented 6 years ago

Possibly related issues: #229 and #278

bouweandela commented 6 years ago

Link to provenance design document produced by @nielsdrost.

nielsdrost commented 6 years ago

The discussion (via the document) has not reached agreement on whether (and how) provenance info should be embedded. I propose we write it as a separate prov xml file for now.

bouweandela commented 6 years ago

Main conclusion from telco on the interface between workflow manager (esmvaltool main program) and diagnostic scripts:

The workflow manager will keep track of provenance
Diagnostic scripts will provide a file (preferably in yaml format) containing a list of output files (figures, i.e. plots/netcdf files), giving for each output file the names of the input files used to create it and any other tags/captions/etc. needed. The exact format and any useful helper library functions are still to be defined.
The workflow manager will use the list of output files (figures) provided by the diagnostic to write the full provenance information into those files (the exif header of plots, or a provenance attribute in netcdf files) AND write a separate file containing a list of all figures including their provenance.
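
As an illustration of the per-output-file list a diagnostic script would hand to the workflow manager, here is a minimal sketch. Since the exact format is not defined yet, the field names (input_files, caption, tags), the function name describe_outputs and all file names are hypothetical. It serializes with the standard-library json module only to stay dependency-free; JSON is a subset of YAML 1.2, so a YAML reader could load the result.

```python
import json


def describe_outputs(records):
    """Map each output file to the inputs and metadata used to create it."""
    listing = {}
    for output_file, input_files, caption, tags in records:
        listing[output_file] = {
            "input_files": sorted(input_files),  # hypothetical field name
            "caption": caption,                  # hypothetical field name
            "tags": list(tags),                  # hypothetical field name
        }
    return listing


# One hypothetical figure produced from two preprocessed files.
listing = describe_outputs([
    (
        "plots/example_timeseries.png",
        ["preproc/model_b_tas.nc", "preproc/model_a_tas.nc"],
        "Hypothetical caption for an example figure.",
        ["example_tag"],
    ),
])
serialized = json.dumps(listing, indent=2, sort_keys=True)
print(serialized)
```

Keeping the list keyed by output file means the workflow manager can look up the provenance of each figure without knowing anything about the diagnostic that produced it.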

First step to start implementation:

Add tags/captions/etc in perfmetrics namelist so we have an example that should be made working.

axel-lauer commented 6 years ago

Here is a brief overview of how provenance is handled in version 1.1.0. Maybe we could use at least the kind of meta data created in v1.1.0 as a starting point. The actual meta data consists of so-called "tags" (e.g. R_atmos, P_crescendo, PT_geo). These tags can be translated to more human readable information using the lists defined in the file doc/MASTER_authors-refs-acknow.txt. The meta data is passed through from the backend/interface layer to the diagnostics. The diagnostics then add more meta data and finally call an interface function that collects everything and writes the information to the exif headers of the individual figures (.png) and/or to separate .xml file(s). In v1.1.0, the tags are defined in different parts of the ESMValTool:

Backend

list of all input file(s) processed to write a given preprocessor (climo) file; information attached as attribute(s) to that file

Interface layer

time/date stamp
software versions (currently only Python + ESMValTool)
filename of namelist
tracking ID(s) (if available) of all original input files (created by going through the list of input files stored by the backend in each preprocessor file; if defined, tracking id is a global attribute in the original input file(s))

Namelist

main reference(s)
project(s)
CMIP realm(s) (e.g. atmos, ocean, land)
theme(s) (e.g. aerosols, clouds, chemistry)

Diagnostic (per plot, each figure is written to a separate file)

list of variable(s)
list of model(s)
list of preprocessor file(s)
author(s) of the diagnostic
reference(s) for the diagnostic (these are not necessarily the same as the "main reference(s)" defined in the namelist)
domain (e.g. global, northern hemisphere, Europe)
plot type (e.g. time series, taylor diagram, bar chart)
list of metrics / statistics calculated (e.g. climatology, rmsd, correlation)
figure caption
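
The tag translation step described at the start of this comment can be sketched as a simple lookup. This is only an illustration: the real descriptions live in doc/MASTER_authors-refs-acknow.txt, whose format is not reproduced here, so the dictionary below uses placeholder text, and translate_tags is a hypothetical name.

```python
# Placeholder descriptions; the real ones are defined in
# doc/MASTER_authors-refs-acknow.txt and are NOT reproduced here.
TAG_DESCRIPTIONS = {
    "R_atmos": "(placeholder description for R_atmos)",
    "P_crescendo": "(placeholder description for P_crescendo)",
    "PT_geo": "(placeholder description for PT_geo)",
}


def translate_tags(tags, descriptions):
    """Return human-readable text per tag; unknown tags are kept unchanged."""
    return [descriptions.get(tag, tag) for tag in tags]


readable = translate_tags(["R_atmos", "PT_geo", "X_unknown"], TAG_DESCRIPTIONS)
print(readable)
```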

bouweandela commented 6 years ago

We will probably make use of the prov library to write the provenance information to XML.
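
To give an idea of what such a file could contain, here is a minimal sketch that records "output file was derived from input file" in a PROV-XML-style document. Note that it does not use the prov library: it builds the XML with the standard-library xml.etree.ElementTree so it runs without extra dependencies. The ex: prefix, its URI, the file names and the function name build_prov_document are hypothetical, and the result is simplified and not validated against the PROV-XML schema.

```python
import xml.etree.ElementTree as ET

PROV_NS = "http://www.w3.org/ns/prov#"  # W3C PROV namespace
ET.register_namespace("prov", PROV_NS)


def q(name):
    """Return the namespace-qualified name of a PROV element or attribute."""
    return f"{{{PROV_NS}}}{name}"


def build_prov_document(output_file, input_files):
    """Describe output_file as an entity derived from each input file."""
    root = ET.Element(q("document"))
    # Hypothetical example namespace for the identifiers used below.
    root.set("xmlns:ex", "http://example.org/")
    ET.SubElement(root, q("entity"), {q("id"): output_file})
    for input_file in input_files:
        ET.SubElement(root, q("entity"), {q("id"): input_file})
        derivation = ET.SubElement(root, q("wasDerivedFrom"))
        ET.SubElement(derivation, q("generatedEntity"), {q("ref"): output_file})
        ET.SubElement(derivation, q("usedEntity"), {q("ref"): input_file})
    return root


root = build_prov_document("ex:figure.png", ["ex:preprocessed_tas.nc"])
xml_text = ET.tostring(root, encoding="unicode")
print(xml_text)
```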

LisaBock commented 6 years ago

We made a first suggestion for the structure of the provenance with the prov library. The visualization could look like this: (image: article-prov)

nielsdrost commented 6 years ago

Awesome! Definitely looks like the way to go. This structure should fit all the information we need.

bouweandela commented 6 years ago

Work on this issue is done in the version2_provenance branch.

ESMValGroup / ESMValTool

Fix provenance logging #240