huard commented 4 years ago

We're building processes that operate on netCDF files, and we are wondering what the best practices are for metadata handling.

Imagine the following case:

input 1 -> process A \
                       process B -> output
input 2 -> process A /

Let's say, for example, that inputs 1 and 2 have CMIP-like metadata. What should happen to this metadata as it goes through processes A and B? I suppose that different attributes would go through one of:

Aggregation
Updates
Replacement
Deletion

For CMIP datasets we could save the PIDs from the original files, but if an input is observed data, it might not have a PID or DOI. Anyway, curious to know how to approach this systematically so that it scales to large ensembles and complex workflows with multiple processes.

Proposal (edited post-workshop)

Include an optional has_provenance attribute to global and/or variable attributes.
has_provenance would store a URL to a provenance provider, either a simple link to a file or a service able to serve machine-readable provenance information in different file formats (XML, JSON, HTML). Unique identifiers would ensure there is no ambiguity in the relationship between netCDF files and their provenance information.
The provenance file would store detailed information about input netCDF files and operations leading to this file or variable. It would complement, not replace, information already stored in netCDF metadata.
This provenance information would be encoded using an existing standard ontology, e.g. PROV, that could be augmented with user-defined domain ontologies (e.g. MetaClip).
CF would recommend setting provenance URLs that are likely to be valid over the long term, such as DOIs.
The provenance file would not be meant to enable reproducibility, as this is a much harder problem.
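
As a rough illustration of the proposal, here is a minimal sketch of what the external provenance record and the has_provenance attribute could look like for the two-input workflow drawn at the top of this issue. It uses only the Python standard library and loosely follows the general layout of PROV-JSON (top-level "entity", "activity", "used" and "wasGeneratedBy" maps). The example.org URL, the "ex:" identifiers and the helper names are all hypothetical, and the netCDF global attributes are represented by a plain dict rather than written to a real file.

```python
import json

# Hypothetical URL where the provenance record would be published.
PROVENANCE_URL = "https://example.org/provenance/output-0001.json"


def link(doc, activity, inputs, output):
    """Record that `activity` used each of `inputs` and generated `output`."""
    for entity in inputs:
        key = f"_:u{len(doc['used']) + 1}"
        doc["used"][key] = {"prov:activity": activity, "prov:entity": entity}
    key = f"_:g{len(doc['wasGeneratedBy']) + 1}"
    doc["wasGeneratedBy"][key] = {"prov:entity": output, "prov:activity": activity}


def build_provenance():
    """Describe: input 1 and input 2 -> process A (twice) -> process B -> output."""
    entities = ("ex:input1", "ex:input2", "ex:a_out1", "ex:a_out2", "ex:output")
    activities = ("ex:processA_run1", "ex:processA_run2", "ex:processB")
    doc = {
        "prefix": {"ex": "https://example.org/"},
        "entity": {name: {} for name in entities},
        "activity": {name: {} for name in activities},
        "used": {},
        "wasGeneratedBy": {},
    }
    link(doc, "ex:processA_run1", ["ex:input1"], "ex:a_out1")
    link(doc, "ex:processA_run2", ["ex:input2"], "ex:a_out2")
    link(doc, "ex:processB", ["ex:a_out1", "ex:a_out2"], "ex:output")
    return doc


provenance = build_provenance()
provenance_json = json.dumps(provenance, indent=2)  # content served at PROVENANCE_URL

# Stand-in for the global attributes of the output netCDF file:
# "history" stays a short human-readable summary, has_provenance only holds the link.
global_attributes = {
    "history": "process A applied to inputs 1 and 2, then process B",
    "has_provenance": PROVENANCE_URL,
}
```

The point of the split is that the netCDF header stays small and readable, while the parallel structure of the workflow (two runs of process A feeding process B) is explicit in the external record.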

Rationale

When creating derived data products from large ensembles or complex chains of algorithms, it is difficult to describe the full data lineage using existing attributes (e.g. history), as they are not meant to capture complex operations (e.g. parallel vs serial operations). Even if it were possible to write all this information, the result would be neither human-readable, due to its length, nor machine-readable, due to the lack of a formal grammar.

Use cases

Detecting bugs in processes by inspecting the provenance file
Identifying which version of which tool was used to generate netCDF data
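
To make these two use cases concrete, here is a small sketch of how a machine-readable record could be queried. It assumes a PROV-JSON-style dict in which each activity carries "ex:tool" and "ex:version" attributes; those attribute names, the identifiers and both helper functions are hypothetical and only meant to show the kind of query that a free-text history attribute cannot support.

```python
# Hypothetical provenance record for a regridding step followed by bias adjustment.
record = {
    "activity": {
        "ex:regrid": {"ex:tool": "regridder", "ex:version": "1.2.0"},
        "ex:bias_adjust": {"ex:tool": "adjuster", "ex:version": "0.9.1"},
    },
    "used": {
        "_:u1": {"prov:activity": "ex:regrid", "prov:entity": "ex:model_output"},
        "_:u2": {"prov:activity": "ex:bias_adjust", "prov:entity": "ex:regridded"},
        "_:u3": {"prov:activity": "ex:bias_adjust", "prov:entity": "ex:observations"},
    },
    "wasGeneratedBy": {
        "_:g1": {"prov:entity": "ex:regridded", "prov:activity": "ex:regrid"},
        "_:g2": {"prov:entity": "ex:adjusted", "prov:activity": "ex:bias_adjust"},
    },
}


def tool_versions(doc):
    """Return the sorted (tool, version) pairs of all recorded activities."""
    return sorted(
        {(attrs["ex:tool"], attrs["ex:version"]) for attrs in doc["activity"].values()}
    )


def source_entities(doc, entity):
    """Return the entities `entity` ultimately depends on that no activity generated."""
    generated_by = {
        g["prov:entity"]: g["prov:activity"] for g in doc["wasGeneratedBy"].values()
    }
    inputs_of = {}
    for u in doc["used"].values():
        inputs_of.setdefault(u["prov:activity"], []).append(u["prov:entity"])

    sources, seen, stack = set(), set(), [entity]
    while stack:
        current = stack.pop()
        if current in seen:
            continue
        seen.add(current)
        activity = generated_by.get(current)
        if activity is None:
            sources.add(current)  # nothing generated it, so it is an original input
        else:
            stack.extend(inputs_of.get(activity, []))
    return sorted(sources)


versions = tool_versions(record)
sources = source_entities(record, "ex:adjusted")
```

With a record like this, finding every file affected by a buggy tool version becomes a lookup on the activities rather than a text search through history strings.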

huard commented 4 years ago

One solution to this issue would be for CF to adopt a "provenance" standard (e.g. PROV) and store a machine-readable provenance string in global or variable attributes.

JonathanGregory commented 4 years ago

CF has attributes which could be used for recording that kind of history, but their contents are not standardised, such as source, history and comment. They are in section 2.6.2 of the standard.

huard commented 4 years ago

Yes, but they are meant to be human-readable. I think using them to store XML would abuse their intended meaning. Also, I think it would cause all kinds of problems for frontends that would have to check whether these attributes are human or machine readable before displaying them.

sfoucher commented 4 years ago

Maybe the new OGCApi Record could be useful here (https://github.com/opengeospatial/ogcapi-records) at least for observations such as satellite imagery. In the metadata of B you could point to records from inputs 1 and 2 for instance.

cofinoa commented 4 years ago

@huard, in addition to the human-readable attributes, the cell_methods variable attribute also provides a machine-friendly description of processes (i.e. aggregation). In fact, all of section 7 (http://cfconventions.org/Data/cf-conventions/cf-conventions-1.8/cf-conventions.html#_data_representative_of_cells) defines metadata about processes applied to data. IMO, the "provenance" proposal should consider those as alternative or supplementary metadata about processes that have been applied to data.

BTW, @larsbarring is also working on encoding metadata about processes that have been applied to data, in particular the computation of climate indices. It may be useful to have his opinion on this.

davidhassell commented 4 years ago

Just highlighting the synergy: Lars's issue on climate indices (https://github.com/cf-convention/discuss/issues/371) is also going to be discussed at the CF meeting next week, as well as this one.

huard commented 4 years ago

Agreed, this is not meant to replace existing attributes, but as you suggest, as supplementary information about processes.

To clarify, a use case could be to describe operations leading to the creation of a map of forest fire under future climate conditions. The provenance would list the source data from say 20 climate model projections, a regridding algorithm used to interpolate them on a common grid, a bias-adjustment algorithm and an observation dataset used to remove model biases, details about the fire model used to process the meteorological input, and then statistics applied to the ensemble of fire projections.

Although it's possible to capture the details of all these processes into a "history" field, my experience is that it becomes unreadable. So I see "history" as being a high-level human readable synthesis of operations, and provenance as the low-level machine-readable full description of operations and source files.

larsbarring commented 4 years ago

I think that the connection between the two issues is in the second bulleted point of the breakout intro:

"...the more far-reaching need for a new or alternative mechanism that allows for a more flexible description of more complex and/or multi-step temporal processing of data."

But I also think that it would be difficult to capture all the relevant aspects that you mention within a framework that could be thought of as an extension of the CF Conventions. If I understand what you are aiming for, it is a far-reaching and open-ended need.

davidhassell commented 4 years ago

Hello,

The timings and order of the breakout groups for the CF meeting next week have now been set (see http://cfconventions.org/Meetings/2020-Workshop.html), and the discussion of this issue will be on Thursday 11 June from 16:00-17:30 UTC, in parallel with three other topics.

Thanks.

ngalbraith commented 4 years ago

Regarding a new 'provenance' attribute, I'd like to suggest that we might consider allowing this metadata to be stored as a variable, instead of an attribute.

This gives the data producer more ability, when combining datasets along geographical or depth axes, to easily keep the provenance metadata with the correct part of the data, using depth or x/y indexes - hard to do this in an attribute. E.g. the salinity below a certain depth might have pressure corrections applied differently from other depths ... documenting that in a 2D salinity file using variable attributes (especially in a machine-readable form) is pretty much impossible to do through automatic processes.

Also, sometimes it's appropriate to provide provenance as a data variable attribute, and sometimes it's more appropriate as a global, but in a situation where a single term appears as both a global and a variable attribute, most software does not behave as expected (according to an old discussion on the CF list).
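
The depth-indexed idea described above could look roughly like the following sketch. It uses plain Python lists as stand-ins for netCDF variables and assumes one possible layout: an integer provenance variable sharing the depth dimension with the data, where each value indexes into a small table of provenance record URLs. The variable names, URLs and depth values are hypothetical, and nothing here is an existing CF mechanism.

```python
# Stand-ins for netCDF variables that share a "depth" dimension.
depth = [0.0, 10.0, 50.0, 200.0, 1000.0]  # metres

# Hypothetical table of provenance records (e.g. URLs to PROV documents).
provenance_records = [
    "https://example.org/prov/standard-correction",
    "https://example.org/prov/deep-pressure-correction",
]

# One index per depth level: the two deepest levels had a different
# pressure correction applied than the shallower ones.
salinity_provenance = [0, 0, 0, 1, 1]


def provenance_at(depth_value):
    """Return the provenance record that applies at the given depth level."""
    level = depth.index(depth_value)  # raises ValueError for an unknown depth
    return provenance_records[salinity_provenance[level]]


shallow = provenance_at(10.0)
deep = provenance_at(200.0)
```

Because the provenance values live on the same axis as the data, subsetting or concatenating along depth would carry the right provenance along automatically, which is hard to guarantee with a single attribute string.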

huard commented 4 years ago

Edited the proposal to include this idea.

bouweandela commented 4 years ago

@huard asked me to provide some of our experiences as input for this discussion. We have been recording provenance with ESMValTool, a tool for reproducing figures from published papers, with data produced for the Coupled Model Intercomparison Project (CMIP). Our input data typically adhere (or try to adhere) to both the CF conventions and the CMIP data request. We use the prov library to record provenance information according to the W3C PROV standard, and most code related to recording provenance can be found in this file.

For each figure produced by the tool, the provenance record contains things like

all global attributes of the input NetCDF files
various settings related to the way the data is processed and the figures are produced
authors and references
version of the software

The provenance is recorded as the program runs and finally serialized to PROV-XML (though the prov library also supports other formats, so this would be easy to change should e.g. PROV-JSON prove to be more useful at some point) and stored in

A .xml file saved to the same location as the figure
If the data for the figure is available in a NetCDF file, we also store the same provenance in a global NetCDF attribute called provenance
If the figure is a .png file, we also store the same provenance in the ImageHistory attribute

This strategy of both embedding the provenance into the files as well as saving it in a separate file stems from the fact that there seems to be no consensus on what is the best way to store provenance, so we opted for doing both.

Here is an XML file produced with ESMValTool (renamed to .txt so GitHub allows me to attach it) as an example: MultiModelMean_Amon_ta_2000-2001_mean_provenance.xml.txt

A downside of storing provenance in a global attribute of the NetCDF file is that the output of a command like ncdump -h some_file.nc becomes very large and crowded with provenance information in the hard-to-read XML format. Of course, it's fairly easy to work around this by using a command like ncdump -h some_file.nc | grep -v provenance.

Please note that I'm no expert on provenance, so suggestions for improvement of our implementation would be quite welcome, e.g. by commenting on issue https://github.com/ESMValGroup/ESMValCore/issues/29.

zklaus commented 4 years ago

Another effort that might be related is OGC netCDF-LD, a draft standard to encode RDF in netCDF. This could potentially be used to store PROV-O information directly in netCDF.

huard commented 4 years ago

Thanks @zklaus,

I agree it would work, but I'm somewhat skeptical about its adoption by scientists. The syntax is not easy on the eye and I think all these "bald__" attributes would cause confusion for users. So although I think we should consider it in this discussion, I'd like to propose something that does not even try to be human-readable.

zklaus commented 4 years ago

@huard, then XML is definitely your go-to ;)

Yeah, I only meant to bring it up for consideration, particularly since OGC seems to be something that comes up in several places here and since it seems that @ethanrd is involved (see OGC netCDF SWG).

huard commented 4 years ago

Totally agree that whatever is chosen should be OGC-compliant.

jbedia commented 4 years ago

Hi everyone, thanks for this interesting discussion. Following the meeting held yesterday, here is a small sample of an RDF-JSON file containing the provenance information of a CMIP5 ensemble map of a bias-adjusted climate index based on maximum temperature. For simplicity, the ensemble is formed by only two GCMs. https://drive.google.com/file/d/1ij-JKNF2lFidbM823jwDQcOxDz4Ntx4m/view?usp=sharing

It can, for instance, be dropped onto the www.metaclip.org interpreter to be explored, but its complexity, with just two models, is huge. With an ensemble of ~30 models it is not possible to explore the fully detailed provenance information visually, so the information needs to be collapsed or summarized somehow. We are currently working on that part. An advantage of such an RDF provenance representation is that it allows the provenance information to be exposed at different levels of granularity. It also ensures interoperability. The RDF graph can be serialized to many different formats, not just JSON but also, for instance, XML. There are some examples of climate products available at www.metaclip.org with attached provenance information that are automatically displayed in the viewer. However, the provenance information can also be linked externally via a reference attribute, as recommended for netCDF/NcML files. Comments and suggestions are very welcome.

huard commented 4 years ago

Trying to push the idea of looking into provenance in OGC Testbed 17: https://github.com/opengeospatial/ideas/issues/113

cf-convention / discuss

Metadata handling through processes #33
