NCEAS / datateam-training

Training and reference materials for ADC and SASAP data team members
https://nceas.github.io/datateam-training/training/
Apache License 2.0

Adding provenance information to metadata records #230

Open cwbeltz opened 3 years ago

cwbeltz commented 3 years ago

We could begin adding provenance information to metadata records during curation, targeting two of the provenance FAIR checks: "Provenance Process Stepcode Present" and "Provenance Trace Present". The paths within the EML record are included below, along with a link to the check itself within NCEAS/metadig-checks/src/checks.

Provenance Process Stepcode Present: /eml/*/methods/methodStep/software//text()[normalize-space()]

Provenance Trace Present: /eml//methods/methodStep/dataSource or /eml//methods/methodStep/subStep/dataSource
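For anyone who wants to test a record against these two paths during curation, here is a minimal Python sketch. It is not the metadig-checks implementation; it only approximates the paths as written above using the standard library, whose XPath subset has no `text()` or `normalize-space()`, so the non-empty-text test is done by hand. The function names and the sample record are hypothetical.

```python
import xml.etree.ElementTree as ET


def local_name(tag):
    """Strip a '{namespace}' prefix from an ElementTree tag, if present."""
    return tag.rsplit("}", 1)[-1]


def has_process_step_code(root):
    """Approximates /eml/*/methods/methodStep/software//text()[normalize-space()]."""
    for software in root.findall("./*/methods/methodStep/software"):
        # True if any text node under <software> contains non-whitespace.
        if any(chunk.strip() for chunk in software.itertext()):
            return True
    return False


def has_provenance_trace(root):
    """Approximates the two dataSource paths listed for the trace check."""
    paths = (
        ".//methods/methodStep/dataSource",
        ".//methods/methodStep/subStep/dataSource",
    )
    return any(root.find(path) is not None for path in paths)


def check_provenance(eml_text):
    """Return the result of both checks for one EML document (hypothetical helper)."""
    root = ET.fromstring(eml_text)
    if local_name(root.tag) != "eml":
        raise ValueError("root element is not <eml>")
    return {
        "process_step_code": has_process_step_code(root),
        "provenance_trace": has_provenance_trace(root),
    }


# Hypothetical, heavily trimmed record: it has a software element but no dataSource.
SAMPLE_EML = """<eml:eml xmlns:eml="https://eml.ecoinformatics.org/eml-2.2.0">
  <dataset>
    <methods>
      <methodStep>
        <description><para>Cleaned the raw counts.</para></description>
        <software><title>clean_counts.R</title></software>
      </methodStep>
    </methods>
  </dataset>
</eml:eml>"""

print(check_provenance(SAMPLE_EML))
```

For the sample record this reports the stepcode check as satisfied and the trace check as not satisfied, which is the situation where a curator would need to add a `dataSource` element.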

laijasmine commented 3 years ago

To clarify, is this trying to make sure more datasets have provenance so that we meet more of the FAIR checks? We already try to get provenance information from submitters when possible, but usually not all of this data or information is submitted. We might need to reframe how data submission is viewed in the community: not just the final data products, but also the path they take to get there.

jeanetteclark commented 3 years ago

The goal is that, for whatever datasets we add provenance to, we also add that information into the EML itself in the locations specified above. This is nice because that information will stay in the metadata record if the prov breaks, and it's also good for FAIR.

mbjones commented 3 years ago

Let's have a discussion about this approach before we implement. I've always been torn about duplicating information between the ORE and EML documents, and we have to concern ourselves with conflicting information if there are multiple sources.

When we originally designed PROV incorporation in ORE and semantic annotations in EML, we said that they could potentially be located in either the ORE, EML, or Sysmeta, and that our systems should be OK with info found in any of those. This really makes our packaging model more of a first-class citizen, and the FAIR checks should acknowledge that they are checking a data package and not just a single metadata file. This is highly appropriate for FAIR. We've also had other groups complain that our FAIR checker does not look for metadata in data granules (e.g., CF in NetCDF), and therefore some really well-documented data packages in DataONE get low scores. I think this holistic approach to FAIR evaluation would be welcomed all around, rather than treating metadata documents as standalone. Clearly this would require discussion.
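To make the package-level idea concrete for the discussion, here is a rough Python sketch of a trace check that passes if provenance is found in either the EML or the ORE resource map. It is an illustration only, not how the FAIR checker works today. It assumes the resource map is serialized as RDF/XML and that provenance appears as W3C PROV properties (`wasDerivedFrom`, `used`, `wasGeneratedBy`); Sysmeta and data granules are left out, and all function names and sample documents are hypothetical.

```python
import xml.etree.ElementTree as ET

PROV_NS = "http://www.w3.org/ns/prov#"
# Assumption: these PROV properties are what counts as a trace in the ORE.
PROV_PROPERTIES = ("wasDerivedFrom", "used", "wasGeneratedBy")


def eml_has_trace(eml_text):
    """True if the EML has a dataSource under a methodStep or its subStep."""
    root = ET.fromstring(eml_text)
    paths = (
        ".//methods/methodStep/dataSource",
        ".//methods/methodStep/subStep/dataSource",
    )
    return any(root.find(path) is not None for path in paths)


def ore_has_trace(ore_text):
    """True if the RDF/XML resource map contains any of the PROV properties."""
    root = ET.fromstring(ore_text)
    return any(
        root.find(f".//{{{PROV_NS}}}{name}") is not None
        for name in PROV_PROPERTIES
    )


def package_has_trace(eml_text=None, ore_text=None):
    """Check the package as a whole; report which documents carried the trace."""
    found_in = {}
    if eml_text is not None:
        found_in["eml"] = eml_has_trace(eml_text)
    if ore_text is not None:
        found_in["ore"] = ore_has_trace(ore_text)
    return any(found_in.values()), found_in


# Hypothetical package: provenance lives only in the resource map.
EML_NO_PROV = """<eml:eml xmlns:eml="https://eml.ecoinformatics.org/eml-2.2.0">
  <dataset><title>Example</title></dataset>
</eml:eml>"""

ORE_WITH_PROV = """<rdf:RDF
    xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
    xmlns:prov="http://www.w3.org/ns/prov#">
  <rdf:Description rdf:about="https://example.org/object/derived.csv">
    <prov:wasDerivedFrom rdf:resource="https://example.org/object/raw.csv"/>
  </rdf:Description>
</rdf:RDF>"""

passed, found_in = package_has_trace(EML_NO_PROV, ORE_WITH_PROV)
print(passed, found_in)
```

Returning the per-document breakdown alongside the overall result would also let a checker flag packages where the sources disagree, which is the conflict concern raised above.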