Provide specific implementation guidance for provenance metadata if a dataset is produced by a versioned business process/system

Creator Name: Joel Natividad
Creator Affiliation: datHere, Inc.

Requirement(s)

When a dataset is produced/exported by a versioned business process/system, the provided provenance metadata should include:

the name/unique identifier of the process/system (e.g. process - Onboarding; system - ckanext-datajson)
the URL of the process/system (e.g. process - https://wiki.exampleagency.gov/onboarding; system - https://github.com/GSA/ckanext-datajson)
the version of the process/system, using the Semantic Versioning 2.0.0 standard (e.g. 0.1.21). If a version number is not available and the system is version-controlled, provide its branch/tag or commit ID instead.
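
As an illustration only, the three items above could be captured in a small machine-readable record. The sketch below is not part of any DCAT-US schema: the `ProducedBy` class, its field names, the `version_type` labels, and the `describe_producer` helper are all hypothetical, and the regular expression is a simplified check for the Semantic Versioning 2.0.0 format rather than the official pattern.

```python
import re
from dataclasses import dataclass, asdict
from typing import Optional

# Simplified SemVer 2.0.0 check: MAJOR.MINOR.PATCH with optional
# pre-release and build metadata. Looser than the official regex
# on pre-release identifiers (assumption: good enough for a sketch).
SEMVER_RE = re.compile(
    r"^(0|[1-9]\d*)\.(0|[1-9]\d*)\.(0|[1-9]\d*)"
    r"(?:-[0-9A-Za-z-]+(?:\.[0-9A-Za-z-]+)*)?"
    r"(?:\+[0-9A-Za-z-]+(?:\.[0-9A-Za-z-]+)*)?$"
)


@dataclass
class ProducedBy:
    """Hypothetical provenance record for the producing process/system."""
    name: str          # name/unique identifier of the process/system
    url: str           # URL of the process/system
    version: str       # SemVer string, or a branch/tag/commit ID
    version_type: str  # "semver" or "vcs-ref" (hypothetical labels)


def describe_producer(name: str, url: str,
                      version: Optional[str] = None,
                      vcs_ref: Optional[str] = None) -> ProducedBy:
    """Prefer a valid SemVer version; otherwise fall back to a VCS reference."""
    if version is not None and SEMVER_RE.match(version):
        return ProducedBy(name, url, version, "semver")
    if vcs_ref:
        return ProducedBy(name, url, vcs_ref, "vcs-ref")
    raise ValueError("need a SemVer version or a branch/tag/commit ID")


# Example using the system named in the requirement.
record = describe_producer(
    "ckanext-datajson",
    "https://github.com/GSA/ckanext-datajson",
    version="0.1.21",
)
print(asdict(record))
```

Keeping the version type explicit lets a consumer tell a comparable SemVer string apart from an opaque branch, tag, or commit reference.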

Problem Statement

Datasets are often produced by repeatable business processes/workflows. These processes/workflows often have version numbers, and are also often automated/semi-automated.

To track the provenance of a dataset effectively and to increase its Reusability (R1.2) and, more importantly, its Reproducibility, it would help to have detailed, machine-readable provenance metadata about the process/system that produced it, beyond the free-text provenance statement.

Target Audience / Stakeholders

Everyone.

Intended Uses / Use Cases

Use 1. If errors are found in the data, trace the process/system that produced it to help find the root cause of the error (is it a defect in the process/system, or a data collection error?).
Use 2. Reproduce the dataset.
Use 3. Identify suspect datasets that need to be verified when a specific version of a system/process is found to have a defect/vulnerability.
Use 4. Identify suspect datasets that need to be recreated because they were created by a COTS package or a vendor that has been compromised.
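
To make Use 3 concrete, here is a minimal sketch of how a catalog consumer might flag datasets once structured provenance is available. The record layout (`dataset`, `produced_by`, `name`, `version`), the dataset identifiers, the version numbers, and the `suspect_datasets` helper are all hypothetical, chosen only for illustration.

```python
from typing import Dict, Iterable, List


def suspect_datasets(records: Iterable[Dict],
                     system_name: str,
                     bad_versions: Iterable[str]) -> List[str]:
    """Return IDs of datasets produced by a known-defective version of a system."""
    bad = set(bad_versions)
    return [
        r["dataset"]
        for r in records
        if r["produced_by"]["name"] == system_name
        and r["produced_by"]["version"] in bad
    ]


# Hypothetical catalog entries carrying machine-readable provenance.
catalog = [
    {"dataset": "dataset-a",
     "produced_by": {"name": "ckanext-datajson", "version": "0.1.21"}},
    {"dataset": "dataset-b",
     "produced_by": {"name": "ckanext-datajson", "version": "0.1.20"}},
    {"dataset": "dataset-c",
     "produced_by": {"name": "Onboarding", "version": "0.1.21"}},
]

# Suppose version 0.1.21 of ckanext-datajson were found to be defective.
print(suspect_datasets(catalog, "ckanext-datajson", ["0.1.21"]))
```

With only a free-text provenance statement, the same query would require reading each statement by hand; structured name and version fields reduce it to an exact match.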

DOI-DO / dcat-us