DOI-DO / dcat-us

Data Catalog Vocabulary (DCAT) - United States Profile Chief Data Officers Council & Federal Committee on Statistical Methodology

Provide specific implementation guidance for provenance metadata if a dataset is produced by a versioned business process/system #127

Closed jqnatividad closed 8 months ago

jqnatividad commented 11 months ago

Creator Name: Joel Natividad
Creator Affiliation: datHere, Inc.

Requirement(s)

When a dataset is produced/exported by a versioned business process/system, the provided provenance metadata should include:

Problem Statement

Datasets are often produced by repeatable business processes/workflows. These processes/workflows often have version numbers, and are also often automated/semi-automated.

To effectively track the provenance of datasets and increase their Reusability (R1.2) and, more importantly, their Reproducibility, it would be great to have detailed, machine-readable provenance metadata about the process/system that produced them, beyond the free-text provenance statement.

Target Audience / Stakeholders

Everyone.

Intended Uses / Use Cases

Use 1. If errors are found in the data, track down the process/system that produced it, to help find the root cause of the error (is it a defect in the process/system? A data collection error?)
Use 2. Reproduce the dataset.
Use 3. Identify suspect datasets that need to be verified if a specific version of a system/process is found to have a defect/vulnerability.
Use 4. Identify suspect datasets that need to be recreated because they were created by a COTS package or a vendor that has been compromised.
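As a rough illustration of Use 3, the sketch below filters catalog records by the system and version that produced them. The `DatasetRecord` fields and the `find_suspect_datasets` helper are hypothetical names chosen for this example; they are not part of DCAT-US, and the sketch assumes the system name and version have already been extracted from machine-readable provenance metadata.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DatasetRecord:
    # Hypothetical, flattened view of a catalog entry's provenance metadata.
    identifier: str
    system_name: str
    system_version: str


def find_suspect_datasets(records, system_name, bad_versions):
    """Return identifiers of datasets produced by a known-defective system version."""
    bad = set(bad_versions)
    return [
        r.identifier
        for r in records
        if r.system_name == system_name and r.system_version in bad
    ]


# Made-up records, for illustration only.
records = [
    DatasetRecord("ds-001", "permit-export", "2.3.0"),
    DatasetRecord("ds-002", "permit-export", "2.3.1"),
    DatasetRecord("ds-003", "survey-etl", "2.3.0"),
]
suspects = find_suspect_datasets(records, "permit-export", ["2.3.0"])
```

A query like this is only possible when the producing system and its version are recorded as structured fields rather than in a free-text provenance statement.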

fellahst commented 8 months ago

DCAT-US does offer properties that address provenance, as detailed in the usage guide. However, describing the workflow steps involved in creating a dataset in detail is beyond the scope of DCAT-US. For comprehensive provenance tracking, including specific workflow steps and versions, the PROV-O ontology is recommended, as it is compatible with the DCAT standard. PROV-O can express detailed, machine-readable provenance metadata, including the name/identifier, URL, and version of the process or system used. Additionally, dcat:qualifiedRelation within DCAT-US can be used to refer to the provenance document or data (with a provenance data role); see the Related resources section. This combination provides thorough provenance documentation and addresses your requirements for tracking errors, reproducing datasets, and identifying datasets that may be affected by system or process defects or vulnerabilities.
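To make the suggestion above more concrete, here is a minimal sketch of what such a record could look like as JSON-LD, built with plain Python. All IRIs under `example.gov`, the role IRI, and the `build_provenance_record` helper are hypothetical. Using `schema:softwareVersion` for the version string is an assumption made for this example, not something prescribed by DCAT-US or PROV-O. The PROV-O terms (`prov:wasGeneratedBy`, `prov:wasAssociatedWith`, `prov:SoftwareAgent`) and the DCAT terms (`dcat:qualifiedRelation`, `dcat:Relationship`, `dcat:hadRole`) are used as defined in their respective vocabularies.

```python
import json

# Hypothetical identifiers, used only for illustration.
DATASET_IRI = "https://example.gov/dataset/permits-2024"
ACTIVITY_IRI = "https://example.gov/provenance/run/2024-06-01"
SYSTEM_IRI = "https://example.gov/systems/permit-export"
PROV_DOC_IRI = "https://example.gov/provenance/permits-2024.ttl"
ROLE_IRI = "https://example.gov/roles/provenance-data"


def build_provenance_record(system_name, system_version):
    """Return a JSON-LD dict linking a dataset to the system that produced it."""
    return {
        "@context": {
            "dcat": "http://www.w3.org/ns/dcat#",
            "dcterms": "http://purl.org/dc/terms/",
            "prov": "http://www.w3.org/ns/prov#",
            "schema": "https://schema.org/",
        },
        "@id": DATASET_IRI,
        "@type": "dcat:Dataset",
        # PROV-O: the dataset was generated by one run of the process.
        "prov:wasGeneratedBy": {
            "@id": ACTIVITY_IRI,
            "@type": "prov:Activity",
            # The run was carried out by a specific system (its IRI is the URL).
            "prov:wasAssociatedWith": {
                "@id": SYSTEM_IRI,
                "@type": "prov:SoftwareAgent",
                "dcterms:title": system_name,
                # Assumption: a schema.org term carries the version string.
                "schema:softwareVersion": system_version,
            },
        },
        # DCAT: point at a separate, fuller provenance document with a role.
        "dcat:qualifiedRelation": {
            "@type": "dcat:Relationship",
            "dcterms:relation": {"@id": PROV_DOC_IRI},
            "dcat:hadRole": {"@id": ROLE_IRI},
        },
    }


record = build_provenance_record("Permit Export", "2.3.1")
serialized = json.dumps(record, indent=2)
```

The inline `prov:` statements cover the name, URL, and version of the producing system, while the `dcat:qualifiedRelation` entry leaves room for a more detailed PROV-O document describing individual workflow steps.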