Closed jqnatividad closed 8 months ago
DCAT-US does offer properties that address provenance, as detailed in the usage guide. However, a detailed description of the workflow steps involved in creating a dataset is beyond the scope of DCAT-US. For comprehensive provenance tracking, including specific workflow steps and versions, the PROV-O ontology is recommended, as it is compatible with the DCAT standard. PROV-O can express detailed, machine-readable provenance metadata, including the name/identifier, URL, and version of the process or system used. Additionally, dcat:qualifiedRelation within DCAT-US can be used to refer to the provenance document or data (with a provenance data role). See the Related resources section. This combination ensures thorough documentation of the provenance, addressing your requirements for tracking errors, reproducing datasets, and identifying datasets that may be affected by system or process defects or vulnerabilities.
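As a rough illustration of the combination described above, here is a minimal Python sketch that builds a JSON-LD document using only the standard library. It is not an official DCAT-US profile. The function name `build_provenance`, all `example.org` URLs, and the role IRI are hypothetical, and the use of `foaf:name`, `foaf:homepage`, and `owl:versionInfo` to carry the system's name, URL, and version is one possible modeling choice, not something DCAT-US or PROV-O prescribes.

```python
import json


def build_provenance(dataset_id, system_name, system_url, system_version, prov_doc_url):
    """Return a JSON-LD dict linking a dataset to the system that generated it."""
    # Hypothetical IRIs derived from the inputs, for illustration only.
    agent_id = f"{system_url}#v{system_version}"
    activity_id = f"{dataset_id}#generation"
    return {
        "@context": {
            "dcat": "http://www.w3.org/ns/dcat#",
            "dcterms": "http://purl.org/dc/terms/",
            "prov": "http://www.w3.org/ns/prov#",
            "foaf": "http://xmlns.com/foaf/0.1/",
            "owl": "http://www.w3.org/2002/07/owl#",
        },
        "@graph": [
            {
                "@id": dataset_id,
                "@type": ["dcat:Dataset", "prov:Entity"],
                # PROV-O: the dataset was generated by an activity (the workflow run).
                "prov:wasGeneratedBy": {"@id": activity_id},
                # DCAT: qualified relation pointing at a separate provenance document.
                "dcat:qualifiedRelation": {
                    "@type": "dcat:Relationship",
                    "dcterms:relation": {"@id": prov_doc_url},
                    # Assumed role IRI; a real catalog would pick one from its own vocabulary.
                    "dcat:hadRole": {"@id": "https://example.org/roles/provenance"},
                },
            },
            {
                "@id": activity_id,
                "@type": "prov:Activity",
                "prov:wasAssociatedWith": {"@id": agent_id},
            },
            {
                "@id": agent_id,
                "@type": "prov:SoftwareAgent",
                "foaf:name": system_name,
                "foaf:homepage": {"@id": system_url},
                # Assumption: version carried as owl:versionInfo.
                "owl:versionInfo": system_version,
            },
        ],
    }


doc = build_provenance(
    "https://example.org/dataset/1",
    "nightly-export",
    "https://example.org/systems/nightly-export",
    "2.3.1",
    "https://example.org/prov/dataset-1.ttl",
)
print(json.dumps(doc, indent=2))
```

The version lives on the software agent rather than on the dataset, so several datasets produced by the same system version can point at the same agent node.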
Creator Name: Joel Natividad
Creator Affiliation: datHere, Inc.
Requirement(s)
When a dataset is produced/exported by a versioned business process/system, the provided provenance metadata should include:
Problem Statement
Datasets are often produced by repeatable business processes/workflows. These processes/workflows often have version numbers, and are also often automated/semi-automated.
To effectively track the provenance of datasets and increase their Reusability (R1.2) and, more importantly, their Reproducibility, it'd be great if we could get detailed, machine-readable provenance metadata about the process/system that produced them, beyond the free-text provenance statement.
Target Audience / Stakeholders
Everyone.
Intended Uses / Use Cases
Use 1. If errors are found in the data, track down the process/system that produced it, to help find the root cause of the error (is it a defect in the process/system? A data collection error?)
Use 2. Reproduce the dataset.
Use 3. Identify suspect datasets that need to be verified if a specific version of a system/process is found to have a defect/vulnerability.
Use 4. Identify suspect datasets that need to be recreated because they were created by a COTS package or a vendor that has been compromised.
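To make Use 3 concrete, here is a minimal Python sketch of how machine-readable provenance could be queried once the system identifier and version are recorded per dataset. The `ProvenanceRecord` type, its field names, and `find_suspect_datasets` are hypothetical; a real catalog would more likely run an equivalent query over its metadata store.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ProvenanceRecord:
    """Hypothetical flattened provenance entry for one dataset."""
    dataset_id: str
    system_id: str
    system_version: str


def find_suspect_datasets(records, system_id, affected_versions):
    """Return IDs of datasets produced by an affected version of the given system."""
    affected = set(affected_versions)
    return [
        r.dataset_id
        for r in records
        if r.system_id == system_id and r.system_version in affected
    ]


records = [
    ProvenanceRecord("ds-001", "nightly-export", "2.3.0"),
    ProvenanceRecord("ds-002", "nightly-export", "2.3.1"),
    ProvenanceRecord("ds-003", "survey-loader", "2.3.1"),
]

# Suppose version 2.3.1 of "nightly-export" is found to have a defect.
print(find_suspect_datasets(records, "nightly-export", ["2.3.1"]))
```

A lookup like this is not possible with a free-text provenance statement alone, since the system name and version would have to be parsed out of prose.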