RenskeW / runcrate-analysis

Analysis of runcrate 0.5.0

This repository documents the analysis of Workflow Run RO-Crates (WRROC) converted from CWLProv RO Bundles using runcrate. The results of this analysis are also published on Zenodo: https://doi.org/10.5281/zenodo.12689424.

The analysis follows the same methodology as previous work, in which we conducted a qualitative evaluation of metadata coverage in CWLProv (version 0.6.0). This earlier analysis was based on concrete examples of ROs associated with a realistic bioinformatics workflow. Here, we repeated the analysis for Workflow Run RO-Crate, and compared the WRROC RDF representation (in ro-crate-metadata.json) with the CWLProv RDF provenance graph.

Methods

We used the following approach and documented it in the Issues:

  1. Provenance metadata was classified into 6 categories: T1-6.
  2. For each category, we made an inventory of the metadata contained in CWLProv RO Bundles, both in RDF and in structured non-RDF documents (packed.cwl and primary-job.json/primary-output.json).
  3. Subsequently, we assessed if and how this information is represented in Workflow Run RO-Crates converted by runcrate, based on a number of examples (see below).
  4. Finally, we suggested how to represent metadata that is present in CWLProv but missing in RO-Crate.
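
As a starting point for the assessment in step 3, the entities in a converted crate can be tallied by type. The following is a minimal Python sketch, not part of the original analysis. It assumes only that ro-crate-metadata.json is a JSON-LD document with a top-level @graph array; the helper names load_graph and count_types are illustrative.

```python
import json
from collections import Counter
from pathlib import Path


def load_graph(crate_dir):
    """Return the list of entities stored in the crate's ro-crate-metadata.json."""
    metadata_path = Path(crate_dir) / "ro-crate-metadata.json"
    with metadata_path.open(encoding="utf-8") as handle:
        return json.load(handle)["@graph"]


def count_types(graph):
    """Count entities per @type; an entity with several types counts once per type."""
    counts = Counter()
    for entity in graph:
        types = entity.get("@type", [])
        if isinstance(types, str):
            types = [types]
        counts.update(types)
    return counts
```

Calling count_types(load_graph("path/to/crate")) on a crate directory (the path is a placeholder) gives a quick view of which entity types the conversion produced, which can then be compared against the inventory from step 2.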

Scenario 1: Analyze representation of CWL metadata fields, human agent, file characteristics, execution details

Scenario 2: Analyze representation of SoftwareRequirement

Scenario 3: Analyze representation of DockerRequirement

Scenario 4: Analyze representation of String, File, Directory and File array input parameters AND ResourceRequirement

Results

Overview of how each category of the provenance taxonomy is represented in RO-Crate. For a detailed explanation of each category, see https://doi.org/10.5281/zenodo.7014950.

SC1: Workflow design

Explanation of the design of the workflow and its steps can be included in the CWL metadata fields (doc, label, intent).

SC2: Entity annotations

The meaning of individual input/output data entities can be explained with structured annotations in the CWL input parameter file (not propagated to ro-crate-metadata.json), but the CWL standards v1.2 give no clear guideline on how to make these annotations.

SC3: Workflow execution annotations

Workflow execution annotations (why was this combination of input parameters chosen?) can be represented as annotations in the CWL input parameter file (unstructured, not propagated to ro-crate-metadata.json).

D1: Data identification

This information can be added to the CWL input parameter file as structured annotations, but the CWL standards v1.2 give no clear guideline on how to make these annotations.

D2: File characteristics

Filename and checksum are represented for all files; creation timestamps are available for output files. Additional structured annotations may be made in the CWL input parameter file. Filename and checksum are propagated to ro-crate-metadata.json.
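
A check of this kind can be sketched in Python as follows. This is a hedged illustration rather than the procedure used in the analysis: the function name files_missing is hypothetical, and the property names that hold the filename and checksum are passed in by the caller because they are not stated here and should be confirmed against an actual crate.

```python
def as_list(value):
    """Normalise a JSON-LD value that may be absent, a single item, or a list."""
    if value is None:
        return []
    return value if isinstance(value, list) else [value]


def files_missing(graph, properties):
    """Map the @id of each File entity to the requested properties it lacks."""
    report = {}
    for entity in graph:
        if "File" not in as_list(entity.get("@type")):
            continue  # only File entities are of interest here
        missing = [prop for prop in properties if prop not in entity]
        if missing:
            report[entity["@id"]] = missing
    return report
```

For example, files_missing(graph, ["alternateName", "sha1"]) would list the File entities lacking either property, where both property names are assumptions. An empty result means every File entity carries all requested properties.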

D3: Data access

The CWL standards v1.2 allow specification of a remote location for data, which would serve as access to a downloadable form of the data.

D4: Parameter mapping

Mapping of input/output data to workflow parameters is represented in ro-crate-metadata.json.
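
The mapping can be read back from the metadata file with a short script. The sketch below assumes the convention of the Workflow Run Crate profiles, in which a data entity points to the parameter it instantiates through exampleOfWork; the function name parameter_mapping is illustrative, and the link property should be verified on a real crate.

```python
def as_list(value):
    """Normalise a JSON-LD value that may be absent, a single item, or a list."""
    if value is None:
        return []
    return value if isinstance(value, list) else [value]


def parameter_mapping(graph):
    """Map each data entity @id to the @ids of the parameters it instantiates."""
    mapping = {}
    for entity in graph:
        references = as_list(entity.get("exampleOfWork"))
        # Keep only proper JSON-LD references of the form {"@id": "..."}.
        parameters = [
            ref["@id"] for ref in references
            if isinstance(ref, dict) and "@id" in ref
        ]
        if parameters:
            mapping[entity["@id"]] = parameters
    return mapping
```

Entities without an exampleOfWork link are left out of the result, so comparing the keys against the full list of data entities shows which inputs or outputs are not tied to a workflow parameter.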

SW1: Software identification

The SoftwareRequirement field is propagated to ro-crate-metadata.json. SoftwareRequirement contains a specs field with an IRI that resolves to a landing page with metadata about the tool (see CWL standards v1.2).
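
To see which specs IRIs a workflow declares before checking their propagation, the packed workflow can be walked recursively. This is a minimal Python sketch under stated assumptions: packed.cwl has been loaded as JSON, and requirements and hints appear in list form with an explicit class key. The function name software_specs is hypothetical.

```python
def software_specs(node):
    """Recursively collect (package, specs) pairs from SoftwareRequirement entries."""
    found = []
    if isinstance(node, dict):
        if node.get("class") == "SoftwareRequirement":
            packages = node.get("packages", [])
            if isinstance(packages, dict):
                # Map form: package name -> list of specs, or -> package body.
                packages = [
                    {"package": name, "specs": body} if isinstance(body, list)
                    else {"package": name, **(body or {})}
                    for name, body in packages.items()
                ]
            for package in packages:
                found.append((package.get("package"), list(package.get("specs", []))))
        for value in node.values():
            found.extend(software_specs(value))
    elif isinstance(node, list):
        for item in node:
            found.extend(software_specs(item))
    return found
```

With the packed workflow loaded through json.load, software_specs(document) returns one pair per declared package. A package with an empty specs list has no IRI to resolve, which is worth noting when assessing SW1.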

SW2: Software documentation

SoftwareRequirement field is propagated to ro-crate-metadata.json.

SW3: Software access

SoftwareRequirement field is propagated to ro-crate-metadata.json.

WF1: Workflow software

The workflow itself (packed.cwl) is contained in the CWLProv RO Bundle, as well as in the RO-Crate produced by runcrate. Metadata/documentation about the workflow can be represented in CWL metadata fields (doc, label, intent), which are propagated to ro-crate-metadata.json. ro-crate-metadata.json also contains a description of the workflow and all its parameters and steps. The representation of the workflow in CWLProv RDF is incomplete.

WF2: Workflow parameters

Information about the workflow parameters can be represented in the CWL metadata fields (doc, label, format).

WF3: Workflow requirements

The CWL ResourceRequirement field is partially propagated to ro-crate-metadata.json (Scenario 4).

ENV1: Software environment

Absent.

ENV2: Hardware environment

Absent.

ENV3: Container image

The container image is partially represented in the CWL DockerRequirement field, which is propagated to ro-crate-metadata.json (Scenario 3).

EX1: Execution timestamps

EX2: Consumed resources

Absent.

EX3: Workflow engine

EX4: Human agent