common-workflow-language / cwltool

Common Workflow Language reference implementation
https://cwltool.readthedocs.io/
Apache License 2.0

Confusing provenance model (`wasStartedBy`/`qualifiedStart`/`wasEndedBy`/`qualifiedEnd`) #2007

Open avillar opened 1 month ago

avillar commented 1 month ago

Actual Behavior

I'll use PROV-N notation here, but naturally the same applies to the RDF versions.

When generating a provenance trace for a CWL run, an Agent is defined to represent the tool running the workflow, like so:

agent(id:agent-id, [prov:type='prov:SoftwareAgent', prov:type='wfprov:WorkflowEngine', prov:label="cwltool 3.1.20240508115724"])

This Agent is also used as an activity in a way that IMHO is confusing. For example:

wasStartedBy(id:main-activity-id, -, id:agent-id, 2024-06-03T12:05:08.637959)

The Agent here appears in the starter position of wasStartedBy, i.e. it is used as the starter Activity for the main activity. While PROV-O doesn't prevent an Agent from also being an Activity, the semantics here are confusing: the Agent is one thing (the software that ran the workflow) and the starter Activity is another (a given execution of that software), and this provenance trace conflates the two.

A similar pattern is followed again when declaring a wasStartedBy for the software agent:

wasStartedBy(id:agent-id, -, id:empty-agent-id, 2024-06-03T13:25:29.034671)
agent(id:empty-agent-id)
agent(id:avillar, [prov:type='schema:Person', prov:type='prov:Person', prov:label="Alejandro Villar", foaf:name="Alejandro Villar", schema:name="Alejandro Villar"])
actedOnBehalfOf(id:empty-agent-id, id:avillar, -)

An empty (i.e., with no descriptive metadata) agent (the host?) is generated and said to act on behalf of the user (myself in this case). This empty agent is then bound to the software agent via wasStartedBy, effectively declaring that both are Activities.

Suggested Behavior

The Activity for starting/stopping the software agent should be separated from the Agent itself. I think something like the following (simplified) diagram would make more sense:

+-----------------+  wasStartedBy        +--------------------+
| starterActivity | <------------------- |    mainActivity    |
+-----------------+                      +--------------------+
  |                                        |
  |                                        | wasEndedBy
  |                                        v
  |                                      +--------------------+
  |                                      |   endingActivity   |
  |                                      +--------------------+
  |                                        |
  |                                        | wasAssociatedWith
  |                                        v
  |                 wasAssociatedWith    +--------------------+
  +------------------------------------> |   softwareAgent    |
                                         +--------------------+
                                           |
                                           | actedOnBehalfOf
                                           v
                                         +--------------------+
                                         |        user        |
                                         +--------------------+

Qualified relationships could be used for the associations if additional metadata needs to be included (such as timestamps).
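
As a rough PROV-N sketch of the diagram above (not a definitive model): the identifiers id:starter-activity-id and id:ending-activity-id are hypothetical names introduced only for illustration, the start timestamp is reused from the trace quoted earlier, and the end time is left unspecified. The empty agent is omitted here, as in the diagram.

```
// Agents: the software that ran the workflow, and the user it acted for.
agent(id:agent-id, [prov:type='prov:SoftwareAgent', prov:type='wfprov:WorkflowEngine', prov:label="cwltool 3.1.20240508115724"])
agent(id:avillar, [prov:type='schema:Person', prov:type='prov:Person', prov:label="Alejandro Villar"])

// Activities: the workflow run, plus separate (hypothetical) starter and ending activities.
activity(id:main-activity-id)
activity(id:starter-activity-id)
activity(id:ending-activity-id)

// The main activity is started and ended by Activities, not by the Agent.
wasStartedBy(id:main-activity-id, -, id:starter-activity-id, 2024-06-03T12:05:08.637959)
wasEndedBy(id:main-activity-id, -, id:ending-activity-id, -)

// The Agent is only associated with those Activities.
wasAssociatedWith(id:starter-activity-id, id:agent-id, -)
wasAssociatedWith(id:ending-activity-id, id:agent-id, -)

// Delegation goes directly from the software agent to the user.
actedOnBehalfOf(id:agent-id, id:avillar, -)
```

With this shape, no identifier is used both as an Agent and as the starter or ender of an Activity, so the software and a particular execution of it stay distinct.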

Workflow Code

hello_world.cwl, run with:

cwltool --provenance prov --enable-user-provenance --full-name 'Alejandro Villar' --enable-host-provenance hello_world.cwl

Your Environment