ResearchObject / workflow-run-crate

Workflow Run RO-Crate profile
https://www.researchobject.org/workflow-run-crate/
Apache License 2.0

Generate crate from CWLProv #5

Closed simleo closed 2 years ago

simleo commented 2 years ago

Adds a tool to generate a Workflow Run RO-Crate from CWLProv output. For now it's monolithic.

simleo commented 2 years ago

I've tried to map all (hopefully!) CWL types, using test/data/type-zoo-run-1/snapshot/type_zoo.cwl as a test case. Here are some notes on each:

string

Map to Text.

{
    "@id": "#param-main/in_str",
    "@type": "FormalParameter",
    "additionalType": "Text",
    "name": "main/in_str"
},
{
    "@id": "#pv-main/in_str",
    "@type": "PropertyValue",
    "exampleOfWork": {"@id": "#param-main/in_str"},
    "name": "main/in_str",
    "value": "spam"
}
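
The FormalParameter / PropertyValue pair above follows a fixed pattern, so it could be produced by a small helper. This is only a sketch: `make_param_entities` is a hypothetical name, not a function from the converter, and it builds plain dicts rather than ro-crate-py entities.

```python
def make_param_entities(name, additional_type, value):
    """Build the FormalParameter / PropertyValue pair for one parameter.

    Hypothetical helper: the "#param-" and "#pv-" id prefixes simply
    mirror the JSON shown in this comment.
    """
    param_id = f"#param-{name}"
    formal_parameter = {
        "@id": param_id,
        "@type": "FormalParameter",
        "additionalType": additional_type,
        "name": name,
    }
    property_value = {
        "@id": f"#pv-{name}",
        "@type": "PropertyValue",
        # Link the actual value back to the parameter definition.
        "exampleOfWork": {"@id": param_id},
        "name": name,
        "value": value,
    }
    return formal_parameter, property_value


param, pv = make_param_entities("main/in_str", "Text", "spam")
```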

array

Map to the element type (e.g., string[] to Text). A property in RO-Crate can have a single value or multiple values, so there should be no need to do anything special here.

{
    "@id": "#param-main/in_array",
    "@type": "FormalParameter",
    "additionalType": "Text",
    "name": "main/in_array"
},
{
    "@id": "#pv-main/in_array",
    "@type": "PropertyValue",
    "exampleOfWork": {"@id": "#param-main/in_array"},
    "name": "main/in_array",
    "value": ["foo", "bar"]
}

If we want to be more specific, we could add a PropertyValueSpecification with multipleValues set to True. I'm not sure how to tie it to the parameter though -- I briefly skimmed through https://www.w3.org/wiki/images/1/10/PotentialActionsApril11.pdf, and it looks like it could be relevant.

On a side note, it looks like ro-crate-py does not handle "value": ["foo", "bar"] correctly; we need to check Entity's magic item getter / setter.

Any

I'd map this to DataType. However, in the converter I'm not currently parsing the workflow file, and the provenance files don't have this information: the type is reported as xsd:string (i.e., the type of the actual value that was passed in the job config file) in the XML, and for now I'm just inferring the type from the deserialized JSON object anyway.
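
To make the limitation concrete, here is a rough sketch of what inferring the type from the deserialized JSON object can look like. `infer_additional_type` and the sample job config are hypothetical, not taken from the converter. Treating every dict as a record and falling back to DataType are assumptions of the sketch.

```python
import json


def infer_additional_type(value):
    """Guess a Schema.org type name from a deserialized JSON value."""
    # bool must be checked before int: in Python, bool is a subclass of int.
    if isinstance(value, bool):
        return "Boolean"
    if isinstance(value, int):
        return "Integer"
    if isinstance(value, float):
        return "Float"
    if isinstance(value, str):
        return "Text"
    if isinstance(value, list):
        # Arrays map to their element type(s), keeping first-seen order.
        types = []
        for item in value:
            item_type = infer_additional_type(item)
            if item_type not in types:
                types.append(item_type)
        if not types:
            return "DataType"  # empty array: nothing to infer from
        return types[0] if len(types) == 1 else types
    if isinstance(value, dict):
        return "PropertyValue"  # assumption: every dict is a record
    return "DataType"  # fallback for anything else (e.g., null)


# Hypothetical job config; "in_any" is declared as Any in the workflow,
# but only the type of the value that was passed can be recovered.
job = json.loads(
    '{"in_any": "spam", "in_int": 42, "in_bool": true, "in_array": ["foo", "bar"]}'
)
inferred = {name: infer_additional_type(value) for name, value in job.items()}
```

With this approach an Any input that received a string comes out as Text, which is exactly the information loss described above.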

boolean

Map to Boolean.

{
    "@id": "#param-main/in_bool",
    "@type": "FormalParameter",
    "additionalType": "Boolean",
    "name": "main/in_bool"
},
{
    "@id": "#pv-main/in_bool",
    "@type": "PropertyValue",
    "exampleOfWork": {"@id": "#param-main/in_bool"},
    "name": "main/in_bool",
    "value": "True"
}

Conveniently, str(True) yields "True", which is the name of Schema.org's True value, and the same goes for False. So we can simply use str to serialize booleans.

int, long

Map to Integer.

{
    "@id": "#param-main/in_int",
    "@type": "FormalParameter",
    "additionalType": "Integer",
    "name": "main/in_int"
},
{
    "@id": "#pv-main/in_int",
    "@type": "PropertyValue",
    "exampleOfWork": {"@id": "#param-main/in_int"},
    "name": "main/in_int",
    "value": "42"
}

Note that the value is serialized as a string: I followed https://schema.org/PropertyValue#eg-0404 (JSON-LD tab). Is there any relevant Schema.org recommendation anywhere? The RO-Crate spec should probably say something about this.

float, double

Map to Float, same considerations as int / long.
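
The serialization convention used so far for booleans and numbers (values written as strings) can be sketched in a few lines. `serialize_value` is a hypothetical name, and serializing each array element separately is an assumption rather than something the converter is known to do.

```python
def serialize_value(value):
    """Serialize a job value for use as PropertyValue "value"."""
    # Booleans and numbers are written as strings, e.g. True -> "True".
    if isinstance(value, (bool, int, float)):
        return str(value)
    # Assumption: arrays stay arrays, with each element serialized.
    if isinstance(value, list):
        return [serialize_value(item) for item in value]
    # Strings (and anything else) are passed through unchanged.
    return value


examples = [serialize_value(v) for v in (True, False, 42, 3.14, "spam")]
```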

multiple types

Map to an array containing the mapping of each type, e.g., [int, float] to ["Integer", "Float"]. This info is not available from the provenance files, so for now the converter is reporting the type of the value passed in the job config.

What should we do about optional params, e.g., [int, "null"]? Again, PropertyValueSpecification might be useful here, since it has a valueRequired property.
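
If the converter did parse the workflow file, the mappings listed in this comment could be applied to the declared types directly. The sketch below is hypothetical (`map_cwl_type` and `CWL_TO_SCHEMA` are made-up names) and covers only the CWL types discussed here. It returns a "required" flag for optional parameters, which is one possible input for something like valueRequired, not a settled design.

```python
# Mapping taken from the notes in this comment; other CWL types are not covered.
CWL_TO_SCHEMA = {
    "string": "Text",
    "boolean": "Boolean",
    "int": "Integer",
    "long": "Integer",
    "float": "Float",
    "double": "Float",
    "Any": "DataType",
    "enum": "Text",
    "record": "PropertyValue",
}


def map_cwl_type(cwl_type):
    """Return (additional_type, required) for a CWL type declaration."""
    if isinstance(cwl_type, list):
        # Union type: "null" marks the parameter as optional.
        members = [t for t in cwl_type if t != "null"]
        required = len(members) == len(cwl_type)
        mapped = []
        for member in members:
            member_type, _ = map_cwl_type(member)
            parts = member_type if isinstance(member_type, list) else [member_type]
            for part in parts:
                if part not in mapped:
                    mapped.append(part)
        return (mapped[0] if len(mapped) == 1 else mapped), required
    if isinstance(cwl_type, dict):
        # Expanded form, e.g. {"type": "array", "items": "string"}.
        if cwl_type["type"] == "array":
            return map_cwl_type(cwl_type["items"])[0], True
        return CWL_TO_SCHEMA[cwl_type["type"]], True
    if cwl_type.endswith("?"):
        # Shorthand for an optional type, e.g. "int?".
        return map_cwl_type(cwl_type[:-1])[0], False
    if cwl_type.endswith("[]"):
        # Shorthand for an array: map to the element type.
        return map_cwl_type(cwl_type[:-2])[0], True
    return CWL_TO_SCHEMA[cwl_type], True
```

For example, `map_cwl_type(["int", "null"])` gives the single type "Integer" with the required flag set to False.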

enum

Map to Text. Not sure if there's a way to specify a set of predefined allowed values.

record

Map to PropertyValue. It actually maps to an array of PropertyValues, but RO-Crate properties can have multiple values, so it's basically the same considerations we made for array. To serialize the actual value of {"in_record_A": "Tom", "in_record_B": "Jerry"}, I've used nested PropertyValues, with the record keys as additional slash-separated fields in the @id:

{
    "@id": "#param-main/in_record",
    "@type": "FormalParameter",
    "additionalType": "PropertyValue",
    "name": "main/in_record"
},
{
    "@id": "#pv-main/in_record",
    "@type": "PropertyValue",
    "exampleOfWork": {"@id": "#param-main/in_record"},
    "name": "main/in_record",
    "value": [
        {"@id": "#pv-main/in_record/in_record_A"},
        {"@id": "#pv-main/in_record/in_record_B"}
    ]
},
{
    "@id": "#pv-main/in_record/in_record_A",
    "@type": "PropertyValue",
    "name": "main/in_record/in_record_A",
    "value": "Tom"
},
{
    "@id": "#pv-main/in_record/in_record_B",
    "@type": "PropertyValue",
    "name": "main/in_record/in_record_B",
    "value": "Jerry"
},
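
The nested layout above can be generated from the record dict in one pass. This is a sketch with a hypothetical helper name (`record_to_entities`). It handles only flat records, so nested records would need recursion.

```python
def record_to_entities(name, record):
    """Turn a flat record into a parent PropertyValue plus one child per key."""
    parent_id = f"#pv-{name}"
    children = [
        {
            # The record key becomes an extra slash-separated field in the @id.
            "@id": f"{parent_id}/{key}",
            "@type": "PropertyValue",
            "name": f"{name}/{key}",
            "value": value,
        }
        for key, value in record.items()
    ]
    parent = {
        "@id": parent_id,
        "@type": "PropertyValue",
        "exampleOfWork": {"@id": f"#param-{name}"},
        "name": name,
        # The parent only references its children by @id.
        "value": [{"@id": child["@id"]} for child in children],
    }
    return [parent] + children


entities = record_to_entities(
    "main/in_record", {"in_record_A": "Tom", "in_record_B": "Jerry"}
)
```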

simleo commented 2 years ago

Note on mapping workflow-level parameters to step-level parameters

In general, this is / should be available from the prospective provenance part (workflow file). For instance, packed.cwl has entries like:

{
  "source": "#main/input",
  "id": "#main/rev/input"
}

Where "source" represents the workflow input and "id" the step input. However, in the case of files, primary.cwlprov.* alone is sufficient to infer such mappings, since the two roles eventually map to the same artifact.
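
As a rough illustration of the workflow-file route, the source / id pairs could be collected like this. The helper name and the sample workflow dict are hypothetical: only the first step input is copied from the entry quoted above, the second is made up, and the layout (a "steps" list whose items carry an "in" list) is an assumption about the packed file.

```python
def step_input_sources(workflow):
    """Map each source id to the step inputs it feeds."""
    mapping = {}
    for step in workflow.get("steps", []):
        for step_input in step.get("in", []):
            sources = step_input.get("source", [])
            # "source" may be a single id or a list of ids.
            if isinstance(sources, str):
                sources = [sources]
            for source in sources:
                mapping.setdefault(source, []).append(step_input["id"])
    return mapping


# Hand-written sample, not the actual content of packed.cwl.
workflow = {
    "class": "Workflow",
    "id": "#main",
    "steps": [
        {
            "id": "#main/rev",
            "in": [{"source": "#main/input", "id": "#main/rev/input"}],
        },
        {
            "id": "#main/other",  # made-up step for illustration
            "in": [{"source": ["#main/input"], "id": "#main/other/input"}],
        },
    ],
}
links = step_input_sources(workflow)
```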

simleo commented 2 years ago

Merging so we get the rendering of the revsort example (should appear at https://www.researchobject.org/workflow-run-crate/examples/draft/revsort-run-1-crate/). We can do more work on this in future PRs.