ResearchObject / ro-crate

Research Object Crate
https://w3id.org/ro/crate/
Apache License 2.0

Referencing sections of a document #200

Open simleo opened 2 years ago

simleo commented 2 years ago

While converting a cwltool --provenance RO to a Workflow Run RO-Crate, I'm faced with the problem of referring to individual workflow steps. The workflow is stored in "packed" form, meaning that the tools that implement each step are stored in the same packed.cwl document as the workflow. For the packed form, CWL uses the URI fragment syntax to assign IDs to the steps and the workflow itself; in this case, they are:

The workflow appears in the crate as a data entity with an @id of packed.cwl, so I decided to add the tools as SoftwareApplication entities with @id packed.cwl#rev and packed.cwl#sorted (whether this is correct is another matter: should they be packed.cwl#main/rev and packed.cwl#main/sorted instead?). Using fragments here seems quite reasonable, since the secondary resource is certainly "some portion or subset of the primary resource". However, should the tools be considered contextual entities or data entities? At first I tried to add them as contextual entities:

crate.add(SoftwareApplication(crate, instrument_id, properties={
    "name": instrument_id,
}))

Leading to:

{
    "@id": "packed.cwl",
    "@type": ["File", "SoftwareSourceCode", "ComputationalWorkflow"],
    "hasPart": [
        {"@id": "#packed.cwl#rev"},
        {"@id": "#packed.cwl#sorted"}
    ],
    ...
},
...

This does not really seem to work, because of the leading # in the tool IDs: ro-crate-py automatically adds a leading hash mark to contextual entity IDs that are not full URIs (I'm not sure this is a MUST in the RO-Crate spec, but it's at least implied). So I tried adding them as data entities instead:

crate.add(DataEntity(crate, instrument_id, properties={
    "@type": "SoftwareApplication",
    "name": instrument_id,
}))

Leading to:

{
    "@id": "./",
    "@type": "Dataset",
    "hasPart": [
        {"@id": "packed.cwl"},
        {"@id": "packed.cwl#rev"},
        {"@id": "packed.cwl#sorted"},
    ...
    ],
    ...
},
{
    "@id": "packed.cwl",
    "@type": ["File", "SoftwareSourceCode", "ComputationalWorkflow"],
    "hasPart": [
        {"@id": "packed.cwl#rev"},
        {"@id": "packed.cwl#sorted"}
    ],
    ...
},
...

I think this is more correct since section IDs have a document_id "#" fragment structure. However, having packed.cwl#rev and packed.cwl#sorted listed in the crate's hasPart seems a bit weird. The current spec says "where files and folders are represented as Data Entities in the RO-Crate JSON-LD, these MUST be linked to, either directly or indirectly, from the Root Data Entity using the hasPart property". However, these are not files, but file sections, and would still be linked indirectly (via packed.cwl) if removed from the crate's hasPart. Therefore, I think the spec should say that such "sections" MAY be listed.
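
To illustrate the "linked indirectly" point, here is a minimal sketch in plain Python (no ro-crate-py involved). The `reachable_ids` helper and the trimmed-down `graph` are hypothetical, made up for this example; the graph mimics the metadata above with the two sections removed from the root's hasPart. Following hasPart from the Root Data Entity still reaches them via packed.cwl:

```python
def reachable_ids(graph, root_id="./"):
    """Collect every @id reachable from root_id by following hasPart links.

    Hypothetical helper for illustration only; not part of ro-crate-py.
    """
    by_id = {entity["@id"]: entity for entity in graph}
    seen = set()
    stack = [root_id]
    while stack:
        current = stack.pop()
        if current in seen:
            continue
        seen.add(current)
        parts = by_id.get(current, {}).get("hasPart", [])
        if isinstance(parts, dict):  # JSON-LD allows a single object here
            parts = [parts]
        stack.extend(part["@id"] for part in parts)
    return seen


# Trimmed-down metadata graph: the sections are listed only under packed.cwl.
graph = [
    {"@id": "./", "@type": "Dataset", "hasPart": [{"@id": "packed.cwl"}]},
    {
        "@id": "packed.cwl",
        "@type": ["File", "SoftwareSourceCode", "ComputationalWorkflow"],
        "hasPart": [{"@id": "packed.cwl#rev"}, {"@id": "packed.cwl#sorted"}],
    },
    {"@id": "packed.cwl#rev", "@type": "SoftwareApplication"},
    {"@id": "packed.cwl#sorted", "@type": "SoftwareApplication"},
]

linked = reachable_ids(graph)
print(sorted(linked))
```

With this graph, both section IDs end up in `linked` even though the root's hasPart only mentions packed.cwl.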

I've made use of the workflow step example throughout the above discussion, but it actually generalizes to referencing sections of a document of any kind, when the document is part of the crate.

mr-c commented 2 years ago

so I decided to add the tools as SoftwareApplication entities with @id packed.cwl#rev and packed.cwl#sorted (whether this is correct is another matter: should they be packed.cwl#main/rev and packed.cwl#main/sorted instead?)

If used, it should be packed.cwl#main/rev and packed.cwl#main/sorted; there is neither a #rev nor #sorted in that document

simleo commented 2 years ago

Discussed at today's RO-Crate meeting:

stain commented 2 years ago

Right, packed.cwl#main/rev would be the way to refer to #main/rev within packed.cwl - CWL is unusual in that it has slash-based fragments, but this is also possible with XPath selectors for XML docs.
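
As a rough illustration of how such an identifier splits, here is a small sketch using only Python's standard `urllib.parse`. The base URL is a made-up placeholder for wherever the crate happens to live. The slash inside the fragment is not treated as a path separator:

```python
from urllib.parse import urldefrag, urljoin

# Split the URI reference into the document part and the fragment part.
doc, fragment = urldefrag("packed.cwl#main/rev")
print(doc)       # packed.cwl
print(fragment)  # main/rev

# Resolve against a base; "https://example.org/crate/" is a hypothetical
# location chosen only for this example.
base = "https://example.org/crate/"
absolute = urljoin(base, "packed.cwl#main/rev")
print(absolute)  # https://example.org/crate/packed.cwl#main/rev
```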

We could still add a section about referencing parts of other documents (which may even be contextual entities in another RO-Crate, some other Linked Data document, or just a section in an HTML/PDF document), to clarify that you can use any URI or URI reference containing # as the identifier of a contextual entity.