ResearchObject / ro-crate-py

Python library for RO-Crate
https://pypi.org/project/rocrate/
Apache License 2.0

Allow to attach partials to a crate? #146

Closed kinow closed 1 year ago

kinow commented 1 year ago

Hi,

For Autosubmit, since the workflow configuration doesn't contain the information needed for RO-Crate, I used the exact same approach from COMPSs and asked users to provide a YAML file with authors & license.

Then I create the objects and add them to the ro-crate-py crate object.

The implementation in Autosubmit is similar, but not identical, to the one in COMPSs. Other workflow managers with a similar need may craft yet another way of doing the same thing.

It would be nice if there was a way to load RO-Crate-py entities directly from a dictionary/YAML data. Something like


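import yaml

from rocrate.model.person import Person
from rocrate.rocrate import ROCrate

crate = ROCrate()

with open('') as f:
    yaml_content = yaml.safe_load(f)

for author in yaml_content['authors']:
    # Person.load_from_dict is the method proposed here, not an existing ro-crate-py call
    crate.add(Person.load_from_dict(author))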

Not sure how to validate the format of the entities... maybe instead of YAML receive JSON-LD directly, or provide a tool/script to read SPARQL+SHACL, etc.?

Cheers Bruno

simleo commented 1 year ago

I don't think this is feasible, mainly because RO-Crate is not just about entities but also their relationships. Suppose you somehow loaded authors from a file like this:

[
    {
        "@id": "https://orcid.org/0000-0002-1825-0097",
        "@type": "Person",
        "name": "Josiah Carberry"
    },
    {
        "@id": "https://orcid.org/0000-0000-0000-0000",
        "@type": "Person",
        "name": "John Doe"
    }
]

What do you link them to? What's the crate entity that was authored by these persons? You'd need a way to express that, but the user that writes the above file does not even know they're going to end up in an RO-Crate. Even if that was doable, you'd be basically asking users to write full RO-Crate markup. The engine's code that's generating the RO-Crate is the only one who has all the knowledge to make sense of the whole thing. It knows that e.g. those are workflow authors (because they were entered in a file meant to communicate that information) and it knows what's the crate entity that represents the workflow.
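A minimal sketch of the linking step described above, using plain dicts as stand-ins for crate entities rather than the ro-crate-py API. The function name `attach_authors` and the `workflow.yml` identifier are assumptions made for illustration; the point is that the caller (the engine) has to supply the target entity, because the author file alone cannot.

```python
# Plain dicts stand in for crate entities; this is not the ro-crate-py API.
def attach_authors(graph, target_id, persons):
    """Add person entities to the graph and reference them from target_id."""
    by_id = {entity["@id"]: entity for entity in graph}
    target = by_id[target_id]  # KeyError if the engine names a missing entity
    authors = target.setdefault("author", [])
    for person in persons:
        if person["@id"] not in by_id:
            graph.append(person)
            by_id[person["@id"]] = person
        ref = {"@id": person["@id"]}
        if ref not in authors:
            authors.append(ref)
    return graph


# Hypothetical crate content: a root dataset and a workflow file.
graph = [
    {"@id": "./", "@type": "Dataset"},
    {"@id": "workflow.yml", "@type": ["File", "ComputationalWorkflow"]},
]
persons = [
    {"@id": "https://orcid.org/0000-0002-1825-0097", "@type": "Person",
     "name": "Josiah Carberry"},
    {"@id": "https://orcid.org/0000-0000-0000-0000", "@type": "Person",
     "name": "John Doe"},
]
# Only the engine knows that these persons authored the workflow.
attach_authors(graph, "workflow.yml", persons)
```

The persons end up in the graph either way; what the file cannot express is the second argument.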

stain commented 1 year ago

See @dgarijo and @PavelAntonia's approach in https://github.com/oeg-upm/ya2ro, which uses a YAML template to make an RO-Crate and can even look up ORCIDs, e.g.:

type: "paper"

title: "DockerPedia: A Knowledge Graph of Software Images and their Metadata"

authors:
  - # Daniel Garijo
    orcid: http://orcid.org/0000-0003-0454-7145
    role: "Researcher"
  -
    name: "Maximiliano Osorio"
    position: "Research Programmer"
    description: "Computer Scientist at the Information Sciences Institute of the University of Southern California."

Could this be used to create a skeleton RO-Crate that then is augmented by the Autosubmit code?

stain commented 1 year ago

With ro-crate-py you could also load entities from a second RO-Crate, of course. But copying them over may need some deep-copy logic in case they reference other contextual entities (e.g. for affiliation).
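A rough sketch of what such deep-copy logic could look like, assuming both crates are represented as plain dicts mapping `@id` to entity (a stand-in for real crate objects, not the ro-crate-py API). The helper names `iter_refs` and `copy_with_references` are hypothetical.

```python
import copy


def iter_refs(value):
    """Yield every @id referenced from a property value (dict, list or scalar)."""
    if isinstance(value, dict):
        if "@id" in value:
            yield value["@id"]
    elif isinstance(value, list):
        for item in value:
            yield from iter_refs(item)


def copy_with_references(source, target, entity_id):
    """Copy entity_id, and everything it references, from source into target.

    Both arguments map @id -> entity dict. Entities already in target are kept,
    and references to ids that source does not describe are left dangling.
    """
    if entity_id in target or entity_id not in source:
        return
    entity = copy.deepcopy(source[entity_id])
    target[entity_id] = entity  # register before recursing, so cycles terminate
    for key, value in entity.items():
        if key.startswith("@"):
            continue
        for ref in iter_refs(value):
            copy_with_references(source, target, ref)


source = {
    "https://orcid.org/0000-0001-8250-4074": {
        "@id": "https://orcid.org/0000-0001-8250-4074",
        "@type": "Person",
        "affiliation": {"@id": "https://ror.org/05sd8tv96"},
    },
    "https://ror.org/05sd8tv96": {
        "@id": "https://ror.org/05sd8tv96",
        "@type": "Organization",
        "name": "Barcelona Supercomputing Center",
    },
    "#unrelated": {"@id": "#unrelated", "@type": "Thing"},
}
target = {}
copy_with_references(source, target, "https://orcid.org/0000-0001-8250-4074")
```

Only the person and the organization it points to are copied; entities that nothing references stay behind.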

kinow commented 1 year ago

> See from @dgarijo @PavelAntonia approach on https://github.com/oeg-upm/ya2ro on using a YAML template for making an RO-Crate, which can even look up ORCIDs, e.g.:

That looks similar to COMPSs & Autosubmit current approach.

In COMPSs (cc @rsirvent) you have to use a YAML like

COMPSs Workflow Information:
  name: Name of your COMPSs application
  description: Detailed description of your COMPSs application
  license: Apache-2.0  # Provide better a URL, but these strings are accepted:
            # https://about.workflowhub.eu/Workflow-RO-Crate/#supported-licenses
  sources_dir: [path_to/dir_1, path_to/dir_2]  # Optional: List of directories containing application source files. 
            # Relative or absolute paths can be used
  sources_main_file: my_main_file.py  # Optional  Name of the main file of the application, located in one of the 
            # sources_dir. Relative paths from a sources_dir or absolute paths can be used
  files: [main_file.py, aux_file_1.py, aux_file_2.py] # List of application files
            # Relative or absolute paths can be used
Authors:
  - name: Author_1 Name
    e-mail: author_1@email.com
    orcid: https://orcid.org/XXXX-XXXX-XXXX-XXXX
    organisation_name: Institution_1 name
    ror: https://ror.org/XXXXXXXXX # Find them in ror.org
  - name: Author_2 Name
    e-mail: author2@email.com
    orcid: https://orcid.org/YYYY-YYYY-YYYY-YYYY
    organisation_name: Institution_2 name
    ror: https://ror.org/YYYYYYYYY # Find them in ror.org

And in Autosubmit I implemented it so users have to provide this info that's missing from our workflow configuration:

license: Apache-2.0 # Find in https://spdx.org/licenses/
authors:
  - name: Bruno P. Kinoshita
    email: bruno.depaulakinoshita@bsc.es
    orcid: https://orcid.org/0000-0001-8250-4074
    organisation_name: Barcelona Supercomputing Center
    ror: https://ror.org/05sd8tv96 # Find them in https://ror.org
  - name....

A common approach for these three implementations/tools would be really great.
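As a sketch of what a shared convention might reduce to, the author entries from the COMPSs and Autosubmit files above can be mapped to the same pair of JSON-LD entities. This is an illustration under assumptions: the function name `author_to_entities` is hypothetical, the YAML is taken as already parsed into a dict, and every entry is assumed to carry an `orcid` to serve as the `@id`.

```python
def author_to_entities(author):
    """Turn one parsed YAML author entry into Person/Organization JSON-LD dicts."""
    person = {"@id": author["orcid"], "@type": "Person", "name": author["name"]}
    # Autosubmit spells the key "email", COMPSs spells it "e-mail".
    email = author.get("email") or author.get("e-mail")
    if email:
        person["email"] = email
    entities = [person]
    if author.get("ror"):
        org = {"@id": author["ror"], "@type": "Organization"}
        if author.get("organisation_name"):
            org["name"] = author["organisation_name"]
        # The person only references the organization; the entity itself is separate.
        person["affiliation"] = {"@id": org["@id"]}
        entities.append(org)
    return entities


entities = author_to_entities({
    "name": "Bruno P. Kinoshita",
    "email": "bruno.depaulakinoshita@bsc.es",
    "orcid": "https://orcid.org/0000-0001-8250-4074",
    "organisation_name": "Barcelona Supercomputing Center",
    "ror": "https://ror.org/05sd8tv96",
})
```

The result still has to be linked to a crate entity by the engine, which is the part no shared file format can decide on its own.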

> You could with ro-crate-py also load entities from a second RO-Crate of course. But copying them over may need some deep-copy logic in case they have references to other contextual entities (e.g. for affiliation)

Maybe there's something in JSON-LD to combine schemas or files? If so then we could have these implementations asking users to provide a partial JSON-LD or YAML-LD and then just merge it with the RO-Crate metadata?

Thanks!

kinow commented 1 year ago

I spent the afternoon today reading about JSON-LD, RO-Crate, and reading the ro-crate-py code (and thinking :nerd_face: ).

From what I understood, the entities, mapped as Python classes in ro-crate-py, all take a class (related to the @type: Person for the schema.org Person, I believe, and ContextEntity for other entities not mapped), an ID (@id), and a dictionary that's used as the properties of the entity.

The properties are used as the _jsonld attribute of the entity.

So I think I could simplify the process of attaching external partial information to existing data within the crate, similar to how you would do Object.assign(existingObject, { '@type': 'Person' }) in JavaScript to add a new property to an existing object.

My idea is to provide a JSON file, with a similar structure to the ro-crate-metadata.json graph, but with objects that can be incomplete (as the crate will have the existingObject).

Here's what I sketched today (I refrained from touching the code in Autosubmit until I have it clearer in my mind & on paper) (oh, and using JS due to the comments):

# TODO: add a section to AS docs explaining the idea behind it, link to schemaOrg page and playground, and provide examples for data missing from AS that is interesting for workflow authors/devs, such as license, inputs, outputs, etc

{
  # No context here, as these are partial/patches.
  "@graph": [
    # This is for the metadata itself, extra data that we want to add. Matching is through @id!
    {
      "@id": "ro-crate-metadata.json",
      "license": "https://spdx.org/licenses/Apache-2.0.html",
      "author": [
        {
          "@id": "https://orcid.org/0000-0001-8250-4074"
        }
      ]
    },
    # This is for the Autosubmit processed/unified workflow configuration.
    {
      "@id": "./",
      "author": [
        {
          "@id": "https://orcid.org/0000-0001-8250-4074"
        }
      ],
      "license": "https://spdx.org/licenses/Apache-2.0.html"
    },
    {
      "@id": "https://spdx.org/licenses/Apache-2.0.html",
      "@type": "CreativeWork",
      "identifier": "Apache-2.0",
      "name": "Apache License 2.0",
      "url": "https://www.apache.org/licenses/LICENSE-2.0"
    },
    # This is related to authorship & affiliation.
    {
      "@id": "https://orcid.org/0000-0001-8250-4074",
      "@type": "Person",  # When the @type is present, we will search it in the ro-crate-py classes and call crate.add(#type-class, @id, properties).
      "name": "Bruno P. Kinoshita",
      "affiliation": {
        "@id": "https://ror.org/05sd8tv96"
      },
      "contactPoint": {
        "@id": "mailto: blabla@bsc.es"
      }
    },
    {
      "@id": "mailto: blabla@bsc.es",
      "@type": "ContactPoint", # When the @type does not match a class, we will use ContextEntity.
      "contactType": "Author",
      "email": "blabla@bsc.es",
      "identifier": "blabla@bsc.es",
      "url": "https://orcid.org/0000-0001-8250-4074"
    },
    {
      "@id": "https://ror.org/05sd8tv96",
      "@type": "Organization",
      "name": "Barcelona Supercomputing Center"
    },
    # This is for the inputs & outputs.
    {
      "@id": "autosubmit-complete-workflow.yml",
      "input": [
        { "@id": "#param001" }
      ],
      "output": [
        { "@id": "#output001" }
      ]
    },
    {
      "@id": "#param001"... WIP
    }
  ]
}

With that I will simplify my code, and instead of parsing YAML and writing custom code to "stitch" things up, I will:

  1. extract the @id and search for it in the ROCrate object
  2. if found, then just add everything that does not start with @ as properties
  3. if not found, and if I have @type, then:
     3.1. search for the @type value in the model classes of ro-crate-py (e.g. Person)
     3.2. if found, then use that type
     3.3. otherwise, use ContextEntity
     3.4. call crate.add($ENTITY_CLASS, $VALUE_OF_ID, $NON_@_ATTRS_AS_PROPERTIES)
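The steps above can be sketched as follows. This is an illustration only: the `ContextEntity` and `Person` classes here are small local stand-ins, not the ro-crate-py classes, `KNOWN_TYPES` and `apply_patch` are hypothetical names, and the crate is modelled as a dict mapping `@id` to entity.

```python
class ContextEntity:
    """Local stand-in for a generic contextual entity (illustration only)."""

    def __init__(self, identifier, properties):
        self.id = identifier
        self.properties = dict(properties)


class Person(ContextEntity):
    """Local stand-in for a type that has a dedicated class."""


KNOWN_TYPES = {"Person": Person}  # hypothetical registry of mapped types


def apply_patch(entities, patch_graph):
    """Apply partial entities to `entities`, a dict mapping @id -> entity."""
    for partial in patch_graph:
        entity_id = partial["@id"]                                  # step 1
        props = {k: v for k, v in partial.items() if not k.startswith("@")}
        if entity_id in entities:                                   # step 2
            entities[entity_id].properties.update(props)
        elif "@type" in partial:                                    # step 3
            type_name = partial["@type"]
            cls = ContextEntity                                     # step 3.3
            if isinstance(type_name, str):
                cls = KNOWN_TYPES.get(type_name, ContextEntity)     # steps 3.1, 3.2
            # The type is kept so that a generic entity still records it.
            props["@type"] = type_name
            entities[entity_id] = cls(entity_id, props)             # step 3.4
        # Partials with neither a matching entity nor an @type are skipped.
    return entities


entities = {"./": ContextEntity("./", {"@type": "Dataset"})}
apply_patch(entities, [
    {"@id": "./", "license": "https://spdx.org/licenses/Apache-2.0.html"},
    {"@id": "https://orcid.org/0000-0001-8250-4074", "@type": "Person",
     "name": "Bruno P. Kinoshita"},
    {"@id": "https://ror.org/05sd8tv96", "@type": "Organization",
     "name": "Barcelona Supercomputing Center"},
    {"@id": "#orphan", "name": "no type and no match"},
])
```

Skipping partials that have neither a match nor a type is a choice made for this sketch; raising an error would be an equally reasonable policy.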

I started working on the inputs & outputs for Autosubmit, but I couldn't find good examples on WorkflowHub.eu. I will comment on #148 as I think that's pertinent to the latest comments there.

-Bruno

simleo commented 1 year ago

If you're only going to allow contextual entities as new entities to be added, there's no need to search for an existing type in the model. Just add it as a ContextEntity, whether it's in the model or not. You have to keep the @type when you pass the properties. For instance:

from rocrate.model.contextentity import ContextEntity

org_dict = {
    "@id": "https://ror.org/05sd8tv96",
    "@type": "Organization",
    "name": "Barcelona Supercomputing Center"
}
org_id = org_dict.pop("@id")
org = crate.add(ContextEntity(crate, org_id, properties=org_dict))

For updates, removing all keys that start with @ seems the safest way to go. Keep the id and get the corresponding entity with crate.get. Then the fastest way to update the entity would be:

entity._jsonld.update(update_dict)
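A small sketch of what a dict-style update does to an entity, using a plain dict in place of the entity's underlying JSON-LD (an assumption for illustration; `strip_keywords`, `#alice` and `#bob` are hypothetical). Keys starting with @ are dropped from the patch, and any property present in both is overwritten rather than merged.

```python
def strip_keywords(partial):
    """Return the partial without JSON-LD keywords such as @id and @type."""
    return {key: value for key, value in partial.items() if not key.startswith("@")}


# Plain dict standing in for an existing entity's JSON-LD.
entity = {"@id": "./", "@type": "Dataset", "author": [{"@id": "#alice"}]}

partial = {
    "@id": "./",
    "@type": "Thing",  # ignored: keywords are stripped before updating
    "author": [{"@id": "#bob"}],
    "license": {"@id": "https://spdx.org/licenses/Apache-2.0.html"},
}

entity.update(strip_keywords(partial))
# entity keeps its @id and @type, gains "license", and "author" now lists only #bob
```

If multi-valued properties such as author should accumulate instead, the lists would need to be merged before updating.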