crs4 / rocrate-validator

A Python package to validate RO-Crates
Apache License 2.0

LinkML support for future profile development #2

Open elichad opened 2 months ago

elichad commented 2 months ago

Expanding on a discussion with @simleo in the WRROC meeting.

At Manchester we've just started trying to use LinkML to write schemas for RO-Crate validation. LinkML schemas are YAML-based and therefore a lot easier for inexperienced users to comprehend and add to - and crucially, they can also be converted to SHACL. We think that it's important for RO-Crate profile developers to be able to write a validation schema for their profile themselves, and LinkML is a more approachable framework than SHACL to achieve this (as profile developers may not be linked data/RDF experts).

There has been interest and discussion around this previously: see https://github.com/ResearchObject/ro-crate/issues/264 and https://github.com/linkml/linkml/issues/1462

Thinking about how future profiles could be developed using LinkML in a way that's compatible with this validator package, there are a few possible approaches:

Please let me know your thoughts about what the best direction would be.

multimeric commented 1 month ago

LinkML is published on PyPI, so it could be added as a dependency. The same goes for pySHACL.

I'm not really sure of the advantage of supporting LinkML directly, since it will always be converted to SHACL and so embracing SHACL makes more sense to me. For this reason I think it would make sense to create a separate repository with RO-Crate schemas in SHACL format, then add a shacl extra to this package that pulls in pySHACL and that repo to validate against. That way it doesn't make the installation heavier for people who are using other validation standards.

If you wanted to also support LinkML then you could create another repo with the LinkML, then add a linkml extra that pulls in the linkml package and that schema. Then when the validator runs, it does the conversion to SHACL and validates it. This could be done as a second step though, so as to work in manageable chunks.
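
To make the "extras" idea concrete, here is a minimal sketch of how the validator could check at run time whether an optional backend is importable. The extra names (shacl, linkml) follow the suggestion above, and the helper names are hypothetical; none of this is taken from the rocrate-validator code base.

```python
from importlib.util import find_spec

# Hypothetical mapping from an extra name to the top-level modules it would
# pull in. "pyshacl" and "linkml" are the import names of the PyPI packages.
EXTRA_MODULES = {
    "shacl": ["pyshacl"],
    "linkml": ["linkml", "pyshacl"],  # LinkML profiles are converted to SHACL
}


def missing_modules(extra, is_installed=None):
    """Return the modules required by `extra` that cannot be imported."""
    if is_installed is None:
        # Default probe: ask the import system without importing anything.
        is_installed = lambda name: find_spec(name) is not None
    return [name for name in EXTRA_MODULES[extra] if not is_installed(name)]


def require_extra(extra, is_installed=None):
    """Raise a readable error if the optional backend is not available."""
    missing = missing_modules(extra, is_installed)
    if missing:
        raise RuntimeError(
            f"The '{extra}' extra is not installed (missing: {', '.join(missing)})"
        )
```

Calling require_extra("linkml") before running a conversion would keep the base installation light while still giving users a clear message when a backend is absent.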

Happy to help with any of this.

ilveroluca commented 1 month ago

Hi all,

this seems like a good proposal @elichad. We were discussing it with @kikkomep and @simleo just yesterday and we’re all in agreement that being able to use LinkML as an alternative to SHACL could make adding support for additional profiles more approachable.

Our first impression is that the best way to start integrating LinkML support would be the second approach you suggested:

  • include LinkML schemas and their SHACL conversions within the repo, such that developers can update the LinkML and the SHACL conversion is automatically generated for the validation code to use

The word “automatically” needs some discussion, though. We could have a directory within the package for the LinkML profiles, but the profiles wouldn’t actually be used at run time. Instead, we’d propose the simple solution of converting the profiles to SHACL as part of the development or packaging process. This should make it easier to test the converted profiles and fix things as necessary before release; keeping the conversion step ahead of run time should also help make the tool more robust and easier to debug. Since we discussed this yesterday, @multimeric joined the conversation and also made some points that we should discuss together.

To implement the LinkML -> SHACL conversion it looks like we can use the SHACL generator you referenced. @kikkomep ran some experiments and managed to successfully create a LinkML validation profile, convert it to SHACL with the generator and use it within rocrate-validator. The process did expose some small bugs in the internal SHACL parsing (which have been fixed) and there is an open issue with respect to how to manage severities. For the conversion, there could be either a dedicated script or a subcommand that runs the conversion and lays out the resulting Turtle (.ttl) files following the directory structure used by rocrate-validator for the validation profiles. The profile.ttl file would have to be created manually (though, if we wanted, it wouldn’t be too hard to create a little script to guide the collection of the required metadata).
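
A dedicated conversion script along these lines could be quite small. The sketch below assumes LinkML's gen-shacl command is on the PATH and writes the generated Turtle to standard output; the flat "one .ttl per schema" layout and all function names are illustrative assumptions, not the layout rocrate-validator actually requires.

```python
import subprocess
from pathlib import Path


def shacl_target(schema_path, profile_dir):
    """Map e.g. schemas/person.yaml to <profile_dir>/person.ttl (assumed layout)."""
    return Path(profile_dir) / (Path(schema_path).stem + ".ttl")


def conversion_command(schema_path, include_annotations=True):
    """Build the gen-shacl command line for one LinkML schema."""
    cmd = ["gen-shacl"]
    if include_annotations:
        # Carries LinkML annotations over into the generated shapes.
        cmd.append("--include-annotations")
    cmd.append(str(schema_path))
    return cmd


def convert(schema_path, profile_dir):
    """Run the generator and store its output next to the other profile files."""
    target = shacl_target(schema_path, profile_dir)
    result = subprocess.run(
        conversion_command(schema_path),
        capture_output=True, text=True, check=True,
    )
    target.parent.mkdir(parents=True, exist_ok=True)
    target.write_text(result.stdout, encoding="utf-8")
    return target
```

Running convert() from a packaging or CI step, rather than at validation time, matches the "convert before release" approach described above and leaves the generated files available for testing.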

As I was saying, one thing that needs some careful thinking is how to attach severities (MUST, SHOULD, MAY) to the LinkML checks. One solution could be to use annotations, but that would need support from the conversion script/subcommand to parse that information out of the resulting ttl and use it to lay out the checks appropriately in the directory structure. Another option, still using LinkML annotations, would be to implement additional SHACL parsing in rocrate-validator to extract the severity annotations (there's already some parsing to extract metadata). We'd be happy to hear about better or simpler alternatives.
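
For the first option, the layout step could boil down to a lookup from severity to folder. The following is only a sketch: the folder names and the pairing of SHACL severities with MUST/SHOULD/MAY are assumptions made for illustration, and the shapes are passed in as a plain dict rather than parsed from Turtle.

```python
SH = "http://www.w3.org/ns/shacl#"

# Assumed pairing of SHACL severities with requirement-level folders.
SEVERITY_TO_FOLDER = {
    SH + "Violation": "must",
    SH + "Warning": "should",
    SH + "Info": "optional",
}


def folder_for(severity_iri, default="must"):
    """Pick the folder for a shape; SHACL treats a missing severity as Violation."""
    return SEVERITY_TO_FOLDER.get(severity_iri, default)


def group_by_folder(shapes):
    """Group shape names by target folder.

    `shapes` maps a shape name to its severity IRI, or None if unannotated.
    """
    layout = {}
    for name, severity in shapes.items():
        layout.setdefault(folder_for(severity), []).append(name)
    return layout
```

The second option would move the same lookup into the validator itself, so the conversion output would not need to be split across folders at all.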

As for helping, we're happy to receive and support PRs on this issue. Let's just agree on the approach before anyone starts hacking :-)

elichad commented 1 month ago

Next steps after discussion at the Workflow Run RO-Crate meeting today:

Make a proof of concept LinkML-SHACL integration, to check that LinkML is a viable option for writing profiles:

After that (assuming LinkML is shown to be viable), we'll look at adding validation for the Five Safes Crate profile with this LinkML-SHACL approach, as this would be useful for our team at Manchester.

We'll work on this on the Manchester side. I've just made a fork which we'll contribute back from: https://github.com/eScienceLab/rocrate-validator

kikkomep commented 1 month ago

PR #8 introduces support for the severity property in both SHACL and Python requirement checks. Specifically, SHACL requirements can directly use the SHACL sh:severity property (sh := http://www.w3.org/ns/shacl#) to define the severity of a constraint. The folder structure typically used in validation profiles (the must, should, and optional folders, which assign severity levels to the requirement checks) is still supported but not mandatory.

This feature should simplify the process of converting a LinkML specification to SHACL, as the output of the conversion can be used directly by the validator without creating the folder structure mentioned above. From my experiments, simply annotating the LinkML slots with the sh:severity property should be sufficient to correctly assign the severity levels to each constraint. You can also use annotations to customize the name and description of a requirement check, as well as the corresponding error message, if needed, as shown in the following example:

Person:
    is_a: NamedThing
    description: >-
      A person....
    class_uri: schema:Person
    slots:
      - primary_email
    slot_usage:
      primary_email:
        pattern: "^\\S+@[\\S+\\.]+\\S+"
        recommended: true
        annotations:
          sh:severity: sh:Warning
          sh:name: "Primary Email Validation"
          sh:description: "This requirement checks the validity of the primary email address."
          sh:message: "The primary email address is not valid."
...

By using the LinkML-SHACL conversion tool with the --include-annotations option to include SHACL annotations in the generated SHACL files, you should obtain a SHACL shape that can be directly used by the validator:

schema1:PersonTest a sh:NodeShape ;
    rdfs:subClassOf personinfo:NamedThing ;
    sh:closed true ;
    sh:description "A person...." ;
    sh:ignoredProperties ( rdf:type ) ;
    sh:property [ 
      sh:datatype xsd:string ;
      sh:description "This requirement checks the validity of the primary email address."^^xsd:string ;
      sh:maxCount 1 ;
      sh:message "The primary email address is not valid."^^xsd:string ;
      sh:name "Primary Email Validation"^^xsd:string ;
      sh:nodeKind sh:Literal ;
      sh:order 0 ;
      sh:path schema1:email ;
      sh:pattern "^\\S+@[\\S+\\.]+\\S+" ;
      sh:severity sh:Warning 
    ],
  ...

All that remains is to place it in the appropriate folder within your validation profile.
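
Before dropping the shape into a profile, it can be useful to check what the email pattern from the example actually accepts. The snippet below is a rough local check using Python's re module; SHACL's sh:pattern follows SPARQL regex semantics, so treating Python's engine as equivalent is an assumption, and the helper name is hypothetical.

```python
import re

# The YAML string "^\\S+@[\\S+\\.]+\\S+" unescapes to the regex below.
EMAIL_PATTERN = re.compile(r"^\S+@[\S+\.]+\S+")


def looks_like_email(value):
    """Return True if the value satisfies the pattern from the example."""
    # search() rather than fullmatch(): the pattern is only anchored at the start.
    return EMAIL_PATTERN.search(value) is not None


print(looks_like_email("alice@example.org"))  # True
print(looks_like_email("not-an-email"))       # False
```

Note that the character class [\S+\.] already accepts any non-whitespace character, so the pattern is quite permissive: it only requires an "@" followed by at least two non-whitespace characters. That seems consistent with the check being reported at sh:Warning severity rather than as a hard failure.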

elichad commented 1 month ago

@kikkomep amazing! Thank you for implementing this so quickly!

multimeric commented 6 hours ago

Hi all, have there been any recent updates on LinkML implementation here?