shacl - Background knowledge for validation

aidig commented 4 years ago

This is not - as such - a new issue, but an attempt to highlight and generalize problems raised by init-dcat-ap-de in #115, #116, #116 and #117 where the lack of background knowledge for the validation results in several error types (sh:Violation) when attempting a shacl validation using the DCAT-AP validator: https://www.itb.ec.europa.eu/shacl/dcat-ap/upload

Similarly, all attempts to classify resources directly by using the URI of the skos:Concept individual will produce an error (violation). Not a warning or a message ... an error.

This is problematic.

Although it is fully understandable that the ambition is to keep the SHACL constraints as close as possible to the constraints expressed in the specification, this might lead to datasets being described in less detail (e.g. the contact point is described as a vcard:Kind, although vcard:Organization would be correct and more precise, #115) or to the publisher having to add quite a lot of background knowledge explicitly to the dataset description, in effect doing the job of a reasoner (e.g. specifying that a given landingPage URL is in fact a foaf:Document, #116).
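To make the vcard:Kind example concrete, here is a minimal sketch in plain Python (no SHACL library) of how a class check in the style of sh:class behaves with and without background knowledge. The ex: resources are hypothetical, and the subclass axiom is assumed to be the one published in the vCard ontology; the check only sees it if it is part of the graph being validated.

```python
# Triples are plain (subject, predicate, object) tuples of prefixed names.
RDF_TYPE = "rdf:type"
SUBCLASS_OF = "rdfs:subClassOf"


def superclasses(cls, triples):
    """Return cls plus every class reachable from it through rdfs:subClassOf."""
    seen, todo = {cls}, [cls]
    while todo:
        current = todo.pop()
        for s, p, o in triples:
            if s == current and p == SUBCLASS_OF and o not in seen:
                seen.add(o)
                todo.append(o)
    return seen


def has_class(node, expected, triples):
    """Mimic sh:class: node has an rdf:type that is `expected` or a subclass of it."""
    types = [o for s, p, o in triples if s == node and p == RDF_TYPE]
    return any(expected in superclasses(t, triples) for t in types)


# Hypothetical dataset description: the contact point is typed precisely.
data = {
    ("ex:dataset", "dcat:contactPoint", "ex:agency"),
    ("ex:agency", RDF_TYPE, "vcard:Organization"),
}
# Background knowledge (assumed here): the vCard subclass axiom.
background = {("vcard:Organization", SUBCLASS_OF, "vcard:Kind")}

print(has_class("ex:agency", "vcard:Kind", data))               # False
print(has_class("ex:agency", "vcard:Kind", data | background))  # True
```

The data is identical in both calls; only the presence of the subclass triple decides whether the more precise typing is reported as a failure.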

Furthermore, it is also very interesting to note that examples provided by DCAT 2.0 will produce several shacl violations of the above-mentioned type with the current constraints. (https://github.com/w3c/dxwg/tree/gh-pages/dcat/examples)

Perhaps the severity of these shape types could be weakened from sh:Violation to sh:Warning or even sh:Info?
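In SHACL itself the severity is declared on the shape with sh:severity. As a rough illustration of what the proposed weakening would change for a consumer of the report, the following Python sketch relabels class-membership results after validation. The result dictionaries and their keys are hypothetical stand-ins for a validation report, not the output format of the DCAT-AP validator.

```python
CLASS_COMPONENT = "sh:ClassConstraintComponent"


def downgrade_class_results(results, new_severity="sh:Warning"):
    """Return copies of the results with class-membership findings relabelled."""
    relabelled = []
    for result in results:
        updated = dict(result)  # leave the caller's report untouched
        if updated["component"] == CLASS_COMPONENT:
            updated["severity"] = new_severity
        relabelled.append(updated)
    return relabelled


def has_violations(results):
    """True if at least one result is still reported as sh:Violation."""
    return any(r["severity"] == "sh:Violation" for r in results)


# Hypothetical report: one class-membership finding, one cardinality finding.
report = [
    {"path": "dcat:contactPoint", "component": CLASS_COMPONENT,
     "severity": "sh:Violation"},
    {"path": "dct:title", "component": "sh:MinCountConstraintComponent",
     "severity": "sh:Violation"},
]

weakened = downgrade_class_results(report)
print([r["severity"] for r in weakened])  # ['sh:Warning', 'sh:Violation']
```

Only the class-membership finding is relabelled; the cardinality finding stays a violation, so has_violations(weakened) is still True for this report.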

aidig commented 4 years ago

Also, very much agree with dcat-ap-de that examples of valid DCAT-AP dataset descriptions would be very useful indeed (https://github.com/SEMICeu/DCAT-AP/issues/121) especially seeing that the examples provided by W3C cannot be validated by the DCAT-AP validator.

bertvannuffelen commented 4 years ago

@aidig thanks for expressing this issue so clearly. It is indeed the case that the current shacl validation rules implement a very strict interpretation of all the constraints in the specification.

To address this issue, the DCAT-AP community should agree upon a generic approach for each of the validation rules. We should avoid the situation where one range constraint is at error level, another is a warning and a third is merely informative. We need clear rules; otherwise the shapes become unmaintainable.

Connected with this is, of course, the question of what the purpose of the DCAT-AP SHACL validation rules is. Are the rules in the distribution the canonical, machine-readable implementation of the constraints? Or are they meant to be used as-is in any implementation context, such as the EDP? This affects the organisation of the files, but also how the interpretation is done.

What is the relationship between the SHACL specification and the human-readable specification? E.g. if all range constraints are informative, then we should make that clear in the human-readable specification. Currently it reads as MUST, so it is logical that the SHACL interprets MUST as an error.

You also pointed out one of the key interpretation choices in using the SHACL rules: are they used with inference or not? May we assume that the data exchange performs correct inference, and that the state obtained after inference still satisfies the constraints if they were satisfied before? What background knowledge is assumed when the validation process is executed?
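To illustrate what "validation with inference" could mean for the range constraints, here is a minimal Python sketch that applies the RDFS range rule (if p has rdfs:range C and s p o holds, then o is of type C) before a class check. The ex: resources are hypothetical, and the two range axioms are assumed to be supplied as background knowledge from the DCAT and DCMI Terms vocabularies.

```python
RDF_TYPE = "rdf:type"
RDFS_RANGE = "rdfs:range"


def apply_range_entailment(data, ontology):
    """Add (o rdf:type C) for every (s p o) in data where (p rdfs:range C)."""
    inferred = {
        (o, RDF_TYPE, cls)
        for _, p, o in data
        for prop, rel, cls in ontology
        if rel == RDFS_RANGE and prop == p
    }
    return data | inferred


def class_failures(data, expectations):
    """List (value, expected class) pairs whose value lacks the expected rdf:type."""
    return sorted(
        (o, cls)
        for _, p, o in data
        for path, cls in expectations.items()
        if p == path and (o, RDF_TYPE, cls) not in data
    )


# Background knowledge (assumed): range axioms from the vocabularies.
ontology = {
    ("dcat:landingPage", RDFS_RANGE, "foaf:Document"),
    ("dct:language", RDFS_RANGE, "dct:LinguisticSystem"),
}
# Hypothetical dataset description without explicit typing of the objects.
data = {
    ("ex:dataset", "dcat:landingPage", "ex:page"),
    ("ex:dataset", "dct:language", "ex:english"),
}
expectations = {
    "dcat:landingPage": "foaf:Document",
    "dct:language": "dct:LinguisticSystem",
}

print(len(class_failures(data, expectations)))                                    # 2
print(len(class_failures(apply_range_entailment(data, ontology), expectations)))  # 0
```

Note the consequence of this design: once range entailment is applied, a class check that mirrors the same range can no longer fail, so such a check only carries information when validation runs without inference.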

So to the community: what are your answers to these questions, so that we can build a SHACL distribution that is complete (all constraints), corresponds to the human-readable text, and supports the creation of validation processes? I am looking forward to your feedback.

init-dcat-ap-de commented 4 years ago

SHACL is "for validating RDF graphs against a set of conditions." If DCAT-AP offers SHACL shapes, we should be able to use them in order to check if an RDF dataset is valid DCAT-AP data, as the EDP is using them at the moment. (We need them not only for DCAT-AP 2.0 but also for 1.2.1, but that's another story...)

At the moment, they can't be used for this purpose, because e.g. the object of dct:language will not have the class dct:LinguisticSystem. They can't be a direct conversion from the OWL ontology (and why should they be, it would be the same information in another dialect...)

Due to the missing inferencing of SHACL validators and them not following IRIs to external sources, I feel like the provided rules should be the minimum of what is considered a valid DCAT-AP RDF. At least for the rules that are a sh:Violation. Maybe it would be reasonable to provide two sets of rules, the minimum and an additional set with advanced rules?

aidig commented 4 years ago

Further related info can be found on the JoinUp page on the SHACL shapes in DCAT-AP context webinar - 26/06/2020: https://joinup.ec.europa.eu/collection/semantic-interoperability-community-semic/news/shacl-shapes-dcat-ap-webinar

The agenda included:

the organisation of the SHACL templates (files, per constraint, message texts, etc.);
the usage of the SHACL shapes for validation (which background knowledge to include);
handling implementation-specific requirements (addressing differences between the European Data Portal implementation and the DCAT-AP specifications); and
rules about how to express the constraints in the DCAT-AP specifications in SHACL.

On the importance of providing background knowledge with the SHACL templates for validation, the SEMIC Team proposed two sets of solutions.

• For the DCAT-AP specification: to create SHACL constraints for class membership in a separate file and create options in the DCAT-AP validator with/without class-membership;
• For the implementation part: the implementation could publish its constraints and assumptions against which it validates the input. This can be done by an aggregated SHACL file based on the DCAT-AP SHACL files.
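The first proposal can be pictured with a small Python sketch in which the class-membership constraints live in their own set and the validator takes an option to include them or not. The Constraint type, the two constraint sets and the ex: data are hypothetical simplifications for illustration, not the actual DCAT-AP SHACL files or the validator's interface.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Constraint:
    path: str
    kind: str      # "minCount" or "class"
    value: object  # minimum count, or the expected class


# Hypothetical split: base constraints vs. a separate class-membership set.
BASE = [Constraint("dct:title", "minCount", 1)]
CLASS_MEMBERSHIP = [Constraint("dct:language", "class", "dct:LinguisticSystem")]


def select_constraints(check_class_membership):
    """Aggregate the constraint sets according to the validator option."""
    return BASE + (CLASS_MEMBERSHIP if check_class_membership else [])


def validate(focus, triples, constraints):
    """Return the constraints that the focus node fails."""
    failures = []
    for c in constraints:
        values = [o for s, p, o in triples if s == focus and p == c.path]
        if c.kind == "minCount" and len(values) < c.value:
            failures.append(c)
        elif c.kind == "class":
            # Explicit typing only: no inference, no dereferencing of IRIs.
            if any((v, "rdf:type", c.value) not in triples for v in values):
                failures.append(c)
    return failures


data = {
    ("ex:dataset", "dct:title", "Example dataset"),
    ("ex:dataset", "dct:language", "ex:english"),
}

print(len(validate("ex:dataset", data, select_constraints(False))))  # 0
print(len(validate("ex:dataset", data, select_constraints(True))))   # 1
```

An implementation such as a portal could publish the result of select_constraints, together with any background triples it loads, as the aggregated description of what it actually validates against.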

SEMICeu / DCAT-AP

shacl - Background knowledge for validation #125