RDFLib / pySHACL

A Python validator for SHACL
Apache License 2.0
245 stars 63 forks source link

How pySHACL validates classes with parents seems to be wrong #173

Closed vemonet closed 1 year ago

vemonet commented 1 year ago

Hi, while using pySHACL to validate RDF complying with the BioLink model I noticed that how pySHACL handles validation of parents classes of a class is wrong.

Because pySHACL tries to validate an entity complies with the shapes of all its parent classes.

Reproduce

Install dependencies:

pip install pyshacl rdflib

Run this script:

import pyshacl
from rdflib import Graph

validate_rdf = {
    "@context": {
        "rdf": "http://www.w3.org/1999/02/22-rdf-syntax-ns#",
        "rdfs": "http://www.w3.org/2000/01/rdf-schema#",
        "dct": "http://purl.org/dc/terms/",
        "biolink": "https://w3id.org/biolink/vocab/"
    },
    "@type": "biolink:Drug",
    "biolink:category": "biolink:Drug",
    "biolink:id": "drugbank:DB00001",
    "rdfs:label": "Lepirudin",
    "dct:description": "Lepirudin is a protein-based direct thrombin inhibitor."
}

validate_g = Graph()
validate_g.parse(
    data=validate_rdf,
    format="ttl"
)

shacl_g = Graph()
shacl_g.parse(
    "https://raw.githubusercontent.com/vemonet/biolink-model/add-shacl-gen/biolink-model.shacl.ttl",
    format="ttl"
)

biolink_g = Graph()
biolink_g.parse(
    "https://raw.githubusercontent.com/biolink/biolink-model/master/biolink-model.owl.ttl",
    format="ttl"
)

conforms, _, results_text = pyshacl.validate(
    validate_g, 
    shacl_graph=shacl_g,
    ont_graph=biolink_g
)

print(results_text)

I tested it with python 3.9.15 on Fedora 37

You will get 23 errors

If you look at those errors, all of them are ClosedConstraintComponent, and the source shapes are all parents of the biolink:Drug shape.

What happens

By adding the BioLink ontology it adds the subclassOf hierarchy. This leads pySHACL to try to evaluate a node against the shapes of all its parent classes. This leads to misleading errors, especially if the parent shapes are closed.

For example, I want to validate an entity is a biolink:Drug, currently pySHACL tries to validate that my entity complies with the shapes of every parent classes (e.g. biolink:ChemicalSubstance and biolink:NamedThing).

But it does not seems to be correct, because my Drug might have additional properties that are not described in biolink:ChemicalSubstance and biolink:NamedThing. And pySHACL will always fail validating such valid classes right now

How to fix it

I managed to add a quick fix for this behavior here by commenting the part that was adding parent classes shapes to a node target: https://github.com/vemonet/pySHACL/blob/e51c658bba4dee896e54be243d191f7dd8b321c0/pyshacl/shape.py#L343:L350

Questions

Note that I don't know much about how pySHACL works, so I might be missing something, maybe there is a valid reason behind those checks, and the real issue comes from somewhere else. What is this reason? Why do we need these checks?

What could be a good solution to move forward? I can send a PR if you are interested by those changes already

@ashleysommer @ajnelson-nist

ashleysommer commented 1 year ago

Hi @vemonet I've taken some time to look into this issue that you've described.

Let me first take a moment to check that I correctly understand the issue you are facing:

#shapes.ttl
@prefix sh: <http://www.w3.org/ns/shacl#> .
@prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .
@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .
@prefix ex: <http://example.org#> .

ex:AnimalShape rdf:type sh:NodeShape ;
  sh:targetClass ex:Animal ;
  sh:property [
      sh:path ex:name;
    ];
  sh:property [
      sh:path ex:legsCount;
      sh:minInclusive 0;
    ];
  sh:closed true;
  sh:ignoredProperties ( rdf:type )
  .

ex:BirdShape rdf:type sh:NodeShape ;
  sh:targetClass ex:Bird ;
  sh:property [
    sh:path ex:name;
  ];
  sh:property [
      sh:path ex:wingsCount;
      sh:minInclusive 0;
    ];
  sh:property [
      sh:path ex:legsCount;
      sh:minInclusive 0;
    ];
  sh:closed true;
  sh:ignoredProperties ( rdf:type )
  .
#data.ttl
@prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .
@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .
@prefix ex: <http://example.org#> .
@prefix data: <http://data.org#> .

ex:Animal rdf:type rdfs:Class .
ex:Bird rdf:type rdfs:Class .
ex:Bird rdfs:subClassOf ex:Animal .

data:Cheetah;
  rdf:type ex:Animal ;
  ex:name "Cheetah";
  ex:legsCount 4;
 .

data:Eagle;
  rdf:type ex:Bird ;
  ex:name "Eagle";
  ex:legsCount 2;
  ex:wingsCount 2;
 .

I have included the RDFS typing information in the data graph, to avoid needing three graphs in this example.

(I am assuming you are running with no RDFS or OWL pre-inferencing enabled in the validator, because the issue you described only occurs when you are expecting no inferencing associations to be present.)

When I run validator on the example above, I get the following validation report:

Constraint Violation in ClosedConstraintComponent (http://www.w3.org/ns/shacl#ClosedConstraintComponent):
    Severity: sh:Violation
    Source Shape: ex:AnimalShape
    Focus Node: data:Eagle
    Value Node: Literal("2", datatype=xsd:integer)
    Result Path: ex:wingsCount
    Message: Node data:Eagle is closed. It cannot have value: Literal("2", datatype=xsd:integer)

Correct me if I'm wrong, as I understand it you are asserting that this example should pass validation. You are stating that the entity data:Eagle should not be a focus node of the NodeShape ex:AnimalShape, is that right?

Indeed, commenting out that section of code you mentioned, it skips adding subclasses of ex:Animal to its list of focus nodes for ex:AnimalShape, and this example does then validate successfully.

I'm sorry to conclude in this case the first result is true, and PySHACL is acting correctly. By commenting out that section of code, PySHACL no longer passes its internal W3C conformance tests, and would no longer be considered a compliant SHACL validator implementation.

The definition in the W3C SHACL Specification document for the implementation of sh:targetClass and implicit class targeting is quite clear. It states:

If s is a shape in a shapes graph SG and s has value c for sh:targetClass in SG then the set of SHACL instances of c in a data graph DG is a target from DG for s in SG.

That is hard to follow, but the important part is the definition of SHACL instances:

SHACL Class Instance A node n in an RDF graph G is a SHACL instance of a SHACL class C in G if one of the SHACL types of n in G is C.

And the definition of SHACL types:

SHACL Type The SHACL types of an RDF term in an RDF graph is the set of its values for rdf:type in the graph as well as the SHACL superclasses of these values in the graph.

And finally, the definition for SHACL superclasses:

SHACL Subclass, SHACL superclass A node Sub in an RDF graph is a SHACL subclass of another node Super in the graph if there is a sequence of triples in the graph each with predicate rdfs:subClassOf such that the subject of the first triple is Sub, the object of the last triple is Super, and the object of each triple except the last is the subject of the next. If Sub is a SHACL subclass of Super in an RDF graph then Super is a SHACL superclass of Sub in the graph.

The part of the code you have commented out is exactly the part that implements the subclass/superclass directives in the targeting specification described above, and removing it breaks compatibility with the specification.

If I have misunderstood the issue, please let me know, and we can explore it further.

If you still disagree with the intended behaviour, then it is no longer an issue with PySHACL, rather an issue with the SHACL standard itself. Either way, the best places to discuss this topic further are the SHACL W3C mailing list, and SHACL community Discord server, folks on there will be able to give you much better advice than I can.

ajnelson-nist commented 1 year ago

Hi @vemonet ,

The issue I see in the SHACL shapes is usage of sh:closed, which might have a farther-reaching (-constricting) effect than you may realize.

Per the SHACL spec, 4.8.1:

The SHACL Core language includes a property called sh:closed that can be used to specify the condition that each value node has values only for those properties that have been explicitly enumerated via the property shapes specified for the shape via sh:property.

Re-iterating: sh:closed disallows all extensibility from properties not mentioned via sh:property. rdf:type seems to get a pass for bootstrapping reasons (SHACL needs to know types for targeting purposes), but if there was no sh:PropertyShape for, say, rdfs:comment, your SHACL validation would fail if you put an rdfs:comment on the individual-node you're trying to validate.

So, this is going to be a conflict with your remark in your problem description:

my Drug might have additional properties that are not described in biolink:ChemicalSubstance and biolink:NamedThing.

biolink:NamedThing has sh:closed set to true. This unfortunately hampers making subclasses of biolink:NamedThing that will want to apply their own properties.

So, in my opinion, your shapes can't work with use of sh:closed. I haven't seen the documentation for this ontology, so I don't know why you were motivated to try sh:closed. My community had a similar desire for concept IRI typo-checking within our community namespaces, but had to resort to writing a piece of extension functionality in a pyshacl wrapper. What we wanted probably would have been satisfied if SHACL had a mechanism for something like sh:closed, but just scoped to a certain prefix.

vemonet commented 1 year ago

Thanks a lot @ashleysommer and @ajnelson-nist for the detailed explanations and suggestions! In our case the best solution will be probably to discuss with the ontology community to pass sh:closed as false

@ashleysommer you properly understood my assertion, I am not using rdfs inference, so it was surprising to face issues related to rdfs:subclassOf (note that I face the same error when enabling inference="rdfs" though). I understand it is directly stated by the SHACL specs, but then it feels like in practice closed=true can not be used with rdfs:subclassOf, even if inference is disabled...

A shape by essence aims to specify the constraints to define an entity, but if this entity needs to comply with the constraints of all its superclasses, then the constraints need to be loose (e.g. not closed). Which makes the constraints lose all their interest! I will dig a bit more to better understand, thanks again for all the pointers and helpful explanations