RDFLib / pySHACL

A Python validator for SHACL
Apache License 2.0

SPARQL Constraint Not Handling Multiple Results #190

Closed dandunbar23 closed 12 months ago

dandunbar23 commented 12 months ago

For SHACL SPARQL constraints with multiple results for a single target node, pySHACL only returns a single result. For example, consider the data graph:

@prefix owl: <http://www.w3.org/2002/07/owl#> .
@prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .
@prefix xml: <http://www.w3.org/XML/1998/namespace> .
@prefix xsd: <http://www.w3.org/2001/XMLSchema#> .
@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .
@prefix sh:    <http://www.w3.org/ns/shacl#> .
@prefix ex:    <http://www.semanticweb.org/shacltest#> .

ex:has_relation rdf:type owl:ObjectProperty .

ex:Class1 rdf:type owl:Class .

ex:Class2 rdf:type owl:Class .

ex:Class3 rdf:type owl:Class .

ex:instance1Class1 rdf:type owl:NamedIndividual ,
                          ex:Class1 ;
                ex:has_relation ex:instance1Class2 ;
                ex:has_relation ex:instance2Class2 ;
                ex:has_relation ex:instance3Class2 .

ex:instance1Class2 rdf:type owl:NamedIndividual ,
                          ex:Class2 ;
                ex:has_relation ex:instance1Class3 .

ex:instance2Class2 rdf:type owl:NamedIndividual ,
                          ex:Class2 .

ex:instance3Class2 rdf:type owl:NamedIndividual ,
                          ex:Class2 .

ex:instance1Class3 rdf:type owl:NamedIndividual ,
                          ex:Class3 .

And the Shape file:

@prefix rdf:   <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .
@prefix sh:    <http://www.w3.org/ns/shacl#> .
@prefix xsd:   <http://www.w3.org/2001/XMLSchema#> .
@prefix rdfs:  <http://www.w3.org/2000/01/rdf-schema#> .
@prefix ex:    <http://www.semanticweb.org/shacltest#> .
@prefix owl:   <http://www.w3.org/2002/07/owl#> .

ex:multipleSPARQLResults
    a sh:NodeShape ;
    sh:targetClass ex:Class1;
    sh:sparql [
        a sh:SPARQLConstraint ;
        sh:message "{?Class2Instance} does not have a related Class3 Instance" ;
        sh:select  """
            PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
            PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
            PREFIX owl: <http://www.w3.org/2002/07/owl#>
            PREFIX ex: <http://www.semanticweb.org/shacltest#>
            select * where {
                ?this a ex:Class1 .
                ?Class2Instance a ex:Class2 .
                ?this ex:has_relation ?Class2Instance .
                OPTIONAL {
                    ?Class2Instance ex:has_relation ?Class3Instance .
                    ?Class3Instance a ex:Class3 .
                }
                FILTER (!bound(?Class3Instance))
            }
        """ ;
        ] .

Here, the SPARQL constraint considers a target node (an instance of Class1) with a rule stating that every instance of Class2 related to the Class1 target node must have a relation to a Class3 instance. The data graph shows one instance that passes (instance1Class2) and two that fail (instance2Class2 and instance3Class2). The SPARQL query identifies and returns both failures. However, pySHACL only reports a single failure.
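For reference, the expected outcome can be checked without a SPARQL engine. Below is a minimal plain-Python sketch (the triple list and function name are illustrative, not part of pySHACL) that applies the same logic as the query to the data graph above:

```python
# Triples mirroring the data graph above (prefixes dropped for brevity).
triples = {
    ("instance1Class1", "a", "Class1"),
    ("instance1Class2", "a", "Class2"),
    ("instance2Class2", "a", "Class2"),
    ("instance3Class2", "a", "Class2"),
    ("instance1Class3", "a", "Class3"),
    ("instance1Class1", "has_relation", "instance1Class2"),
    ("instance1Class1", "has_relation", "instance2Class2"),
    ("instance1Class1", "has_relation", "instance3Class2"),
    ("instance1Class2", "has_relation", "instance1Class3"),
}

def failing_class2_instances(this, triples):
    """Class2 instances related to `this` that lack a related Class3 instance."""
    related_c2 = [o for s, p, o in triples
                  if s == this and p == "has_relation"
                  and (o, "a", "Class2") in triples]
    failures = []
    for c2 in related_c2:
        has_c3 = any((c2, "has_relation", o) in triples
                     and (o, "a", "Class3") in triples
                     for _, _, o in triples)
        if not has_c3:
            failures.append(c2)
    return failures

# Two distinct failures for the single target node instance1Class1:
print(sorted(failing_class2_instances("instance1Class1", triples)))
# prints ['instance2Class2', 'instance3Class2']
```

So a conformant validator should report two violations here, one per failing Class2 instance.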

This occurs because of lines 183-184 in sparql_based_constraints.py:

                    if (t, p, v) in dedup_set:
                        continue

Where t is the SPARQL ?this variable, i.e. the target node. After the first result is logged, the (t, p, v) tuple is added to dedup_set, so additional results (although they represent additional, unique failures) are skipped by the guard above. If I comment out lines 183-184, it works as expected and shows two failures.
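The effect of the guard can be shown in isolation. The following is a simplified sketch (hypothetical names, not pySHACL's actual code) of deduplicating on the (t, p, v) tuple alone:

```python
# Two distinct SPARQL result rows for the same focus node. The focus node (t),
# result path (p), and value (v) are identical; only the bound variables differ.
results = [
    {"this": "instance1Class1", "Class2Instance": "instance2Class2"},
    {"this": "instance1Class1", "Class2Instance": "instance3Class2"},
]

def collect_violations(results, path=None, value=None):
    violations = []
    dedup_set = set()
    for row in results:
        t = row["this"]          # the target node bound to ?this
        key = (t, path, value)   # guard keys only on (t, p, v)...
        if key in dedup_set:     # ...so the second, distinct failure is skipped
            continue
        dedup_set.add(key)
        violations.append(row)
    return violations

print(len(collect_violations(results)))  # prints 1, though there are 2 failures
```

Because both rows collapse to the same (t, p, v) key, only one violation survives the guard.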

I'm not sure if there is a spec issue that requires the dedup_set guard, but if there isn't, I'd recommend removing those lines to allow for multiple results. I'm new to open source contribution, but if there is agreement, I can try to make the changes myself and issue a PR.

Thank you for the help!

ashleysommer commented 12 months ago

Hi @dandunbar23 Thank you for the very detailed and helpful issue report. Your detective work is appreciated and it appears you have found a valid bug.

I believe it should be fine to simply remove that dedup logic as you suggested, but first I will need to double-check the spec and run some validation tests on that change. I did put that dedup check in for a reason, but I forget what that reason was and it may no longer be needed.

dandunbar23 commented 12 months ago

Thanks @ashleysommer

I created PR #191, but instead of just removing it, I added a check on the violations list for duplicates. Since the violations list includes the var_dict, which contains results specific data, it allows for a check on the SPARQL results for duplicates while still returning multiple results for the same target node. Let me know what you think.
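The idea behind the PR can be sketched like so (a simplified illustration with hypothetical names, not the actual PR diff): key the dedup on the full variable bindings rather than only (t, p, v).

```python
# Two distinct result rows for the same focus node, plus one exact duplicate.
rows = [
    {"this": "instance1Class1", "Class2Instance": "instance2Class2"},
    {"this": "instance1Class1", "Class2Instance": "instance3Class2"},
    {"this": "instance1Class1", "Class2Instance": "instance3Class2"},
]

def collect_violations(rows, path=None, value=None):
    violations, seen = [], set()
    for row in rows:
        # Key on the full variable bindings (the var_dict analogue), not just
        # (t, p, v): distinct rows for one focus node are all reported, while
        # exact duplicate rows are still suppressed.
        key = (row["this"], path, value, tuple(sorted(row.items())))
        if key in seen:
            continue
        seen.add(key)
        violations.append(row)
    return violations

print(len(collect_violations(rows)))  # prints 2: both failures, duplicate dropped
```

This preserves whatever protection the original dedup provided against genuinely duplicated query rows while still reporting every unique failure.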

Thanks again.

ashleysommer commented 12 months ago

Perfect, thanks

ashleysommer commented 12 months ago

Fixed by #191