RDFLib / pySHACL

A Python validator for SHACL
Apache License 2.0

Focus Node equals Value Node in range shape #136

Closed KonradHoeffner closed 2 years ago

KonradHoeffner commented 2 years ago

This is most probably not a bug but a feature request to make the log more useful in a specific case. When I declare a shape for the common range for properties, the error message does not indicate where the error occurs.

Example

SHAPE

:EntityTypeRangeShape a sh:NodeShape;
    sh:targetObjectsOf
        meta:uses,
        meta:entityTypeComponent;
    sh:class :EntityType.

Data excerpt

bb:HealthCareInstitution a meta:Role.

bb:HealthCareNetwork
    meta:entityTypeComponent bb:HealthCareInstitution;
    rdf:type meta:EntityType.

pySHACL log

Constraint Violation in ClassConstraintComponent (http://www.w3.org/ns/shacl#ClassConstraintComponent):
        Severity: sh:Violation
        Source Shape: meta:EntityTypeRangeShape
        Focus Node: bb:HealthCareInstitution
        Value Node: bb:HealthCareInstitution
        Message: Value does not have class meta:EntityType

The pySHACL log does not show me the offending subject and predicate, only the object, which appears as both the focus node and the value node. However, when I edit RDF Turtle files in a text editor, the triples are grouped by subject, so it is quite cumbersome to find out exactly which statement caused the error. Alternatively or additionally, it would be helpful to see on which line of the file the violation occurs. It is easy to see in this minimal example, but in reality the file has thousands of triples and many usages of the same object.

How I wish the pySHACL log would be

Constraint Violation in ClassConstraintComponent (http://www.w3.org/ns/shacl#ClassConstraintComponent):
        Severity: sh:Violation
        Source Shape: meta:EntityTypeRangeShape
        Subject Node: bb:HealthCareNetwork
        Property Node: meta:entityTypeComponent
        Value Node: bb:HealthCareInstitution
        Message: Value does not have class meta:EntityType
        Place: Line 1154 in file bb.ttl

ashleysommer commented 2 years ago

I know you're talking about the Human textual output here, which could potentially include any output we want, but in PySHACL, the human output is built from the SHACL Validation Results Graph output. The validation graph is made up of Validation Result objects. Those are what we see as "Constraint Violation" blocks in the textual output.

The Validation Result objects can each have a set of properties that are defined in the Specification Document. They are:

3.6.2.1 Focus node (sh:focusNode)
3.6.2.2 Path (sh:resultPath)
3.6.2.3 Value (sh:value)
3.6.2.4 Source (sh:sourceShape)
3.6.2.5 Constraint Component (sh:sourceConstraintComponent)
3.6.2.6 Details (sh:detail)
3.6.2.7 Message (sh:resultMessage)
3.6.2.8 Severity (sh:resultSeverity)

The manner in which each of those properties is populated for each given Shape kind and Constraint kind is also defined in the specification. There is very little wiggle room to change the format.

As for your requests:

1) Subject Node: bb:HealthCareNetwork

When using targetObjectsOf, the SHACL engine finds all nodes which are the object of the given predicate, and uses them as both the Focus Nodes and the Value Nodes for the constraints. When the constraints are evaluated against those value nodes, neither the RDF engine nor the constraint component has any information about the Subject Node for which the value was defined. If you wish to capture that information, you could change your SHACL Shape to target that subject (using targetSubjectsOf) and use a sh:property shape with a path definition to get your value nodes.

Eg, something like this:

:EntityTypeRangeShape a sh:NodeShape;
    sh:targetSubjectsOf
        meta:uses,
        meta:entityTypeComponent ;
    sh:property [
        sh:path [ sh:alternativePath ( meta:uses meta:entityTypeComponent ) ] ;
        sh:class :EntityType ;
    ] .

(I haven't tested that, it may not work as expected, it is an example of an alternative solution).

2) Place: Line 1154 in file bb.ttl

PySHACL runs validation on an RDF Graph, not a file. If you give PySHACL a file path, it will parse it into a graph before validating, but it has no knowledge of the original file after it is loaded. When validating against a graph, it is operating on Nodes in that graph, not lines in a file. There is no inherent relationship between a node in a graph, and the line of the file it came from.

A node in a graph may not even have come from a file, e.g. a graph may be constructed by queries against an API endpoint (like a SPARQL Graph), or it may have been built by an algorithm at runtime, or inflated from a different graph using inferencing expansion rules. In all those cases, there is no "Place" that any given node came from, so it's not possible to reflect that in the ValidationResult output.
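When the data does come from a Turtle file, a rough workaround outside of PySHACL is a plain text search for lines that mention both the predicate and the object. The sketch below is only a heuristic under strong assumptions: it knows nothing about Turtle syntax, it expects the terms to be written exactly as given (same prefix, no full IRIs), and it misses statements whose predicate and object sit on different lines. The `find_lines` helper is hypothetical.

```python
import re


def find_lines(turtle_text, *terms):
    """Return (line_number, stripped_line) pairs for lines mentioning every term.

    Terms are matched as whole tokens, so "bb:Health" does not match
    "bb:HealthCareInstitution".
    """
    patterns = [
        re.compile(r"(?<![\w-])" + re.escape(term) + r"(?![\w-])")
        for term in terms
    ]
    hits = []
    for number, line in enumerate(turtle_text.splitlines(), start=1):
        if all(pattern.search(line) for pattern in patterns):
            hits.append((number, line.strip()))
    return hits


sample = """bb:HealthCareInstitution a meta:Role.

bb:HealthCareNetwork
    meta:entityTypeComponent bb:HealthCareInstitution;
    rdf:type meta:EntityType.
"""

print(find_lines(sample, "meta:entityTypeComponent", "bb:HealthCareInstitution"))
# [(4, 'meta:entityTypeComponent bb:HealthCareInstitution;')]
```

For a real file, the text would be read with `open(path, encoding="utf-8").read()` before calling the helper.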

I hope this sheds some light on why the output is how it is, and yes I wish it could be more informative for tracking down bugs too.

KonradHoeffner commented 2 years ago

Wow, thanks for the extremely detailed explanation! I really appreciate that you take so much time to answer my questions.

I will rewrite the shape as you suggest and also split it up between the different properties (only one object for sh:targetSubjectsOf each) so that the name of the property is included in the shape in order to get that information as well. While that will increase the length of the SHACL file, the improved debugging will make it more than worth it because right now I sometimes have to manually go through 100 or more triples to find the right one. Maybe I will write a script to automate the process to transform my domain and range statements into SHACL.

It is unfortunate that the line number cannot be found out because then one could integrate it into an IDE like Visual Studio / OSS Code and get instant feedback on Turtle files like with a programming language.

ashleysommer commented 2 years ago

> It is unfortunate that the line number cannot be found out because then one could integrate it into an IDE like Visual Studio / OSS Code and get instant feedback on Turtle files like with a programming language.

I have some rough ideas about how I might be able to implement that, but it would have to be done at a very low level, e.g. in the RDFLib Turtle Parser. It would be a big effort, but I know a lot of people would benefit from a feature like that.