fluree / db

Fluree database library
https://fluree.github.io/db/
Other
333 stars 21 forks source link

improve SHACL error messages #755

Closed dpetran closed 2 months ago

dpetran commented 3 months ago

This is a complete overhaul of our SHACL validation. We now produce something similar to a sh:ValidationReport, and are much more closely aligned with the SHACL spec and official compliance test suite.

Fixes https://github.com/fluree/db/issues/565

Architectural outline

Before, we would try to query for shapes that targeted any modified subject. This was just a heuristic, and was not necessarily 100% correct. Now, we assume that all node shapes have @type sh:NodeShape, and we fetch them all. This ensures that nothing is missed. It also allows us to remove the subj-mods atom. There is a :schema :shapes cache on the db that is now just a shape-sid->shape map, which is reset if there are any new flakes with type sh:NodeShape, sh:PropertyShape, or sh:path (since property shapes will rarely have type sh:PropertyShape, but are required to havesh:path).

Then, for each modified subject, we check each shape, first checking that it is active (via official support for sh:deactivated), and then we check each shape's target declarations to see if the modified subject is targeted. If it is targeted, we proceed to validate the constraints in the shape.

Before we only supported sh:targetClass and sh:targetObjectsOf target declarations. Now we support all SHACL target declarations:

Constraint validation is driven by the validate-constraint multimethod. Each validator takes a shape, a focus node, a set of value nodes, and returns nil if there are no violations, or a sequence of sh:ValidationResult-like results. Each subject has every constraint validated to produce a complete (as opposed to partial) list of violations. Once all the constraints have been checked, if a subject has any violations we halt further processing and throw a validation error.

Design decisions

In the course of implementing this, I made some design decisions that I think are open to discussion:

Our errors consist of a string message and a map with validation results as data. If there are multiple validation results, there are multiple error messages - I'm just concatting them with a newline separator, but maybe it should be a semicolon and they are all on one line, or some other format. I just picked one that made sense.

In the same vein, the validation results are a map that closely parallels SHACL sh:ValidationResult https://www.w3.org/TR/shacl/#results-validation-result

| fluree      | SHACL                        |
|-------------+------------------------------|
| :subject    | sh:focusNode                 |
| :path       | sh:path                      |
| :value      | sh:value                     |
| :shape      | sh:sourceShape               |
| :constraint | sh:sourceConstraintComponent |
| :message    | sh:resultMessage             |
| -           | sh:detail                    |
| -           | sh:resultSeverity            |
| :expect     | -                            |

Should they just map directly? Should we have string keys? I feel like we could really just lean into the existing SHACL vocabulary here, since the semantics should match. Feedback on the errors is very welcome, because it's easy to change, but I'd like to do it now rather than in the future.

Future work

SHACL constraints we still don't support:

Adding support for these now is very simple.

Performance should be comparable to what it was before, though in this implementation we are creating more channels than is strictly necessary. I think we could also be more clever with data lookups and caching, but I'm inclined to wait until that's a bottleneck rather than go at it without any measurements in hand.