This is a complete overhaul of our SHACL validation. We now produce something similar to a `sh:ValidationReport`, and are much more closely aligned with the SHACL spec and official compliance test suite.

Fixes https://github.com/fluree/db/issues/565
## Architectural outline

Before, we would try to query for shapes that targeted any modified subject. This was just a heuristic, and was not necessarily 100% correct. Now, we assume that all node shapes have `@type` `sh:NodeShape`, and we fetch them all. This ensures that nothing is missed. It also allows us to remove the `subj-mods` atom. The `:schema :shapes` cache on the db is now just a `shape-sid->shape` map, which is reset if there are any new flakes with type `sh:NodeShape` or `sh:PropertyShape`, or with predicate `sh:path` (since property shapes will rarely have type `sh:PropertyShape`, but are required to have `sh:path`).
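The cache-reset check can be sketched roughly as follows. This is a minimal illustration, not the actual implementation: it assumes flakes are `[s p o]` triples and that a `sid->iri` lookup resolves subject ids to IRIs.

```clojure
;; Illustrative sketch only; the real flake and id-resolution model differs.
(defn reset-shape-cache?
  "True when any new flake could define or change a shape, meaning the
  shape-sid->shape cache may be stale: either the predicate is sh:path,
  or the object is sh:NodeShape / sh:PropertyShape (an rdf:type flake)."
  [new-flakes sid->iri]
  (boolean
   (some (fn [[_ p o]]
           (or (= "http://www.w3.org/ns/shacl#path" (sid->iri p))
               (contains? #{"http://www.w3.org/ns/shacl#NodeShape"
                            "http://www.w3.org/ns/shacl#PropertyShape"}
                          (sid->iri o))))
         new-flakes)))
```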
Then, for each modified subject, we check each shape: first we check that it is active (via official support for `sh:deactivated`), and then we check the shape's target declarations to see whether the modified subject is targeted. If it is, we proceed to validate the constraints in the shape.
Before, we only supported `sh:targetClass` and `sh:targetObjectsOf` target declarations. Now we support all SHACL target declarations:

- `sh:targetSubjectsOf`
- `sh:targetNode`
- implicit targets (where a `sh:NodeShape` has another `@type`, it implicitly targets subjects that share that type)
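The active-and-targeted check above might look roughly like this. It is a sketch under assumed names: keys such as `:target-class`, `:target-node`, and `:implicit-classes` are illustrative, not the real shape keys.

```clojure
;; Illustrative sketch only; the real shape data model differs.
(defn targets?
  "True when an active shape targets the given subject node.
  Covers sh:deactivated, sh:targetNode, sh:targetClass, and
  implicit class targets; other declarations are elided here."
  [{:keys [deactivated? target-class target-node implicit-classes]} subject]
  (boolean
   (and (not deactivated?)
        (or (contains? (set target-node) (:id subject))
            (some (set (:type subject)) target-class)
            (some (set (:type subject)) implicit-classes)))))
```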
Constraint validation is driven by the `validate-constraint` multimethod. Each validator takes a shape, a focus node, and a set of value nodes, and returns `nil` if there are no violations, or a sequence of `sh:ValidationResult`-like results. Each subject has every constraint validated, to produce a complete (as opposed to partial) list of violations. Once all the constraints have been checked, if a subject has any violations we halt further processing and throw a validation error.
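For illustration, a validator in this style could be sketched as below. The dispatch mechanism and result keys here are assumptions for the sketch; the real `validate-constraint` may dispatch and shape its results differently.

```clojure
;; Sketch of the validate-constraint pattern, not the actual implementation.
;; Dispatches on an explicit constraint keyword for illustration.
(defmulti validate-constraint
  "Returns nil when the constraint holds, else a seq of result maps."
  (fn [constraint _shape _focus-node _value-nodes] constraint))

(defmethod validate-constraint :sh/minCount
  [_ shape focus-node value-nodes]
  (let [min-count (:sh/minCount shape)]
    (when (< (count value-nodes) min-count)
      ;; One sh:ValidationResult-like map per violation.
      [{:sh/focusNode                 focus-node
        :sh/sourceConstraintComponent :sh/MinCountConstraintComponent
        :sh/resultMessage             (str "expected at least " min-count
                                           " values, got "
                                           (count value-nodes))}])))
```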
## Design decisions
In the course of implementing this, I made some design decisions that I think are open to discussion:
Our errors consist of a string message and a map with validation results as data. If there are multiple validation results, there are multiple error messages - I'm just concatting them with a newline separator, but maybe it should be a semicolon and they are all on one line, or some other format. I just picked one that made sense.
In the same vein, the validation results are a map that closely parallels SHACL's `sh:ValidationResult` (https://www.w3.org/TR/shacl/#results-validation-result). Should they just map directly? Should we have string keys? I feel like we could really just lean into the existing SHACL vocabulary here, since the semantics should match. Feedback on the errors is very welcome, because it's easy to change, but I'd like to do it now rather than in the future.
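Concretely, the error shape under discussion looks something like the sketch below: one message string per result, joined with newlines, with the result maps attached as data. The `validation-error` name and the `:report` key are illustrative assumptions, not the actual names in the code.

```clojure
(require '[clojure.string :as str])

;; Hypothetical sketch of the current error format; the exact data
;; shape is exactly what's open for discussion above.
(defn validation-error
  "Builds one exception from a seq of validation-result maps,
  joining the individual result messages with a newline separator."
  [results]
  (ex-info (str/join "\n" (map :sh/resultMessage results))
           {:report results}))
```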
## Future work
SHACL constraints we still don't support:

- `sh:languageIn`
- `sh:uniqueLang`
- `sh:and`
- `sh:or`
- `sh:xone`
Adding support for these now is very simple.
Performance should be comparable to what it was before, though in this implementation we are creating more channels than is strictly necessary. I think we could also be more clever with data lookups and caching, but I'm inclined to wait until that's a bottleneck rather than go at it without any measurements in hand.