TREEcg / extract-cbd-shape

Get all RDF triples/quads related to an entity based on CBD and a SHACL shape
https://treecg.github.io/extract-cbd-shape/
MIT License

Extract CBD Shape

Given (i) an RdfStore (see rdf-stores) of triples, (ii) an RdfStore with a SHACL shape’s triples, and (iii) a target entity URI, this library will extract all triples that belong to the entity. If more triples of the entity are needed, extra triples are retrieved by dereferencing the relevant entity.

The algorithm is a proposal to be standardized as part of W3C’s TREE hypermedia Community Group as the member extraction algorithm. This algorithm needs to be efficient and unambiguously defined, so that various implementations of the member extraction algorithm will result in the same set of triples. As a trade-off, the resulting set of triples is not guaranteed to be validated by the SHACL shape.

The algorithm is inspired by CBD and Shape Fragments and sits in between the two. Thanks go to Thomas Bergwinkl and his blog post on a SHACL engine.

Use it

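npm install extract-cbd-shape

import { CBDShapeExtractor } from "extract-cbd-shape";
// ...
// shapesGraph: the RdfStore holding the SHACL shape’s triples
const extractor = new CBDShapeExtractor(shapesGraph);
// store: the RdfStore with the data triples; entityId: the target entity
const entityquads = await extractor.extract(store, entityId, shapeId, graphsToIgnore);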

Test it

Tests and examples are provided in the tests folder. Run them with mocha, which is invoked through npm test.

The extraction algorithm

This is an extension of CBD. It extracts:

  1. all quads with this entity as subject, and their blank node triples (recursively)
  2. all quads in a named graph matching the entity we’re looking up
  3. additional quads, based on hints taken from a Shape Template (see below)
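The first two extraction rules above can be sketched in a few lines. This is a minimal illustration, not the library's implementation: TermSketch, QuadSketch and extractCbd are hypothetical names, quads live in a plain array instead of an RdfStore, and the example.org data is made up.

```typescript
// Minimal stand-ins for RDF terms and quads (assumptions, not the library's types).
type TermSketch = { kind: "named" | "blank" | "literal"; value: string };
// g is the graph IRI; "" stands for the default graph.
type QuadSketch = { s: TermSketch; p: TermSketch; o: TermSketch; g: string };

const named = (value: string): TermSketch => ({ kind: "named", value });
const blank = (value: string): TermSketch => ({ kind: "blank", value });
const literal = (value: string): TermSketch => ({ kind: "literal", value });

// Collect quads with the entity as subject (following blank node objects
// recursively), plus every quad in the named graph identified by the entity.
function extractCbd(quads: QuadSketch[], entity: string): QuadSketch[] {
  const result: QuadSketch[] = [];
  const add = (q: QuadSketch): void => {
    if (result.indexOf(q) === -1) result.push(q); // each quad is kept only once
  };
  const visited: Record<string, boolean> = {};
  const queue: TermSketch[] = [named(entity)];
  while (queue.length > 0) {
    const subject = queue.pop() as TermSketch;
    const key = subject.kind + ":" + subject.value;
    if (visited[key]) continue; // guards against blank node cycles
    visited[key] = true;
    for (const q of quads) {
      if (q.s.kind === subject.kind && q.s.value === subject.value) {
        add(q);
        if (q.o.kind === "blank") queue.push(q.o); // only blank nodes are followed
      }
    }
  }
  for (const q of quads) {
    if (q.g === entity) add(q);
  }
  return result;
}

// Usage with made-up data.
const ex = "http://example.org/";
const otherEntityQuad: QuadSketch =
  { s: named(ex + "b"), p: named(ex + "name"), o: literal("B"), g: "" };
const sampleQuads: QuadSketch[] = [
  { s: named(ex + "a"), p: named(ex + "name"), o: literal("A"), g: "" },
  { s: named(ex + "a"), p: named(ex + "address"), o: blank("b0"), g: "" },
  { s: blank("b0"), p: named(ex + "city"), o: literal("Ghent"), g: "" },
  { s: named(ex + "a"), p: named(ex + "knows"), o: named(ex + "b"), g: "" },
  otherEntityQuad,
  { s: named(ex + "c"), p: named(ex + "note"), o: literal("in a's graph"), g: ex + "a" },
];
const cbdResult = extractCbd(sampleQuads, ex + "a");
```

The quad describing ex:b is left out because named node objects are not followed, while the quad stored in the graph named after ex:a is included even though its subject is a different node.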

To be discussed:

  1. Should it also extract all RDF reification quads? (Included in the original CBD)
  2. Should it also extract all singleton properties?
  3. Should it also extract RDF* annotations?

The first focus node is set by the user.

  1. Depending on whether a shape is set:
    • If a shape is set, create a shape template and execute the shape template extraction algorithm
    • If no shape was set, extract all quads with the focus node as subject, and recursively include its blank nodes (see also CBD)
  2. Extract all quads with the graph matching the focus node
  3. When no quads were extracted from 1 and 2, a client MUST dereference the focus node and re-execute 1 and 2.
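The dereference fallback described above could look roughly like the sketch below for the case where no shape is set. It is an illustration under assumptions: FlowQuad, Dereference and extractWithFallback are hypothetical names, the dereference function stands in for an HTTP request plus parsing, and blank node recursion is left out to keep the control flow visible.

```typescript
// Simplified quad: only the parts this sketch needs (an assumption).
type FlowQuad = { subject: string; graph: string };

// Hypothetical stand-in for "fetch the IRI and parse the response into quads".
type Dereference = (iri: string) => FlowQuad[];

function extractWithFallback(local: FlowQuad[], focus: string, dereference: Dereference) {
  // Quads with the focus node as subject, or in the graph named after it.
  const select = (quads: FlowQuad[]): FlowQuad[] =>
    quads.filter(q => q.subject === focus || q.graph === focus);

  let found = select(local);
  let dereferenced = false;
  if (found.length === 0) {
    // Nothing is known locally: dereference the focus node and run both selections again.
    found = select(local.concat(dereference(focus)));
    dereferenced = true;
  }
  return { found, dereferenced };
}

// Usage with made-up data and a fake dereference function that counts its calls.
const localQuads: FlowQuad[] = [{ subject: "http://example.org/x", graph: "" }];
let requests = 0;
const fakeDereference: Dereference = (iri) => {
  requests++;
  return [{ subject: iri, graph: "" }];
};
const known = extractWithFallback(localQuads, "http://example.org/x", fakeDereference);
const unknown = extractWithFallback(localQuads, "http://example.org/y", fakeDereference);
```

Only the second call triggers a request, because the first focus node already has quads in the local data.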

Shape Template extraction

The Shape Template is a structure that looks as follows:

class ShapeTemplate {
    closed: boolean;
    requiredPaths: Path[];
    optionalPaths: Path[];
    nodelinks: NodeLink[];
    atLeastOneLists: [ Shape[] ];
}
class NodeLink {
    shape: ShapeTemplate;
    path: Path;
}

Paths in the shape templates are SHACL Property Paths.

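A Shape Template has a closed flag, a list of required paths, a list of optional paths, a list of node links (each pairing a path with a nested Shape Template), and a list of at-least-one lists of shapes.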

Note: Certain quads are going to be matched by the algorithm multiple times. Each quad will of course be part of the member only once.

This results in this algorithm:

  1. If the shape is open (not closed), a client MUST extract all quads with the focus node as subject, after a potential HTTP request to the focus node, and recursively include its blank nodes (see also CBD)
  2. If the current focus node is a named node and it was not requested before:
    • test if all required paths are set. If not, do an HTTP request. If they are set, then
    • test if at least one of each list in the atLeastOneLists was set. If not, do an HTTP request.
  3. Visit all paths (required, optional, nodelinks and, recursively, the shapes in the atLeastOneLists if the shape is valid) and add all quads necessary to reach the targets to the result
  4. For the results of nodelinks, if the target is a named node, set it as a focus node and repeat this algorithm with that nodelink’s shape as a shape
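The decision of when to do an HTTP request can be sketched as a pure function. This is a simplified illustration, not the library's code: TemplateSketch, hasAllRequired and needsRequest are hypothetical names, every path is reduced to a single predicate IRI (real SHACL property paths can be sequences, inverses and so on), and nested atLeastOneLists inside the alternatives are not checked.

```typescript
type CheckQuad = { subject: string; predicate: string; object: string };

interface TemplateSketch {
  closed: boolean;
  requiredPaths: string[]; // simplification: one predicate IRI per path
  optionalPaths: string[];
  atLeastOneLists: TemplateSketch[][];
}

// True when every required path has at least one matching quad on the focus node.
function hasAllRequired(t: TemplateSketch, quads: CheckQuad[], focus: string): boolean {
  return t.requiredPaths.every(path =>
    quads.some(q => q.subject === focus && q.predicate === path));
}

function needsRequest(
  t: TemplateSketch,
  quads: CheckQuad[],
  focus: string,
  isNamedNode: boolean,
  alreadyRequested: string[],
): boolean {
  // Blank nodes cannot be dereferenced, and a node is requested at most once.
  if (!isNamedNode || alreadyRequested.indexOf(focus) !== -1) return false;
  if (!hasAllRequired(t, quads, focus)) return true;
  // Each list needs at least one alternative whose required paths are all present.
  return !t.atLeastOneLists.every(list =>
    list.some(alternative => hasAllRequired(alternative, quads, focus)));
}

// Usage with made-up data: a name is required, plus an email or a phone number.
const base = "http://example.org/";
const leaf = (path: string): TemplateSketch =>
  ({ closed: false, requiredPaths: [path], optionalPaths: [], atLeastOneLists: [] });
const contactTemplate: TemplateSketch = {
  closed: true,
  requiredPaths: [base + "name"],
  optionalPaths: [],
  atLeastOneLists: [[leaf(base + "email"), leaf(base + "phone")]],
};
const knownQuads: CheckQuad[] = [
  { subject: base + "x", predicate: base + "name", object: "X" },
  { subject: base + "x", predicate: base + "phone", object: "123" },
  { subject: base + "y", predicate: base + "name", object: "Y" },
];
const completeNode = needsRequest(contactTemplate, knownQuads, base + "x", true, []);
const missingAlternative = needsRequest(contactTemplate, knownQuads, base + "y", true, []);
const missingRequired = needsRequest(contactTemplate, knownQuads, base + "z", true, []);
const requestedBefore = needsRequest(contactTemplate, knownQuads, base + "z", true, [base + "z"]);
```

Here only the node with a name and a phone number is complete. The other two nodes trigger a request, unless they were already requested earlier.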

Generating a shape template from SHACL

If there’s a shape set, the SHACL shape MUST be converted into a Shape Template as follows:

  1. Check if the shape is deactivated (:S sh:deactivated true). If it is, don’t continue.
  2. Check if the shape is closed (:S sh:closed true). If it is, set the closed boolean to true.
  3. Add all sh:property elements with an sh:node link to the shape’s NodeLinks array
  4. Add all properties with sh:minCount > 0 to the Required Paths array, and all others to the optional paths.
  5. Process the conditionals sh:xone, sh:or and sh:and (sh:not is not processed):
    • sh:and: all properties on that shape template MUST be merged with the current shape template
    • sh:xone and sh:or: in both cases, at least one item must match at least one quad for all required paths. If not, an HTTP request is made to the current named node.
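As an illustration of these conversion rules, the sketch below turns a simplified, already parsed shape description into a template. Everything in it is an assumption made for the example: NodeShapeSketch, PropertyShapeSketch, BuiltTemplate and toTemplate are hypothetical names, paths are plain predicate IRIs, and a property with an sh:node link is recorded as a node link and is also classified by its minCount, which the text above leaves open. Recursive shapes are not handled.

```typescript
// Simplified input: a SHACL node shape already read into plain objects (an assumption).
interface PropertyShapeSketch { path: string; minCount?: number; node?: NodeShapeSketch }
interface NodeShapeSketch {
  deactivated?: boolean;
  closed?: boolean;
  properties: PropertyShapeSketch[];
  and?: NodeShapeSketch[];
  or?: NodeShapeSketch[];
  xone?: NodeShapeSketch[];
}

interface BuiltTemplate {
  closed: boolean;
  requiredPaths: string[];
  optionalPaths: string[];
  nodelinks: { path: string; shape: BuiltTemplate }[];
  atLeastOneLists: BuiltTemplate[][];
}

function toTemplate(shape: NodeShapeSketch): BuiltTemplate | null {
  if (shape.deactivated) return null; // deactivated shapes are not processed
  const t: BuiltTemplate = {
    closed: shape.closed === true,
    requiredPaths: [], optionalPaths: [], nodelinks: [], atLeastOneLists: [],
  };
  for (const prop of shape.properties) {
    if (prop.node) {
      const linked = toTemplate(prop.node);
      if (linked) t.nodelinks.push({ path: prop.path, shape: linked });
    }
    if ((prop.minCount || 0) > 0) t.requiredPaths.push(prop.path);
    else t.optionalPaths.push(prop.path);
  }
  // sh:and: merge the properties of every member into the current template.
  for (const part of shape.and || []) {
    const merged = toTemplate(part);
    if (!merged) continue;
    t.requiredPaths = t.requiredPaths.concat(merged.requiredPaths);
    t.optionalPaths = t.optionalPaths.concat(merged.optionalPaths);
    t.nodelinks = t.nodelinks.concat(merged.nodelinks);
    t.atLeastOneLists = t.atLeastOneLists.concat(merged.atLeastOneLists);
  }
  // sh:or and sh:xone: each becomes one list of alternatives.
  for (const list of [shape.or, shape.xone]) {
    if (!list) continue;
    const alternatives: BuiltTemplate[] = [];
    for (const alternative of list) {
      const built = toTemplate(alternative);
      if (built) alternatives.push(built);
    }
    t.atLeastOneLists.push(alternatives);
  }
  return t;
}

// Usage with a made-up person shape.
const ns = "http://example.org/";
const personShape: NodeShapeSketch = {
  closed: true,
  properties: [
    { path: ns + "name", minCount: 1 },
    { path: ns + "nickname" },
    { path: ns + "address", minCount: 1, node: { properties: [{ path: ns + "city", minCount: 1 }] } },
  ],
  or: [
    { properties: [{ path: ns + "email", minCount: 1 }] },
    { properties: [{ path: ns + "phone", minCount: 1 }] },
  ],
};
const personTemplate = toTemplate(personShape);
```

In this example the name and address paths end up required, the nickname path optional, the address path gets a node link to a nested template, and the sh:or list becomes one entry in atLeastOneLists with two alternatives.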

Note: When designing SHACL shapes, it is important to understand how they are processed into a Shape Template, because this determines when an HTTP request will be triggered. A cardinality constraint not being exactly matched or a sh:pattern not being respected will not trigger an HTTP request, and will instead just add the invalid quads to the Member. This is a design choice: we only define triggers for HTTP requests from the SHACL shape in order to arrive at a complete set of quads describing the member the data publisher pointed at using tree:member.

Note: The algorithm only takes hints (it does not guarantee a result that validates) from an optional SHACL shapes graph. It only uses the parts of the SHACL Core Constraint Components that are relevant for discovery. It does not support SPARQL-based or JavaScript-based constraints.

It won’t:

  1. Process more complex validation instructions that are part of SHACL such as sh:class, sh:languageIn, sh:pattern, sh:hasValue, qualified value shapes, etc. It is the data publisher’s responsibility to provide valid data, or the responsibility of the user of the library to validate the quads afterwards.
  2. Do automatic target selection based on e.g. sh:targetClass: you need to set the target yourself.

Creating the Shape Template from ShEx

TODO

Logging

Logging can be enabled using the DEBUG environment variable, DEBUG=extract-cbd-shape:*.