RDFLib / pySHACL

A Python validator for SHACL
Apache License 2.0

[Discussion] Higher performance "remote" validation #226

Open ashleysommer opened 2 months ago

ashleysommer commented 2 months ago

This is something I've been thinking about for a while, since it was originally introduced in https://github.com/RDFLib/pySHACL/issues/174

The issue is that PySHACL is primarily designed to run on graphs held in memory using the RDFLib memory store. There are two primary reasons for this:

1) PySHACL copies the target data graph into a new in-memory graph and operates on that copy, to avoid polluting the input graph.
2) PySHACL uses native RDFLib graph operations (e.g., graph.subjects(), graph.objects(), graph.rest()). These are atomic graph operations that read directly from the underlying graph store, and they are hand-built and hand-tweaked for each SHACL constraint to achieve maximum performance.
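For illustration, here is a minimal sketch of both patterns; the `data.ttl` file and the `ex:Person` / `ex:name` vocabulary are placeholders, not part of pySHACL's actual internals:

```python
from rdflib import Graph, Namespace, RDF

EX = Namespace("http://example.org/")

# (1) Work on a copy so the caller's input graph is never polluted
data_graph = Graph().parse("data.ttl")  # hypothetical input file
working_copy = Graph()
for triple in data_graph:
    working_copy.add(triple)

# (2) Atomic, store-level lookups, one call at a time.
# Each call reads directly from the in-memory store, so this is fast locally.
for focus in working_copy.subjects(RDF.type, EX.Person):
    names = list(working_copy.objects(focus, EX.name))
```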

These two concerns do not translate well to "remote" graphs, where "remote" means graphs that do not live in a local RDFLib store but in a graph store service, accessed via a SPARQL endpoint. This can be the case if you're validating against a sparqlstore or sparqlconnector graph in RDFLib, or wrapping your graph with SPARQLWrapper.
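As a concrete example, a remote graph of this kind can be built with RDFLib's SPARQLStore; the endpoint URL and shapes file below are placeholders:

```python
from rdflib import Graph
from rdflib.plugins.stores.sparqlstore import SPARQLStore
from pyshacl import validate

# A read-only graph backed by a remote SPARQL endpoint (placeholder URL)
store = SPARQLStore(query_endpoint="https://example.org/sparql")
remote_graph = Graph(store=store)

# Every rdflib triple-pattern lookup on remote_graph is answered by the
# endpoint over HTTP rather than by a local in-memory store.
conforms, report_graph, report_text = validate(
    remote_graph,
    shacl_graph="shapes.ttl",  # hypothetical shapes file
)
```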

In the remote case, it is not efficient, and often not desirable (or not possible), to create a full in-memory working copy of the remote graph in a memory-backed RDFLib graph. And running atomic graph lookup operations via the SPARQL connector is very bad for performance, because each constraint evaluated results in tens or hundreds of individual synchronous SPARQL queries against the remote graph.
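To make that cost concrete, here is a sketch of a naive minCount-style check run against the remote graph from the earlier snippet; every RDFLib call in the loop becomes its own synchronous HTTP round trip (the vocabulary is again illustrative):

```python
from rdflib import Namespace, RDF

EX = Namespace("http://example.org/")

# One SELECT query to enumerate the focus nodes...
for focus in remote_graph.subjects(RDF.type, EX.Person):
    # ...then one more SELECT query per focus node: N+1 round trips in total
    values = list(remote_graph.objects(focus, EX.name))
    if len(values) < 1:
        print(f"sh:minCount violation at {focus}")
```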

So I'm proposing a new mode of operation for PySHACL: some kind of "SPARQL-optimised" or "remote" mode that will cause PySHACL to use purpose-built SPARQL queries to perform validation instead of RDFLib graph operations. This would be an implementation of the "driver only" interpretation of PySHACL as proposed in #174. The key distinction is that this new mode will not replace the normal operating mode of PySHACL, and will not affect performance for users who primarily validate in-memory graphs.
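For a sense of what "purpose-built SPARQL queries" could mean, a single query can evaluate the same minCount-style constraint for all focus nodes in one round trip. The query below is illustrative, not a committed design; in practice it would be generated from the compiled shape rather than hand-written:

```python
# One round trip instead of N+1: all violating focus nodes in a single SELECT
MIN_COUNT_VIOLATIONS = """
PREFIX ex: <http://example.org/>
SELECT ?focus WHERE {
    ?focus a ex:Person .
    FILTER NOT EXISTS { ?focus ex:name ?value }
}
"""
for row in remote_graph.query(MIN_COUNT_VIOLATIONS):
    print(f"sh:minCount violation at {row.focus}")
```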

There are some questions to think about:

1) Could this be a command-line switch or validator argument, something the user can switch on manually? Or should it be auto-detected when the user passes in a sparqlconnector, sparqlstore, or SPARQLWrapper graph? Could we simply accept an https:// SPARQL endpoint as the graph argument on the command line and have it work automatically? (A possible detection heuristic is sketched after this list.)
2) As we're not creating a working copy of the graph for the validation, does that mean we must avoid polluting the source graph? That would mean we cannot do any OWL/RDFS inferencing, no SHACL Rules can be applied in this mode, and SHACL functions must also be turned off (as these can pollute the graph too) in remote mode.
3) Are there some cases when we do want to pollute the graph? E.g., using PySHACL as a SHACL Rule engine, where you do want the new triples to appear in the source graph. This doesn't make sense for an in-memory local graph, but I see the utility of doing it on a remote one.
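On question 1, a minimal sketch of what auto-detection could look like; `is_remote_graph` is a hypothetical helper, not an existing pySHACL function:

```python
from rdflib import Graph
from rdflib.plugins.stores.sparqlstore import SPARQLStore, SPARQLUpdateStore

def is_remote_graph(data_graph) -> bool:
    """Hypothetical helper: decide whether to switch to 'remote' mode."""
    # A bare SPARQL endpoint URL passed on the command line
    if isinstance(data_graph, str):
        return data_graph.startswith(("http://", "https://"))
    # A Graph backed by one of rdflib's SPARQL store plugins
    if isinstance(data_graph, Graph):
        return isinstance(data_graph.store, (SPARQLStore, SPARQLUpdateStore))
    return False
```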