RDFLib / pySHACL

A Python validator for SHACL
Apache License 2.0

Using SPARQL constraints with the keyword graph on nquads #175

Open Manoe-K opened 1 year ago

Manoe-K commented 1 year ago

Hello, I am trying to use SHACL to validate the shape of N-Quads data. For some of the constraints I use SPARQL-based constraints with the GRAPH keyword, in order to check that the information within a named graph is coherent with the information in the default graph.

Here's an example of such a constraint:

sh:sparql [
        sh:prefixes foaf: ;
        sh:select """
            SELECT $this
            WHERE {
                GRAPH $this { ?s ?p ?o . }
                ?s a foaf:Person .
            }
            """ ;
    ] ;
.

My current approach is to load the N-Quads data into an rdflib Dataset and then call validate() with data_graph=my_dataset.

Doing this raises the following exception, despite my already using a Dataset:

Exception: You performed a query operation requiring a dataset (i.e. ConjunctiveGraph), but operating currently on a single graph.

From reading issue #26, I gather that validating a Dataset validates each named graph one by one, which would explain the exception.

So is there a way to validate N-Quads data that would allow me to use such SPARQL constraints? Or, more generally, is there a way to validate a Dataset as a single block?

ashleysommer commented 1 year ago

Hi @Manoe-K Thanks for bringing this up. The issue from #26 is resolved, and it is not related to this issue.

The issue thread you want to read is https://github.com/RDFLib/pySHACL/issues/152; indeed, your issue is considered a duplicate of that one.

You are right that what you want to do seems like it should be possible, but due to the architecture of PySHACL it is not. (The error you are seeing is misleading.)

The W3C SHACL specification is written with the assumption that your data graph is a single graph, rather than a dataset or a union of named graphs. To adhere to this assumption, when we added the feature to validate an RDFLib Dataset, it was necessary to run validation iteratively on each named graph separately.

Validating across the whole Dataset at once would require re-engineering large portions of PySHACL, and would also require it to make decisions about validation that are outside the scope of the W3C SHACL specification; that is dangerous territory. I have not used other SHACL validation engines, but I believe none of them has this ability either, for the same reason.