TopQuadrant / shacl

SHACL API in Java based on Apache Jena
Apache License 2.0

SHACL validation with sh:select is producing different results than essentially the same SPARQL executed by QueryExecution #129

Open beaudet opened 2 years ago

beaudet commented 2 years ago

Please see contents of the attached zip file for reference.

The class LinkedArtValidator

  1. loads an ontology
  2. loads a SHACL shape
  3. loads very simple test data
  4. produces a validation report showing a validation failure of one of the two individuals in the data graph.

The data has two nested art objects, both of the same type (crm:E22_Human-Made_Object), but the child object is a part of (crm:P46_is_composed_of) the parent object. The sh:select in the object.ttl shape file should never return the child object as a solution, since it should be removed by the expression:

minus { ?s2 crm:P46_is_composed_of $this }

and yet, the child object is returned.

Once the report is produced, I run essentially the same query with QueryExecution and no results are returned. That's true whether or not I comment out the bind() statement in the SPARQL sent to QueryExecution; the bind() is there to mimic the SHACL validator, which binds the focus node just before executing the SPARQL.

The full output of the class LinkedArtValidator produces:

Conforms = false
Node=<https://linked.art/example/object/childObject>
  There are parent art objects without an 'artwork' classification (aat:300133025)
Compare topbraid validation report above with straight Apache Jena SPARQL Query below
No results - this is what we would expect when the SHACL runs too but that's not the case.  When SHACL runs, the childObject is returned

I'm fairly new to Apache Jena, SHACL, and SPARQL so perhaps I'm missing something, but this behavior is puzzling, so I figure it might also be a bug.

I've rolled back the versions I've been using up until now to 1.4.1 for SHACL and 4.2.0 for Apache Jena, but the behavior is identical to what I'm seeing with a local SHACL 1.4.3-SNAPSHOT (updated for compatibility with Jena 4.3.1) and Jena 4.3.1.

Thanks! shacl-validation-test.zip

beaudet commented 2 years ago

Update: It seems the following modifications to the SHACL sh:select query produce the intended results. Is it "normal" to have to explicitly bind a 2nd variable to ?this when an sh:select uses a sub-query?

                prefix aat:    <http://vocab.getty.edu/aat/>
                prefix crm:    <http://www.cidoc-crm.org/cidoc-crm/>

                select ?this
                {
                    # I think this should work as a simulation of the variable binding that happens during SHACL validation
                    bind (?this as ?s1)
                    {
                    select
                        ?s1 ?this
                        ( count(?s1) as ?topLevelObjects )
                        ( count(?term) as ?validArtTermCount )
                        where {

                            ?s1 a crm:E22_Human-Made_Object
                            optional {
                                ?s1 crm:P2_has_type ?term
                                filter (?term = aat:300133025)
                            }
                            minus { ?s2 crm:P46_is_composed_of ?s1 }
                        }

                        group by ?s1 ?this
                    }
                }
                having (?topLevelObjects > 0 && ?validArtTermCount < 1)

The SPARQL query plan changes from this:

(project (?this)
  (project (?this ?topLevelObjects ?validArtTermCount)
    (filter (< ?validArtTermCount 1)
      (extend ((?validArtTermCount ?/.1))
        (filter (> ?topLevelObjects 0)
          (extend ((?topLevelObjects ?/.0))
            (group (?this) ((?/.0 (count ?this)) (?/.1 (count ?/term)))
              (minus
                (conditional
                  (bgp (triple <https://linked.art/example/object/childObject> <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://www.cidoc-crm.org/cidoc-crm/E22_Human-Made_Object>))
                  (assign ((?/term <http://vocab.getty.edu/aat/300133025>))
                    (bgp (triple <https://linked.art/example/object/childObject> <http://www.cidoc-crm.org/cidoc-crm/P2_has_type> <http://vocab.getty.edu/aat/300133025>))))
                (bgp (triple ?/s2 <http://www.cidoc-crm.org/cidoc-crm/P46_is_composed_of> <https://linked.art/example/object/childObject>))))))))))

to this

(project (?this)
  (join
    (extend ((?s1 <https://linked.art/example/object/childObject>))
      (table unit))
    (project (?s1 ?this ?topLevelObjects ?validArtTermCount)
      (filter (< ?validArtTermCount 1)
        (extend ((?validArtTermCount ?/.1))
          (filter (> ?topLevelObjects 0)
            (extend ((?topLevelObjects ?/.0))
              (group (?s1 ?this) ((?/.0 (count ?s1)) (?/.1 (count ?/term)))
                (minus
                  (conditional
                    (bgp (triple ?s1 <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://www.cidoc-crm.org/cidoc-crm/E22_Human-Made_Object>))
                    (assign ((?/term <http://vocab.getty.edu/aat/300133025>))
                      (bgp (triple ?s1 <http://www.cidoc-crm.org/cidoc-crm/P2_has_type> <http://vocab.getty.edu/aat/300133025>))))
                  (bgp (triple ?/s2 <http://www.cidoc-crm.org/cidoc-crm/P46_is_composed_of> ?s1)))))))))))

HolgerKnublauch commented 2 years ago

Sorry this goes a bit beyond what I can support to investigate. These pre-binding issues are complex, MINUS is often misunderstood and I don't understand all implications either.

I assume you have checked https://www.w3.org/TR/shacl/#pre-binding, for example the rule that subqueries must return all pre-bound variables. Also notice that the execution works from the inside out, i.e. the inner SELECT is executed first and (I believe) doesn't use the variables from the outside.
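
As a rough illustration of that rule (a sketch only, not taken from the attached object.ttl), a nested SELECT can project $this itself rather than introduce a second variable for the focus node. The alias ?validArtTermCount is borrowed from the query posted above; the simplified query shape is an assumption, and the MINUS clause is deliberately left out.

```sparql
prefix aat: <http://vocab.getty.edu/aat/>
prefix crm: <http://www.cidoc-crm.org/cidoc-crm/>

select $this
where {
    {
        # The subquery returns the pre-bound variable $this,
        # so the focus node stays visible to the outer query.
        select $this (count(?term) as ?validArtTermCount)
        where {
            $this a crm:E22_Human-Made_Object .
            optional {
                $this crm:P2_has_type ?term .
                filter (?term = aat:300133025)
            }
        }
        group by $this
    }
    # One solution per focus node that has no 'artwork' classification.
    filter (?validArtTermCount < 1)
}
```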

afs commented 2 years ago

SHACL spec says "SPARQL queries must not contain a MINUS clause" and you have $this inside the MINUS.

    # I think this should work as a simulation of the variable binding that happens during SHACL validation

It changes scoping of variables quite a lot and might affect the meaning of the query because ?s1

(bgp (triple ?/s2 <.../P46_is_composed_of> <https://linked.art/example/object/childObject>))

became

(bgp (triple ?/s2 <.../P46_is_composed_of> ?s1))

BTW Jena is changing to prefer substitution semantics (replacement rewrite in the syntax, before algebra generation and optimization) as being clearer for users, works for remote queries, and has defined effects. See QueryExecBuilder.

https://afs.github.io/substitute.html is slightly different again and is correlated subquery with variable retention.
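
One way to read the first plan, assuming pre-binding acts like substituting the focus node for $this: once the child object's IRI replaces $this on both sides, the MINUS pattern shares no variable with the pattern to its left. Under SPARQL 1.1, MINUS only removes solutions that share at least one variable with the right-hand side, so nothing is removed. The reduced query below is a sketch of that situation, not the shape's actual query, and the obj: prefix is introduced here only for readability.

```sparql
prefix crm: <http://www.cidoc-crm.org/cidoc-crm/>
prefix obj: <https://linked.art/example/object/>

select *
where {
    # After substitution the left side contains no variables at all.
    obj:childObject a crm:E22_Human-Made_Object .
    # The only variable here is ?s2, which the left side never binds,
    # so MINUS has no shared variable and eliminates no solutions.
    minus { ?s2 crm:P46_is_composed_of obj:childObject }
}
```

In the second plan the MINUS pattern still mentions ?s1, which the left side also binds, so the child object is removed as intended.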

beaudet commented 2 years ago

Thanks for pointing that out @afs - yes, my SPARQL query is quite non-SHACL compliant isn't it? It sounds like the SHACL processor should be reporting a failure in that case so I guess this is a bug report after all.

It does make me wonder about what the best approach to accomplishing my requirements with SHACL is so I'm wondering if you have any guidance on that. For example, should I instead create three art object shapes and combine them with sh:or in the parent shape? Is that a better approach in general than getting too fancy with the SPARQL in sh:selects?

HolgerKnublauch commented 2 years ago

MINUS can typically be avoided through a FILTER EXISTS / NOT EXISTS. Have you tried that and were you able to get the results you wanted?
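
A minimal sketch of what that could look like for this shape, assuming the intent is "top-level objects must carry the aat:300133025 classification". It has not been run against the attached test data, and ?parent is a variable name chosen here for illustration.

```sparql
prefix aat: <http://vocab.getty.edu/aat/>
prefix crm: <http://www.cidoc-crm.org/cidoc-crm/>

select $this
where {
    $this a crm:E22_Human-Made_Object .
    # Skip objects that are a part of some other object.
    filter not exists { ?parent crm:P46_is_composed_of $this }
    # Report the object only if the 'artwork' classification is missing.
    filter not exists { $this crm:P2_has_type aat:300133025 }
}
```

Unlike MINUS, FILTER NOT EXISTS is evaluated with the current bindings in scope, so it still applies when $this has been replaced by a concrete focus node.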

beaudet commented 2 years ago

I'll definitely give that a shot, but the larger issue might be the bind that ties ?s1 to $this, which is also disallowed by the spec.

beaudet commented 2 years ago

I'm also trying to segment art objects into two different shapes which can be either one or the other. That might avoid some of the more complicated SPARQL.

Related question, although probably one for Jena rather than this project: I'm not seeing inverseOf handled in general SPARQL queries (non-SHACL). I would expect a query for ?o :inversePredicate ?s to work when ?s :Predicate ?o is present in the graph, but that does not seem to be the case despite using a Model with inference support. Is there some kind of inference spell I failed to cast with Jena? Do I need to do anything beyond selecting a ModelSpec with inference support in order to be able to query on inferred triples?

HolgerKnublauch commented 2 years ago

The latter really sounds like a Jena question that should go elsewhere, e.g. the Jena-users mailing list. SPARQL itself doesn't know anything about inferencing, and will only see the triples that you have given it. So I would suggest you send along the exact inferencing configuration and API call for the Jena question.
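
To illustrate the point that SPARQL only sees the triples it is given: when no inferred triples are present, a query can ask for both directions explicitly with a property path. This is a workaround sketch, not a substitute for configuring inference in Jena; the ex: namespace and both property names are hypothetical.

```sparql
prefix ex: <http://example.org/>

select ?s ?o
where {
    # Matches either an asserted  ?o ex:inversePredicate ?s
    # or an asserted              ?s ex:predicate ?o  (via the inverse path ^).
    ?o (ex:inversePredicate | ^ex:predicate) ?s
}
```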

beaudet commented 2 years ago

Thanks for the lead. I emailed the Apache Jena mailing list and found another similar question from 2020.

As for this SHACL lib, do you think it's a bug that spec-non-compliant SHACL is allowed to pass to the query executor without issuing any warnings or errors?