apache / jena

Interaction of GRAPH graph patterns and subqueries #2793

Open nkaralis opened 1 day ago

nkaralis commented 1 day ago

Version

5.2.0

Question

Hello,

I have some questions about the interaction of GRAPH graph patterns and subqueries.

I am using version 5.2.0.

Assume the scenario described below.

First, I load a graph into two separate named graphs.

LOAD <https://raw.githubusercontent.com/w3c/rdf-tests/refs/heads/main/sparql/sparql11/functions/data.ttl> INTO GRAPH <http://www.example.org/graph1> ;
LOAD <https://raw.githubusercontent.com/w3c/rdf-tests/refs/heads/main/sparql/sparql11/functions/data.ttl> INTO GRAPH <http://www.example.org/graph2>

Both graphs contain 16 triples.

The query below returns the triples found in both graphs, which gives 32 solutions. Here, ?g is always unbound.

SELECT * WHERE {
    GRAPH ?g { 
        {
            SELECT ?s ?p ?o  WHERE {
                ?s ?p ?o
            }
        }
    }
}

The query below also returns 32 results. In this case, ?g is always bound to a value (i.e., <http://www.example.org/graph1> or <http://www.example.org/graph2>).

SELECT * WHERE {
    GRAPH ?g { 
        {
            SELECT * WHERE {
                ?s ?p ?o
            }
        }
    }
}

I have the following questions:

For both queries, I was expecting 64 results: the Cartesian product of the subquery results (32 results) and the possible values for ?g (2 named graphs).

Thank you in advance.

rvesse commented 1 day ago

Can you provide details of what your storage setup is e.g.

In algebra terms, these end up as different algebra expressions, which likely explains the difference in results.

Your first query yields the following algebra:

(base <http://example/base/>
  (project (?s ?p ?o)
    (quadpattern (quad ?g ?s ?p ?o))))

While your second yields the following algebra:

(base <http://example/base/>
  (quadpattern (quad ?g ?s ?p ?o)))

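Notice that with the SELECT * in the inner query the project step is omitted from the generated algebra, so ?g stays bound. With the explicit SELECT ?s ?p ?o, the project step keeps only ?s, ?p and ?o, so ?g is dropped and always comes out unbound. However, I'm not sure if this is the correct behaviour here; that is probably a question for @afs to answer.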
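
To make the effect of the project step concrete, here is a toy model in Python. It is only a sketch and not Jena code: the `project` helper, the dict-based solution rows, and the sample values are assumptions made for illustration. It shows that projecting onto (?s ?p ?o) removes ?g from every solution without changing the number of solutions.

```python
# Toy illustration of the algebra's project step; this is not Jena code.
# Assumption: a solution is a dict mapping variable names to values.

def project(rows, variables):
    """Keep only the listed variables in each solution, like (project (...))."""
    return [{v: row[v] for v in variables if v in row} for row in rows]

# Hypothetical solutions of (quadpattern (quad ?g ?s ?p ?o)).
rows = [
    {"g": "http://www.example.org/graph1", "s": "ex:a", "p": "ex:p", "o": "1"},
    {"g": "http://www.example.org/graph2", "s": "ex:a", "p": "ex:p", "o": "1"},
]

with_project = project(rows, ["s", "p", "o"])  # shape of the first query
without_project = rows                         # shape of the second query

print(len(with_project), "g" in with_project[0])        # 2 False
print(len(without_project), "g" in without_project[0])  # 2 True
```

Projection does not remove duplicates, so the two rows that differ only in ?g both survive. That matches the first query still returning 32 solutions with ?g unbound.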


For both queries, I was expecting 64 results: the Cartesian product of the subquery results (32 results) and the possible values for ?g (2 named graphs).

That shouldn't ever be the case. A GRAPH ?g clause is logically defined so that the inner pattern is executed independently for each graph in the dataset, and the results are union'd together. So each graph independently yields 16 results, and these union together to yield 32 results.
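
As a rough illustration of why the count is 32 rather than 64, here is a toy model in Python of how a plain GRAPH ?g { ?s ?p ?o } is evaluated under the SPARQL definition: the inner pattern is matched against one named graph at a time, each solution is joined with that graph's name as the value of ?g, and the per-graph results are unioned. It is not Jena code. The dict-based dataset, the helper names `eval_pattern` and `eval_graph_var`, and the 16 placeholder triples are all assumptions made for the sketch.

```python
# Toy model of GRAPH ?g evaluation; this is not Jena code.
# Assumption: a dataset is a dict mapping graph name -> set of (s, p, o) triples.

def eval_pattern(graph):
    """Match the pattern { ?s ?p ?o } against a single graph."""
    return [{"s": s, "p": p, "o": o} for (s, p, o) in sorted(graph)]

def eval_graph_var(dataset):
    """Evaluate GRAPH ?g { ?s ?p ?o }: run the inner pattern once per
    named graph, tag each solution with that graph's name, and union."""
    solutions = []
    for name, graph in dataset.items():
        for row in eval_pattern(graph):
            solutions.append({"g": name, **row})
    return solutions

# Placeholder data: the same 16 made-up triples in both named graphs.
triples = {(f"ex:s{i}", "ex:p", f"ex:o{i}") for i in range(16)}
dataset = {
    "http://www.example.org/graph1": set(triples),
    "http://www.example.org/graph2": set(triples),
}

solutions = eval_graph_var(dataset)
print(len(solutions))  # 32
```

Getting 64 would require pairing the solutions from each graph with every graph name, which this definition never does: a solution found in graph1 is only ever paired with graph1.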

nkaralis commented 1 day ago

I am using Fuseki with TDB2.

# for starting the server
java -jar fuseki-server.jar --update --tdb2 --loc=databases/testing /endpoint

I am using the default config file found in apache-fuseki-5.2.0/run

# Licensed under the terms of http://www.apache.org/licenses/LICENSE-2.0

## Fuseki Server configuration file.

@prefix :        <#> .
@prefix fuseki:  <http://jena.apache.org/fuseki#> .
@prefix rdf:     <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .
@prefix rdfs:    <http://www.w3.org/2000/01/rdf-schema#> .
@prefix ja:      <http://jena.hpl.hp.com/2005/11/Assembler#> .

[] rdf:type fuseki:Server ;
   # Example::
   # Server-wide query timeout.   
   # 
   # Timeout - server-wide default: milliseconds.
   # Format 1: "1000" -- 1 second timeout
   # Format 2: "10000,60000" -- 10s timeout to first result, 
   #                            then 60s timeout for the rest of query.
   #
   # See javadoc for ARQ.queryTimeout for details.
   # This can also be set on a per dataset basis in the dataset assembler.
   #
   # ja:context [ ja:cxtName "arq:queryTimeout" ;  ja:cxtValue "30000" ] ;

   # Add any custom classes you want to load.
   # Must have a "public static void init()" method.
   # ja:loadClass "your.code.Class" ;   

   # End triples.
   .

That shouldn't ever be the case. A GRAPH ?g clause is logically defined so that the inner pattern is executed independently for each graph in the dataset, and the results are union'd together. So each graph independently yields 16 results, and these union together to yield 32 results.

I see. That makes sense, thank you.