ad-freiburg / qlever

Very fast SPARQL Engine, which can handle very large knowledge graphs like the complete Wikidata, offers context-sensitive autocompletion for SPARQL queries, and allows combination with text search. It's faster than engines like Blazegraph or Virtuoso, especially for queries involving large result sets.
Apache License 2.0
399 stars 50 forks source link

Support for unconnected graph patterns/triples #272

Closed niklas88 closed 5 months ago

niklas88 commented 5 years ago

One long standing limitation of QLever is that the following doesn't work

PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX wd: <http://www.wikidata.org/entity/>
PREFIX wdt: <http://www.wikidata.org/prop/direct/>
SELECT ?where ?name WHERE {  
  wd:Q18642047 wdt:P108 ?where .
  wd:Q18642047 rdfs:label ?name .
  FILTER (lang(?name) = "en") .
}

Since I tried that a long time ago and assumed this was invalid in SPARQL and I think @Buchhold also said so. However @lehmann-4178656ch also tried to do something like this and it turns out this works just fine in Blazegraph. Also with the newer syntax changes one can now write this as

PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX wd: <http://www.wikidata.org/entity/>
PREFIX wdt: <http://www.wikidata.org/prop/direct/>
SELECT ?where ?name WHERE {
  wd:Q18642047 wdt:P108 ?where; rdfs:label ?name .
  FILTER (lang(?name) = "en") .
}

which also doesn't work but more obviously should. It seems this is based on a limitation of only being able to join on variables. The following query that turns the entity into a 1 × 1 table and gives it an associated variable makes the query workable.

PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX wd: <http://www.wikidata.org/entity/>
PREFIX wdt: <http://www.wikidata.org/prop/direct/>
SELECT ?where ?name WHERE {
  VALUES ?value { wd:Q18642047 }
  ?value wdt:P108 ?where; rdfs:label ?name .
  FILTER (lang(?name) = "en") .
}
floriankramer commented 5 years ago

@niklas88 After thinking about this for a while I decided to look up the definition of solution mappings in the SPARQL standard. According to the standard a solution mapping u is a function from variables to iris. The solution of a query is then a multiset of these solution mappings (a multiset to allow for duplicate rows in the result). A valid solution for a BasicGraphPattern (BGP) (such as {wd:Q18642047 wdt:P108 ?where}) is a multiset of solution mappings which in turn is a reduction to the variables of a pattern instance mapping (P) from the basic graph pattern to a subset of the knowledge base. This means, that P(BGP) has to take wd:Q18642047 (from the example) into account to ensure it maps to a subset of the knowledge base. The solution mapping O which is the actual result of evaluating the triple, and later used as the input to subsequent joins discard everything that is not a variable though, and only stores the results for ?where (in the form of functions mapping ?where to iris).

The standard on join says, that the join of two multisets of solution mappings O1 and O2 is a new solution mapping that merges any two elements of O1 and O2 that are compatible. Two solution mappings u1 and u2 are compatible if for all variables v that are in both u1 and u2 u1(v) = u2(v). Thus the join of two multisets of solution mapping in which the solution mappings do not share any variables includes every possible combination of solution mappings from both multisets. For the first example query in the issue this happens to be the same as joining them on the non variable wd:Q18642047.

QLever is then currently non standard conform by considering queries consisting of disjoint graph patterns (in respect to their variables) as invalid instead of computing the join of them (which in sparql is well defined).

niklas88 commented 5 years ago

Good point about this actually being the possible combinations of solution mappings. This means it has nothing to do with the fixed values but it's all about how we treat GPs not connected by a variable.

The following disjoint query is also valid and just does a cross product on the solution sets.

(14 occupations for Einstein 6 occupations for Bohr 9 occupations for Dirac = 756 rows)

PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX wd: <http://www.wikidata.org/entity/>
PREFIX wdt: <http://www.wikidata.org/prop/direct/>
SELECT ?occ_einstein ?occ_bohr ?occ_dirac WHERE {
  wd:Q937 wdt:P106 ?occ_einstein .
  wd:Q7085 wdt:P106 ?occ_bohr . 
  wd:Q47480 wdt:P106 ?occ_dirac .
}

So afaik all we need to do is an all-pairs cross-product between the solution sets of all GraphPatterns in a GroupGraphPattern that are not connected by a variable. Then one would end up with the correct result also for the following query where I previously thought it would require a join on the fixed value (Einstein).

PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX wd: <http://www.wikidata.org/entity/>
PREFIX wdt: <http://www.wikidata.org/prop/direct/>
SELECT ?spouse ?sibling WHERE {
  wd:Q937 wdt:P3373 ?sibling;
          wdt:P26 ?spouse . 
}

This also means our TripleGraph is actually fine, we just have to treat its independent components differently.

Note that this extends to all kinds of unconnected GPs for example the below query is equivalent to the first but does the cross product across different GPs.

PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX wd: <http://www.wikidata.org/entity/>
PREFIX wdt: <http://www.wikidata.org/prop/direct/>
SELECT ?occ_einstein ?occ_bohr ?occ_dirac WHERE {
  { SELECT ?occ_einstein WHERE { wd:Q937 wdt:P106 ?occ_einstein . } }
  { SELECT ?occ_bohr WHERE { wd:Q7085 wdt:P106 ?occ_bohr . } }
  { SELECT ?occ_dirac WHERE {wd:Q47480 wdt:P106 ?occ_dirac . } }
}
joka921 commented 5 months ago

Cartesian product joins between unconnected graph patterns have been supported for quite a while now, so this issue is fixed.