RDFLib / rdflib

RDFLib is a Python library for working with RDF, a simple yet powerful language for representing information.
https://rdflib.readthedocs.org
BSD 3-Clause "New" or "Revised" License

Serializing SPARQL Query Results with Aggregates over Variables from Optional Graph Pattern #2229

Closed prohde closed 1 year ago

prohde commented 1 year ago

I ran into an issue when serializing the results of SPARQL queries that aggregate over variables from optional graph patterns, i.e., variables that may be unbound. I am using rdflib==6.2.0.

The query in question is:

SELECT DISTINCT ?x (COUNT(DISTINCT ?inst) AS ?cnt) WHERE {
  ?x a <http://swat.cse.lehigh.edu/onto/univ-bench.owl#GraduateStudent> 
  OPTIONAL {
    VALUES ?inst { <http://www.University0.edu> <http://www.University1.edu> }. 
    ?x <http://swat.cse.lehigh.edu/onto/univ-bench.owl#undergraduateDegreeFrom> ?inst .
  }
} GROUP BY ?x

For each graduate student, I want to know how many undergraduate degrees he/she has from the list of universities provided using the VALUES clause. I am using OPTIONAL here since I am also interested in getting 0 if the student doesn't have a degree from one of the specified universities.

The query runs fine against my SPARQL endpoint, but when I run it with rdflib on an in-memory RDF graph, I get the following exception:

Traceback (most recent call last):
  File "/home/my_path/venv/lib/python3.9/site-packages/rdflib/plugins/sparql/evalutils.py", line 68, in _eval
    return ctx[expr]
  File "/home/my_path/venv/lib/python3.9/site-packages/rdflib/plugins/sparql/sparql.py", line 175, in __getitem__
    return self.ctx.initBindings[key]  # type: ignore[index]
KeyError: rdflib.term.Variable('inst')

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/home/my_path/graph_test.py", line 33, in run_query
    res_json = res.serialize(format='json')
  File "/home/my_path/venv/lib/python3.9/site-packages/rdflib/query.py", line 252, in serialize
    serializer.serialize(stream2, encoding=encoding, **args)  # type: ignore
  File "/home/my_path/venv/lib/python3.9/site-packages/rdflib/plugins/sparql/results/jsonresults.py", line 43, in serialize
    self._bindingToJSON(x) for x in self.result.bindings
  File "/home/my_path/venv/lib/python3.9/site-packages/rdflib/query.py", line 184, in bindings
    self._bindings += list(self._genbindings)
  File "/home/my_path/venv/lib/python3.9/site-packages/rdflib/plugins/sparql/evaluate.py", line 541, in evalDistinct
    for x in res:
  File "/home/my_path/venv/lib/python3.9/site-packages/rdflib/plugins/sparql/evaluate.py", line 550, in <genexpr>
    return (row.project(project.PV) for row in res)
  File "/home/my_path/venv/lib/python3.9/site-packages/rdflib/plugins/sparql/evaluate.py", line 100, in evalExtend
    for c in evalPart(ctx, extend.p):
  File "/home/my_path/venv/lib/python3.9/site-packages/rdflib/plugins/sparql/evaluate.py", line 100, in evalExtend
    for c in evalPart(ctx, extend.p):
  File "/home/my_path/venv/lib/python3.9/site-packages/rdflib/plugins/sparql/evaluate.py", line 453, in evalAggregateJoin
    aggregator.update(row)
  File "/home/my_path/venv/lib/python3.9/site-packages/rdflib/plugins/sparql/aggregates.py", line 256, in update
    if acc.use_row(row):
  File "/home/my_path/venv/lib/python3.9/site-packages/rdflib/plugins/sparql/aggregates.py", line 68, in use_row
    return self.eval_row(row) not in self.seen
  File "/home/my_path/venv/lib/python3.9/site-packages/rdflib/plugins/sparql/aggregates.py", line 62, in eval_row
    return _eval(self.expr, row)
  File "/home/my_path/venv/lib/python3.9/site-packages/rdflib/plugins/sparql/evalutils.py", line 71, in _eval
    raise NotBoundError("Variable %s is not bound" % expr)
rdflib.plugins.sparql.sparql.NotBoundError: Variable inst is not bound

I tried rewriting my original query as follows to work around the issue, but without success.

SELECT DISTINCT ?x (IF(bound(?inst), COUNT(DISTINCT ?inst), 0) AS ?cnt) WHERE {
  ?x a <http://swat.cse.lehigh.edu/onto/univ-bench.owl#GraduateStudent> 
  OPTIONAL {
    VALUES ?inst { <http://www.University0.edu> <http://www.University1.edu> }. 
    ?x <http://swat.cse.lehigh.edu/onto/univ-bench.owl#undergraduateDegreeFrom> ?inst .
  }
}

As I understand it, the error occurs in the COUNT, which should not be evaluated at all because of the IF.

Many thanks in advance!

WhiteGobo commented 1 year ago

I couldn't reproduce that error (with Python 3.11). My script:

#import sys
#sys.path.insert(0,"path/to/rdflib-6.2")
import rdflib

query = """
SELECT DISTINCT ?x (COUNT(DISTINCT ?inst) as ?cnt)
WHERE {
    ?x a ex:a
    OPTIONAL {
        VALUES ?inst {ex:b ex:c}.
        ?x ex:d ?inst.
    }
}  GROUP BY ?x
"""

ex = rdflib.Namespace("http://example.com/")
g = rdflib.Graph()
g.bind("ex", ex)
g.parse(format="ttl", data="""@prefix ex: <http://example.com/>.
        <1> a ex:a;
            ex:d ex:b.
        <2> a ex:a;
            ex:d ex:c;
            ex:d ex:b.
        <3> a ex:a;
            ex:d ex:c.
""")
print(list(g.query(query)))

prohde commented 1 year ago

Hi @WhiteGobo,

This is because all of your instances (<1>, <2>, and <3>) are connected to at least one of ex:b or ex:c via ex:d. If you remove the last triple, so that the expected count for <3> is 0 (zero), the same error appears.

import rdflib

query = """
SELECT DISTINCT ?x (COUNT(DISTINCT ?inst) as ?cnt)
WHERE {
    ?x a ex:a
    OPTIONAL {
        VALUES ?inst {ex:b ex:c}.
        ?x ex:d ?inst.
    }
}  GROUP BY ?x
"""

ex = rdflib.Namespace("http://example.com/")
g = rdflib.Graph()
g.bind("ex", ex)
g.parse(format="ttl", data="""@prefix ex: <http://example.com/>.
        <1> a ex:a;
            ex:d ex:b.
        <2> a ex:a;
            ex:d ex:c;
            ex:d ex:b.
        <3> a ex:a.
""")
print(list(g.query(query)))

WhiteGobo commented 1 year ago

OK, I have made a fix in a separate branch that should resolve that error. Link to the branch

I need some time to turn this into a PR, because I want to add a test and I still have to look at plugins/sparql/aggregates.py: the code there already tries to catch the NotBoundError, and I'm not sure whether that line of code is ever reached.
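
The traceback earlier in this thread shows the NotBoundError surfacing from use_row (via eval_row), that is, during the DISTINCT check. The following toy model illustrates why a handler placed elsewhere might never be reached. It is not rdflib's actual code: NotBound, eval_var, DistinctCounter, and the tolerate_unbound flag are hypothetical names, and solution rows are plain dicts. It contrasts a DISTINCT check that lets the exception escape with one that treats an unbound variable as "this row contributes nothing".

```python
class NotBound(Exception):
    """Stand-in for an 'unbound variable' error (hypothetical name)."""


def eval_var(var, row):
    """Look up a variable in a solution row; raise if it is unbound."""
    try:
        return row[var]
    except KeyError:
        raise NotBound(var) from None


class DistinctCounter:
    """Toy COUNT(DISTINCT ?var) accumulator over dict-shaped rows."""

    def __init__(self, var, tolerate_unbound):
        self.var = var
        self.tolerate_unbound = tolerate_unbound
        self.seen = set()

    def use_row(self, row):
        """Return True if the row contributes a value not seen before."""
        try:
            return eval_var(self.var, row) not in self.seen
        except NotBound:
            if self.tolerate_unbound:
                return False  # unbound: the row adds nothing to the count
            raise             # strict mode: the error escapes the check

    def update(self, row):
        if self.use_row(row):
            self.seen.add(eval_var(self.var, row))

    @property
    def count(self):
        return len(self.seen)


# One group whose only row leaves ?inst unbound, like <3> in the example.
rows = [{"x": "s3"}]

tolerant = DistinctCounter("inst", tolerate_unbound=True)
for row in rows:
    tolerant.update(row)

strict = DistinctCounter("inst", tolerate_unbound=False)
try:
    for row in rows:
        strict.update(row)
    strict_failed = False
except NotBound:
    strict_failed = True

print(tolerant.count, strict_failed)
```

In this model the tolerant counter ends at 0 for the group, while the strict one aborts before update ever gets to add a value, which mirrors the shape of the failure reported above.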