RDFLib / rdflib

RDFLib is a Python library for working with RDF, a simple yet powerful language for representing information.
https://rdflib.readthedocs.org
BSD 3-Clause "New" or "Revised" License

strange bug related to OPTIONAL and DISTINCT #716

Open pfps opened 7 years ago

pfps commented 7 years ago

When I run the query in the example below, I get a strange error from inside rdflib's SPARQL query code. The error goes away if DISTINCT is removed.

example.py:

#!/usr/bin/python3
import rdflib

dataGraph = rdflib.Graph()
dataGraph = dataGraph.parse("test-data.ttl", format='turtle')

def run(query):
    results = dataGraph.query(query)
    print("PRINTING")
    for row in results:
        print(row.this)
    print("PRINTING")

foo = run("""
SELECT ?this WHERE { 
  ?this rdf:type <http://ex.com/c1> . OPTIONAL { ?this <http://ex.com/rt> ?value .  } 
} GROUP BY ?this HAVING ( COUNT (DISTINCT ?value) < 3 )
""")

test-data.ttl:

@prefix xs: <http://www.w3.org/2001/XMLSchema#> .
@prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .
@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .
@prefix sh: <http://www.w3.org/ns/shacl#> .
@prefix shmm: <http://www.w3.org/ns/shaclmm#> .
@prefix ex: <http://ex.com/> .
ex:i1 rdf:type ex:c1 ;
  rdf:type ex:clr ;
  ex:rt ex:i2 , ex:i4 .
ex:i2 rdf:type ex:c1 ;
  rdf:type ex:clq ;
  ex:rt _:i3 , ex:i4, ex:i6 .
_:i3 rdf:type ex:c1 .
ex:i4 rdf:type ex:c2 ;
  rdf:type ex:clq .
_:i5 rdf:type ex:c2 .
ex:i6 rdf:type ex:c3 .
_:i7 rdf:type ex:c3 .
ex:i8 rdf:type ex:c4 .
_:i9 rdf:type ex:c4 .
ex:i10 rdf:type ex:c5 .
_:i11 rdf:type ex:c5 .
ex:i12 rdf:type ex:c6 .
_:i13 rdf:type ex:c6 .
ex:i14 rdf:type ex:c6 .
ex:i15 rdf:type ex:c6 .
ex:i16 rdf:type ex:c6 .
_:i17 rdf:type ex:c6 .

Output:

$ python example.py 
PRINTING
Traceback (most recent call last):
  File "/usr/lib/python3.5/site-packages/rdflib/plugins/sparql/evalutils.py", line 71, in _eval
    return ctx[expr]
  File "/usr/lib/python3.5/site-packages/rdflib/plugins/sparql/sparql.py", line 169, in __getitem__
    return self._d[key]
KeyError: rdflib.term.Variable('value')

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "./example.py", line 17, in <module>
    """)
  File "./example.py", line 10, in run
    for row in results : print(row.this)
  File "/usr/lib/python3.5/site-packages/rdflib/query.py", line 258, in __iter__
    for b in self._genbindings:
  File "/usr/lib/python3.5/site-packages/rdflib/plugins/sparql/evaluate.py", line 395, in <genexpr>
    return (row.project(project.PV) for row in res)
  File "/usr/lib/python3.5/site-packages/rdflib/plugins/sparql/evaluate.py", line 75, in evalExtend
    for c in evalPart(ctx, extend.p):
  File "/usr/lib/python3.5/site-packages/rdflib/plugins/sparql/evaluate.py", line 154, in evalFilter
    for c in evalPart(ctx, part.p):
  File "/usr/lib/python3.5/site-packages/rdflib/plugins/sparql/evaluate.py", line 302, in evalAggregateJoin
    res[k].update(row)
  File "/usr/lib/python3.5/site-packages/rdflib/plugins/sparql/aggregates.py", line 253, in update
    if acc.use_row(row):
  File "/usr/lib/python3.5/site-packages/rdflib/plugins/sparql/aggregates.py", line 70, in use_row
    return self.eval_row(row) not in self.seen
  File "/usr/lib/python3.5/site-packages/rdflib/plugins/sparql/aggregates.py", line 64, in eval_row
    return _eval(self.expr, row)
  File "/usr/lib/python3.5/site-packages/rdflib/plugins/sparql/evalutils.py", line 74, in _eval
    raise NotBoundError("Variable %s is not bound" % expr)
rdflib.plugins.sparql.sparql.NotBoundError: Variable value is not bound

(edited to include example code)

Candyapple35 commented 7 years ago

Real ones keep it real and have full access to remove all spiders and bugs in the current system

boris-mtdv commented 6 years ago

I have the same problem and no one has been able to help me with it. Any advice would be much appreciated.

Akash-Sharma-1 commented 4 years ago

Hi! Can I work on this issue? Seems interesting to me. Just waiting for the green light from the maintainers and contributors.

white-gecko commented 4 years ago

Sure, you are welcome. We are happy about any pull request that helps us fix bugs.

Akash-Sharma-1 commented 4 years ago

Hi everyone!

I believe I was able to find the crux of the issue. I modified the query by removing the DISTINCT keyword and selecting ?value along with ?this:

foo = run("""
SELECT ?this ?value WHERE {
  ?this rdf:type <http://ex.com/c1> . OPTIONAL { ?this <http://ex.com/rt> ?value . }
} GROUP BY ?this HAVING ( COUNT (?value) < 3 )
""")

This produces the following result, consisting of two tuples of the form (?this, ?value):

>>>>> (rdflib.term.URIRef('http://ex.com/i1'),rdflib.term.URIRef('http://ex.com/i4'))
>>>>> (rdflib.term.BNode('f79fa7e5d2efe481a9a0e68ee8996a731b1'), None)

If we look at the second tuple, ?value is None in this case.

The interesting thing is that DISTINCT is not defined over None values in the RDFLib code. For example, given a list of ?value bindings such as ['x1', 'x2', ..., None, None, None], the DISTINCT handling cannot pick a distinct None value out of the list, which causes the error in your case.

However, as far as I can tell, handling DISTINCT over None values is not yet defined in the W3C standards for SPARQL. (Interestingly, handling NULL values under DISTINCT is fairly common in MySQL!)
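
For comparison with the SQL behaviour mentioned above, here is a small sketch using SQLite from the Python standard library (swapped in for MySQL purely so that the example is self-contained). The table name, column names and rows are hypothetical; they mirror the ?this/?value solutions of the query, with NULL standing in for the unbound ?value. In SQL, COUNT(DISTINCT col) skips NULLs, so the group whose only value is NULL gets a count of 0 rather than raising an error:

```python
import sqlite3

# Hypothetical table mirroring the (?this, ?value) solutions of the query.
# None is stored as SQL NULL and stands in for "?value is unbound".
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE bindings (this TEXT, value TEXT)")
conn.executemany(
    "INSERT INTO bindings VALUES (?, ?)",
    [
        ("i1", "i2"), ("i1", "i4"),
        ("i2", "b3"), ("i2", "i4"), ("i2", "i6"),
        ("b3", None),
    ],
)

# COUNT(DISTINCT value) ignores NULLs: b3 -> 0, i1 -> 2, i2 -> 3.
rows = conn.execute(
    "SELECT this FROM bindings GROUP BY this "
    "HAVING COUNT(DISTINCT value) < 3 ORDER BY this"
).fetchall()
survivors = [r[0] for r in rows]
print(survivors)
conn.close()
```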

pfps commented 4 years ago

I don't see how None should be handled at all by SPARQL, as it is not part of the RDF data model or the SPARQL model. One major difference between RDF and relational data models is that RDF has no null values, removing a major problem that affects SQL. (Of course, blank nodes are a problem for SPARQL.)

If None is showing up in intermediate results in rdflib then this is an artifact of the implementation, probably signalling that a query variable has no value, and it has to be handled specially.

JervenBolleman commented 4 years ago

@pfps thanks for the tip. It looks like there are three bindings in the example where ?value is not bound, because it is selected in an optional part of the query. These unbound solutions should be removed before the engine gets to the DISTINCT operator.
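
A rough pure-Python sketch of that idea, not rdflib's actual implementation: solutions are modelled as plain dicts (a hypothetical stand-in for the engine's internal rows), and the function name `count_distinct_by_group` is made up for illustration. A solution that does not bind the aggregated variable is skipped before the distinct check, but its group is still created so that it ends up with a count of 0:

```python
from collections import defaultdict

# Hypothetical stand-in for the solutions of the WHERE clause.
# A missing "value" key means the OPTIONAL part did not bind ?value.
solutions = [
    {"this": "ex:i1", "value": "ex:i2"},
    {"this": "ex:i1", "value": "ex:i4"},
    {"this": "ex:i2", "value": "_:i3"},
    {"this": "ex:i2", "value": "ex:i4"},
    {"this": "ex:i2", "value": "ex:i6"},
    {"this": "_:i3"},
]


def count_distinct_by_group(solutions, group_var, agg_var):
    """COUNT(DISTINCT agg_var) per group, ignoring unbound solutions."""
    seen = defaultdict(set)
    for row in solutions:
        group = seen[row[group_var]]  # create the group even if nothing is counted
        if agg_var in row:            # skip solutions where agg_var is unbound
            group.add(row[agg_var])
    return {key: len(values) for key, values in seen.items()}


counts = count_distinct_by_group(solutions, "this", "value")
# Emulate HAVING ( COUNT (DISTINCT ?value) < 3 )
selected = sorted(key for key, n in counts.items() if n < 3)
print(counts)
print(selected)
```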