SPARQL parser bug in evaluating unicode escapes

kasei commented 2 years ago

There seem to be a couple of bugs in rdflib.plugins.sparql.parser in the handling of unicode escapes.

Trying to parse this query:

SELECT * WHERE { ?s ?p "\u00a71234" }

reveals two bugs. The first is in error handling in expandUnicodeEscapes where constructing the error message fails due to an attempt to concatenate a string and a Match object:

TypeError: can only concatenate str (not "re.Match") to str
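For readers unfamiliar with this failure mode, here is a minimal sketch of how such a TypeError arises and one way to avoid it. The pattern and message text below are illustrative stand-ins, not a copy of the rdflib source; the only assumption is that the error path receives an `re.Match` object.

```python
import re

# Hypothetical stand-in pattern; only used to obtain a Match object.
pattern = re.compile(r"\\u([0-9a-f]{4})", flags=re.I)
m = pattern.search(r'"\u00a7"')

try:
    # Concatenating str and re.Match raises TypeError.
    message = "Invalid unicode code point: " + m
except TypeError as exc:
    broken = str(exc)

# Using the matched text instead of the Match object works.
fixed = "Invalid unicode code point: " + m.group(0)
print(broken)
print(fixed)
```

Calling `m.group(0)` yields the escape sequence as written in the query, which is probably the most useful thing to show in the error message.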

Fixing this reveals the more serious issue:

ValueError: chr() arg not in range(0x110000)

The regular expression used to unescape the unicode data uses the IGNORECASE flag:

expandUnicodeEscapes_re = re.compile(r"\\u([0-9a-f]{4}(?:[0-9a-f]{4})?)", flags=re.I)

This is fine for matching varied-case hex characters, but it conflates the \u and \U handling, allowing either form to match either 4 or 8 hex digits. In the example query above, the lowercase \u should only use the first 4 hex digits (00a7) to produce the character §, resulting in the object literal "§1234". Instead, it finds 8 valid-looking hex digits (00a71234) and then fails because this is a number outside the range of valid unicode codepoints. A fix for this issue should differentiate the two escaping cases \u and \U and match only 4 and 8 digits, respectively.
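A rough sketch of the difference, using only the standard `re` module. The first pattern is the one quoted above; `ESCAPE_RE` and `expand_unicode_escapes` are hypothetical names for one possible fix, not the change that was eventually made in rdflib.

```python
import re

# The pattern quoted above: IGNORECASE makes \u and \U interchangeable.
old_re = re.compile(r"\\u([0-9a-f]{4}(?:[0-9a-f]{4})?)", flags=re.I)
captured = old_re.search(r'"\u00a71234"').group(1)  # all 8 hex digits

try:
    chr(int(captured, 16))
    old_error = None
except ValueError as exc:
    old_error = str(exc)  # the code point is above U+10FFFF

# Possible fix (assumption): keep the escape letter case-sensitive and
# spell out both hex cases, so \u takes exactly 4 digits and \U exactly 8.
ESCAPE_RE = re.compile(r"\\u([0-9A-Fa-f]{4})|\\U([0-9A-Fa-f]{8})")


def expand_unicode_escapes(query: str) -> str:
    """Replace \\uXXXX and \\UXXXXXXXX escapes with the characters they name."""

    def expand(match: "re.Match[str]") -> str:
        hex_digits = match.group(1) or match.group(2)
        try:
            return chr(int(hex_digits, 16))
        except ValueError as exc:
            raise ValueError(
                "Invalid unicode code point: " + match.group(0)
            ) from exc

    return ESCAPE_RE.sub(expand, query)


print(expand_unicode_escapes(r'SELECT * WHERE { ?s ?p "\u00a71234" }'))
```

With this sketch the lowercase escape consumes only `00a7`, and the trailing `1234` stays in the literal as ordinary text.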

This bug also allows parsing of what should be invalid input when a \U escape ends up matching only 4 hex digits. For example, SELECT * WHERE { ?s ?p "\U0001HHHH" } is parsed without error, despite the invalid escape sequence.
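The invalid-input case can be seen directly with the quoted pattern. This is only a sketch of the regular expression's behaviour in isolation; what the rest of the parser then does with the result is not shown here.

```python
import re

# The pattern quoted earlier in this report.
old_re = re.compile(r"\\u([0-9a-f]{4}(?:[0-9a-f]{4})?)", flags=re.I)

m = old_re.search(r'SELECT * WHERE { ?s ?p "\U0001HHHH" }')
matched_text = m.group(0)   # the uppercase \U escape matched anyway
hex_digits = m.group(1)     # only 4 digits, since "HHHH" is not hex

# Substituting the match silently turns the escape into U+0001.
expanded = old_re.sub(lambda mm: chr(int(mm.group(1), 16)), r'"\U0001HHHH"')
print(repr(expanded))
```

The optional second group simply fails to match `HHHH`, so the pattern falls back to the 4-digit form instead of reporting an error.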

ghost commented 2 years ago

Um, could you provide more details on platform and Python version? In my profound ineptitude, I don't seem to be able to reproduce the reported error using Python 3.8 on a Linux Mint 20.3 Una distro.

Here's my test code:

$ cat test_unicode_escape.py 
import rdflib

tarek = rdflib.URIRef("urn:example:tarek")
likes = rdflib.URIRef("urn:example:likes")

g = rdflib.Graph()

g.add((tarek, likes, rdflib.Literal("\u00a71234")))

assert list(g) == [
    (
        rdflib.term.URIRef('urn:example:tarek'),
        rdflib.term.URIRef('urn:example:likes'),
        rdflib.term.Literal('§1234')
    )
]

q = """SELECT * WHERE { ?s ?p "\u00a71234" }"""

res = g.query(q)
lres = list(res)
assert rdflib.term.URIRef('urn:example:likes') in lres[0]
assert rdflib.term.URIRef('urn:example:tarek') in lres[0]

# q = 'SELECT * WHERE { ?s ?p "\U0001HHHH" }'

(I can't even get SELECT * WHERE { ?s ?p "\U0001HHHH" } past the interpreter):


$ python test_unicode_escape.py 
  File "test_unicode_escape.py", line 19
    q = 'SELECT * WHERE { ?s ?p "\U0001HHHH" }'
        ^
SyntaxError: (unicode error) 'unicodeescape' codec can't decode bytes in position 24-29: truncated \UXXXXXXXX escape

kasei commented 2 years ago

Sure. Apologies. I'm not super familiar with rdflib, so specifics of the calling code may be important here. I'm trying this on Python 3.8.9 on macOS. I didn't try to evaluate the query, just parse it. In the code below, both parseQuery and prepareQuery lead to problems.

#!/usr/bin/env python3

from rdflib.plugins.sparql import prepareQuery
from rdflib.plugins.sparql.parser import parseQuery

if __name__ == '__main__':
    sparql = 'SELECT * WHERE { ?s ?p "\\u00a71234" }'
    print(parseQuery(sparql))
    print(prepareQuery(sparql))

kasei commented 2 years ago

I also suspect that you may be hitting python encoding issues (and not sparql parser encoding issues) when you use just a single backslash:

q = """SELECT * WHERE { ?s ?p "\u00a71234" }"""
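To make the distinction concrete, here is a small sketch in plain Python (no rdflib needed) comparing the three ways of writing the query string. The variable names are only illustrative.

```python
# Single backslash: Python itself expands \u00a7 before any parser sees it.
single = 'SELECT * WHERE { ?s ?p "\u00a71234" }'

# Doubled backslash or a raw string: the backslash survives, so the
# SPARQL parser receives the escape sequence and has to expand it.
double = 'SELECT * WHERE { ?s ?p "\\u00a71234" }'
raw = r'SELECT * WHERE { ?s ?p "\u00a71234" }'

print(single)
print(double)
print(double == raw)
```

Only the `double` and `raw` forms exercise the parser's own escape handling, which is where the reported bug lives.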

ghost commented 2 years ago

> I also suspect that you may be hitting python encoding issues (and not sparql parser encoding issues) when you use just a single backslash:
> q = """SELECT * WHERE { ?s ?p "\u00a71234" }"""

I guess my limited model of the escaping principles is preventing me from achieving a full understanding of the issue you're reporting, gonna have to leave it to more knowledgeable folks 😮‍💨

kasei commented 2 years ago

> I guess my limited model of the escaping principles is preventing me from achieving a full understanding of the issue you're reporting, gonna have to leave it to more knowledgeable folks 😮‍💨

It might be more understandable if you put SELECT * WHERE { ?s ?p "\u00a71234" } into a test.rq file, and then read the contents of the file into the variable q. Then you skip python trying to unescape stuff in the query as if it were just a string in python code.
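A sketch of that approach follows. The temporary directory is only there to keep the example self-contained; in practice test.rq would be written in an editor, and the raw string below just stands in for its contents.

```python
import tempfile
from pathlib import Path

with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "test.rq"
    # Stand-in for a file saved from an editor: it holds a literal
    # backslash followed by u00a71234.
    path.write_text(r'SELECT * WHERE { ?s ?p "\u00a71234" }', encoding="utf-8")

    # Reading the file does not interpret escapes, so q still contains
    # the backslash sequence for the SPARQL parser to handle.
    q = path.read_text(encoding="utf-8")

print(q)
```

The resulting `q` can then be passed to parseQuery or prepareQuery as in the earlier script.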

ghost commented 2 years ago

> It might be more understandable if you put SELECT * WHERE { ?s ?p "\u00a71234" } into a test.rq file, and then read the contents of the file into the variable q.

Ah right, I see. Thank you for your patience. For those following along at home, there's a discussion of numeric character escapes in https://github.com/w3c/sparql-12/issues/77 which I found informative. It's also worth noting that RDFLib currently isn't testing the SPARQL parser against the latest W3C test suite; it uses an earlier version, which doesn't have the codepoint tests added in https://github.com/w3c/rdf-tests/pull/67. @aucampia and I are gradually hewing the RDFLib test suite into better shape and are close to migrating it to the latest W3C test suite, where the problematic example originally posted (above) is an existing test datum (edit: incorrect, that test uses "\U0001f46a").

ajnelson-nist commented 2 years ago

I noticed one place in the current (aa6cde39) code uses the unicodedata built-in module. Is this thread stumbling on another use case for that module?

RDFLib / rdflib

SPARQL parser bug in evaluating unicode escapes #1884