RDFLib / rdflib

RDFLib is a Python library for working with RDF, a simple yet powerful language for representing information.
https://rdflib.readthedocs.org
BSD 3-Clause "New" or "Revised" License

NQuads: unicode escape issue #352

Closed jmahmud closed 10 years ago

jmahmud commented 10 years ago

Hi

I have an issue parsing N-Quads with rdflib.

I have this string:

'Production date :: 2532\u20132503BC :: circa'

The triple is in a file (see attached) - when it is parsed by rdflib I get an error: ValueError: chr() arg not in range(0x110000). This seems to be generated in py3compat.py.

Digging into it, it seems that the regex correctly identifies the start of the unicode escape, but when it pulls the escape out, it also grabs the hex digits that follow the \u2013, which includes the '2503' characters, i.e. '\u20132503'. The line that is failing is:

r_unicodeEscape = re.compile(r'(\\[uU][0-9A-Fa-f]{4}(?:[0-9A-Fa-f]{4})?)')
def _unicodeExpand(s):
    return r_unicodeEscape.sub(lambda m: chr(int(m.group(0)[2:], 16)), s)  # this is the line that raises the error
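
To make the failure concrete, here is a minimal sketch that runs the pattern quoted above against the sample string outside of rdflib. The names sample, matched_text, code_point and error are only for illustration and do not come from rdflib; the string is written as a raw literal on the assumption that the file contains a literal backslash followed by u2013.

```python
import re

# Pattern as quoted in the report: 4 hex digits, optionally followed by 4 more,
# regardless of whether the escape starts with \u or \U.
r_unicodeEscape = re.compile(r'(\\[uU][0-9A-Fa-f]{4}(?:[0-9A-Fa-f]{4})?)')

# Raw string, so \u2013 stays a literal backslash escape (assumed file content).
sample = r'Production date :: 2532\u20132503BC :: circa'

match = r_unicodeEscape.search(sample)
matched_text = match.group(0)           # the optional group also consumes '2503'
code_point = int(matched_text[2:], 16)  # 0x20132503, far above 0x10FFFF

try:
    chr(code_point)
    error = None
except ValueError as exc:
    error = exc                         # chr() rejects values >= 0x110000

print(matched_text, hex(code_point), error)
```

The optional second run of four hex digits is what lets a \u escape absorb ordinary text that happens to look like hex.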

What does seem to work is either changing m.group(0) to m.group(1), or using the regexes for unicode escape characters as defined in notation3.py (https://github.com/RDFLib/rdflib/blob/master/rdflib/plugins/parsers/notation3.py):

unicodeEscape4 = re.compile(r'\\u([0-9a-f]{4})', flags=re.I)
unicodeEscape8 = re.compile(r'\\U([0-9a-f]{8})', flags=re.I)
def _unicodeExpand(s):
    a = unicodeEscape4.sub(lambda m: chr(int(m.group(0)[2:], 16)), s) 
    return unicodeEscape8.sub(lambda m: chr(int(m.group(0)[2:], 16)), a)

Would this be an adequate solution?

Thanks Josh

joernhees commented 10 years ago

group(1) is not an option, as it's wrong for \UXXXXXXXX escapes.

joernhees commented 10 years ago

I think you can just use this regexp:

r_unicodeEscape = re.compile(r'(\\u[0-9A-Fa-f]{4}|\\U[0-9A-Fa-f]{8})')