Letractively / rdflib

Automatically exported from code.google.com/p/rdflib

N3 serialization fails for certain Literals #184

Closed: GoogleCodeExporter closed this issue 9 years ago

GoogleCodeExporter commented 9 years ago
A serialization/deserialization round trip fails for a certain class of 
Literals: those whose lexical form contains both newline characters and 
multiple consecutive quotation marks (see below). In this case the serializer 
emits invalid N3, which the parser then cannot read back.

What steps will reproduce the problem?
>>> from rdflib.term import URIRef, Literal
>>> from rdflib.graph import ConjunctiveGraph
>>> g=ConjunctiveGraph()
>>> g.add((URIRef('http://foobar'), URIRef('http://fooprop'),
...        Literal('abc\ndef"""""')))
>>> g.serialize(format='n3') # emits invalid N3
'@prefix ns1: <http://> .\n\nns1:foobar ns1:fooprop """abc\ndef\\""""\\"""" .\n\n'
>>> g2=ConjunctiveGraph()
>>> g2.parse(data=g.serialize(format='n3'), format='n3')
------------------------------------------------------------
Traceback (most recent call last):
  File "<ipython console>", line 1, in <module>
  File "/site-python/rdflib/graph.py", line 988, in parse
    location=location, file=file, data=data, **args)
  File "/site-python/rdflib/graph.py", line 784, in parse
    parser.parse(source, self, **args)
  File "/site-python/rdflib/plugins/parsers/notation3.py", line 2257, in parse
    p.loadStream(source.getByteStream())
  File "/site-python/rdflib/plugins/parsers/notation3.py", line 892, in loadStream
    return self.loadBuf(stream.read())   # Not ideal
  File "/site-python/rdflib/plugins/parsers/notation3.py", line 898, in loadBuf
    self.feed(buf)
  File "/site-python/rdflib/plugins/parsers/notation3.py", line 924, in feed
    i = self.directiveOrStatement(str,j)
  File "/site-python/rdflib/plugins/parsers/notation3.py", line 939, in directiveOrStatement
    if j>=0: return self.checkDot(str,j)
  File "/site-python/rdflib/plugins/parsers/notation3.py", line 1478, in checkDot
    str, j, "expected '.' or '}' or ']' at end of statement")
BadSyntax: at line 4 of <>:
Bad syntax (expected '.' or '}' or ']' at end of statement) at ^ in:
"...fix ns1: <http://> .

ns1:foobar ns1:fooprop """abc
def\""""^\"""" .

"
>>> 
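
The caret position in the error above is consistent with a parser that ends a long string at the first unescaped triple quote. The following is a minimal sketch of that scanning rule, not the actual code in notation3.py; the helper name find_long_string_end is hypothetical, and the only assumption is that a backslash escapes exactly the next character.

```python
def find_long_string_end(text, start):
    """Return the index just past the closing triple quote of a long string.

    `start` is the index of the first character after the opening triple quote.
    Hypothetical helper for illustration; not part of rdflib.
    """
    i = start
    while i < len(text):
        if text[i] == "\\":
            i += 2  # skip the backslash and the single character it escapes
        elif text.startswith('"""', i):
            return i + 3  # first unescaped triple quote terminates the string
        else:
            i += 1
    raise ValueError("unterminated long string")


# The statement emitted by the serializer in the report above.
emitted = 'ns1:foobar ns1:fooprop """abc\ndef\\""""\\"""" .'
body_start = emitted.index('"""') + 3
end = find_long_string_end(emitted, body_start)

# Whatever follows the string should be ' .', but it is not.
leftover = emitted[end:]
print(repr(leftover))
```

Under this rule the string ends after the first escaped quote plus three plain quotes, so the remaining text begins with a backslash instead of the statement terminator, which matches the "expected '.'" complaint.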

What is the expected output? What do you see instead?
In correct output, every quotation mark inside the Literal is escaped; only the 
three closing quotation marks that terminate the long string stay unescaped. 
That form can be parsed reliably.
>>> data='@prefix ns1: <http://> .\n\nns1:foobar ns1:fooprop """abc\ndef\\"\\"\\"\\"\\"""" .\n\n'
>>> g3=ConjunctiveGraph()
>>> g3.parse(data=data, format='n3')
<Graph identifier=uTEqibmd364 (<class 'rdflib.graph.Graph'>)>
>>> g.isomorphic(g3)
True
>>> 
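
The expected escaping can be sketched in a few lines of plain Python. This is only an illustration of the rule "escape every quotation mark, then wrap in triple quotes", not the code in rdflib; both function names are hypothetical, and the decoder handles just the two escapes the encoder produces.

```python
def encode_long_string(lexical):
    """Encode `lexical` as an N3 long (triple-quoted) string.

    Hypothetical helper: backslashes are escaped first, then every
    quotation mark, so no unescaped quote remains inside the body.
    """
    escaped = lexical.replace("\\", "\\\\").replace('"', '\\"')
    return '"""' + escaped + '"""'


def decode_long_string(token):
    """Inverse of encode_long_string for the escapes it produces."""
    body = token[3:-3]  # strip the opening and closing triple quotes
    out = []
    i = 0
    while i < len(body):
        if body[i] == "\\":
            out.append(body[i + 1])  # keep the escaped character as-is
            i += 2
        else:
            out.append(body[i])
            i += 1
    return "".join(out)


original = 'abc\ndef"""""'
token = encode_long_string(original)
print(token)
assert decode_long_string(token) == original
```

For the Literal from the report, the token produced here has the same shape as the hand-written data string above: five escaped quotation marks followed by the closing triple quote.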

What version of the product are you using? On what operating system?
revision 37d3cbfff340

Please provide any additional information below.

Original issue reported on code.google.com by bernhard...@gmail.com on 23 Aug 2011 at 7:08

GoogleCodeExporter commented 9 years ago
I also came across this. It's even more broken with N-Triples (which does not 
allow triple-quoted strings).

I will sit down and work the regex-fu some day... 

Original comment by gromgull on 23 Aug 2011 at 7:12
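
Since N-Triples only has single-line, double-quoted strings, newlines have to be escaped as well as quotes. A rough sketch of that encoding follows, assuming only the common escapes are needed; encode_ntriples_string is a hypothetical name, not an rdflib function, and non-ASCII \u escapes are deliberately left out.

```python
def encode_ntriples_string(lexical):
    """Encode `lexical` as a single-line, double-quoted N-Triples string.

    Hypothetical helper for illustration. The backslash must be replaced
    first so the backslashes added by later replacements are not doubled.
    """
    replacements = [
        ("\\", "\\\\"),
        ('"', '\\"'),
        ("\n", "\\n"),
        ("\r", "\\r"),
        ("\t", "\\t"),
    ]
    for raw, escaped in replacements:
        lexical = lexical.replace(raw, escaped)
    return '"' + lexical + '"'


encoded = encode_ntriples_string('abc\ndef"""""')
print(encoded)
```

The result contains no raw newline and no unescaped quotation mark apart from the two delimiters.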

GoogleCodeExporter commented 9 years ago
To fix the NT output I hacked together a workaround some time ago, but it 
relies on valid N3 output. It's certainly not the most elegant way to do it, 
but I have not yet found a case where it does not work (except the cases where 
invalid N3 is emitted).

Original comment by bernhard...@gmail.com on 23 Aug 2011 at 7:42

Attachments:

GoogleCodeExporter commented 9 years ago
Is this actually a bug or just an infelicity of expression, allowed by Python's 
extra-flexible string delimiters? 

N3 strings are delimited by double quotes, and single quotes do not need to be 
escaped, so the object of the statement in question might be more 
appropriately, if more cumbersomely, expressed and escaped as: 

Literal("abc\\ndef\"\"\"\"\"")

And then the round-tripping flows smoothly ...

>>> from rdflib.term import URIRef, Literal
>>> from rdflib.graph import ConjunctiveGraph
>>> g=ConjunctiveGraph()
>>> g.add((URIRef('http://foobar'), URIRef('http://fooprop'),
...        Literal("abc\\ndef\"\"\"\"\"")))
>>> g.serialize(format='n3') # would appear to emit valid N3 this time
'@prefix ns1: <http://> .\n\nns1:foobar ns1:fooprop "abc\\\\ndef\\"\\"\\"\\"\\"" .\n\n'
>>> g2=ConjunctiveGraph()
>>> g2.parse(data=g.serialize(format='n3'), format='n3')
<Graph identifier=FlMFGRpm27 (<class 'rdflib.graph.Graph'>)>
>>> g2.serialize(format="n3")
'@prefix ns1: <http://> .\n\nns1:foobar ns1:fooprop "abc\\\\ndef\\"\\"\\"\\"\\"" .\n\n'
>>> print(g2.serialize(format="n3"))
@prefix ns1: <http://> .

ns1:foobar ns1:fooprop "abc\\ndef\"\"\"\"\"" .

>>> g.isomorphic(g2)
True
>>> 

Original comment by gjhigg...@gmail.com on 24 Oct 2011 at 12:03

GoogleCodeExporter commented 9 years ago
There is a mistake in your code: the string you used is not the same as the one 
in my example, because you inserted a double backslash. 

>>> a='abc\ndef"""""'
>>> b="abc\\ndef\"\"\"\"\""
>>> a==b
False

The problem comes from the combination of double quotes and newlines:

>>> a='abc\ndef"""""'
>>> b="abc\ndef\"\"\"\"\""
>>> a==b
True
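
To make the difference between the two spellings explicit, here is a small check in plain Python (no rdflib involved; the variable names with_backslash_n and with_newline are just for illustration). The doubled backslash produces a backslash followed by the letter n, not a newline character.

```python
a = 'abc\ndef"""""'                         # contains a real newline
with_backslash_n = "abc\\ndef\"\"\"\"\""   # backslash + 'n', no newline
with_newline = "abc\ndef\"\"\"\"\""        # same characters as `a`

# The doubled backslash adds one extra character and removes the newline.
print(len(a), len(with_backslash_n), len(with_newline))
print("\n" in a, "\n" in with_backslash_n, "\n" in with_newline)
print(a == with_newline)
```

Only the strings that contain a real newline are serialized in the triple-quoted form, which is why the double-backslash variant did not trigger the problem.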

>>> from rdflib.term import URIRef, Literal
>>> from rdflib.graph import ConjunctiveGraph
>>> g=ConjunctiveGraph()
>>> g.add((URIRef('http://foobar'), URIRef('http://fooprop'),
...        Literal("abc\ndef\"\"\"\"\"")))  # note: \n but not \\n
>>> g2=ConjunctiveGraph()
>>> g2.parse(data=g.serialize(format='n3'), format='n3')
------------------------------------------------------------
Traceback (most recent call last):
  File "<ipython console>", line 1, in <module>
  File "/Users/bs/Data/work/projects/gnowsis/gnowsisweb/site-python/rdflib/graph.py", line 988, in parse
    location=location, file=file, data=data, **args)
  File "/Users/bs/Data/work/projects/gnowsis/gnowsisweb/site-python/rdflib/graph.py", line 784, in parse
    parser.parse(source, self, **args)
  File "/Users/bs/Data/work/projects/gnowsis/gnowsisweb/site-python/rdflib/plugins/parsers/notation3.py", line 2257, in parse
    p.loadStream(source.getByteStream())
  File "/Users/bs/Data/work/projects/gnowsis/gnowsisweb/site-python/rdflib/plugins/parsers/notation3.py", line 892, in loadStream
    return self.loadBuf(stream.read())   # Not ideal
  File "/Users/bs/Data/work/projects/gnowsis/gnowsisweb/site-python/rdflib/plugins/parsers/notation3.py", line 898, in loadBuf
    self.feed(buf)
  File "/Users/bs/Data/work/projects/gnowsis/gnowsisweb/site-python/rdflib/plugins/parsers/notation3.py", line 924, in feed
    i = self.directiveOrStatement(str,j)
  File "/Users/bs/Data/work/projects/gnowsis/gnowsisweb/site-python/rdflib/plugins/parsers/notation3.py", line 939, in directiveOrStatement
    if j>=0: return self.checkDot(str,j)
  File "/Users/bs/Data/work/projects/gnowsis/gnowsisweb/site-python/rdflib/plugins/parsers/notation3.py", line 1478, in checkDot
    str, j, "expected '.' or '}' or ']' at end of statement")
BadSyntax: at line 4 of <>:
Bad syntax (expected '.' or '}' or ']' at end of statement) at ^ in:
"...fix ns1: <http://> .

ns1:foobar ns1:fooprop """abc
def\""""^\"""" .

"

Original comment by bernhard...@gmail.com on 24 Oct 2011 at 5:10

GoogleCodeExporter commented 9 years ago
Sorry about that, I was focused on the wrong area. You're quite correct: it 
was a serialization issue, in Literal._quote_encode.

Changesets 7ac96f5a5a24 and 372e190cf28d fix this issue and add an 
issue-specific test. 

Original comment by gjhigg...@gmail.com on 25 Oct 2011 at 4:40