RDFLib / rdflib

RDFLib is a Python library for working with RDF, a simple yet powerful language for representing information.
https://rdflib.readthedocs.org
BSD 3-Clause "New" or "Revised" License
2.17k stars 556 forks source link

Turtle serializing creates invalid PNames #1661

Closed GreenfishK closed 2 years ago

GreenfishK commented 2 years ago

Consider following triple from DBPedia serialized as .ttl:

test.ttl

@prefix ns1: <http://dbpedia.org/property/> .
<http://dbpedia.org/resource/Maine_Mendoza> ns2:spouse(s)_ "Nate Winchester 2013-Present"@en .

It is not possible to parse this triple due to the brackets in ns2:spouses(s)_ in a .ttl file. However, there is no such problem for an .nt file.

Code to parse the test.ttl file with the triple inside.

from rdflib import Graph
g = Graph()
g.parse(test.ttl)

Exception:

Traceback (most recent call last):
  File "/home/fkovacev/PhD/RDFStarVersioningEval/OSTRICH/BEAR/scripts/build_tb_rdf_star_datasets.py", line 204, in <module>
    construct_tb_star_ds(source_ic0=data_dir + "/alldata.IC.nt/000001.nt",
  File "/home/fkovacev/PhD/RDFStarVersioningEval/OSTRICH/BEAR/scripts/build_tb_rdf_star_datasets.py", line 140, in construct_tb_star_ds
    cs_add.parse(source_cs + "/" + t[1])
  File "/home/fkovacev/anaconda3/envs/baseDSEnv/lib/python3.8/site-packages/rdflib/graph.py", line 1261, in parse
    raise se
  File "/home/fkovacev/anaconda3/envs/baseDSEnv/lib/python3.8/site-packages/rdflib/graph.py", line 1252, in parse
    parser.parse(source, self, **args)
  File "/home/fkovacev/anaconda3/envs/baseDSEnv/lib/python3.8/site-packages/rdflib/plugins/parsers/notation3.py", line 1921, in parse
    p.loadStream(stream)
  File "/home/fkovacev/anaconda3/envs/baseDSEnv/lib/python3.8/site-packages/rdflib/plugins/parsers/notation3.py", line 438, in loadStream
    return self.loadBuf(stream.read())  # Not ideal
  File "/home/fkovacev/anaconda3/envs/baseDSEnv/lib/python3.8/site-packages/rdflib/plugins/parsers/notation3.py", line 444, in loadBuf
    self.feed(buf)
  File "/home/fkovacev/anaconda3/envs/baseDSEnv/lib/python3.8/site-packages/rdflib/plugins/parsers/notation3.py", line 470, in feed
    i = self.directiveOrStatement(s, j)
  File "/home/fkovacev/anaconda3/envs/baseDSEnv/lib/python3.8/site-packages/rdflib/plugins/parsers/notation3.py", line 490, in directiveOrStatement
    j = self.statement(argstr, i)
  File "/home/fkovacev/anaconda3/envs/baseDSEnv/lib/python3.8/site-packages/rdflib/plugins/parsers/notation3.py", line 740, in statement
    j = self.property_list(argstr, i, r[0])
  File "/home/fkovacev/anaconda3/envs/baseDSEnv/lib/python3.8/site-packages/rdflib/plugins/parsers/notation3.py", line 1088, in property_list
    i = self.objectList(argstr, j, objs)
  File "/home/fkovacev/anaconda3/envs/baseDSEnv/lib/python3.8/site-packages/rdflib/plugins/parsers/notation3.py", line 1132, in objectList
    i = self.object(argstr, i, res)
  File "/home/fkovacev/anaconda3/envs/baseDSEnv/lib/python3.8/site-packages/rdflib/plugins/parsers/notation3.py", line 1418, in object
    j = self.subject(argstr, i, res)
  File "/home/fkovacev/anaconda3/envs/baseDSEnv/lib/python3.8/site-packages/rdflib/plugins/parsers/notation3.py", line 747, in subject
    return self.item(argstr, i, res)
  File "/home/fkovacev/anaconda3/envs/baseDSEnv/lib/python3.8/site-packages/rdflib/plugins/parsers/notation3.py", line 839, in item
    return self.path(argstr, i, res)
  File "/home/fkovacev/anaconda3/envs/baseDSEnv/lib/python3.8/site-packages/rdflib/plugins/parsers/notation3.py", line 846, in path
    j = self.nodeOrLiteral(argstr, i, res)
  File "/home/fkovacev/anaconda3/envs/baseDSEnv/lib/python3.8/site-packages/rdflib/plugins/parsers/notation3.py", line 1446, in nodeOrLiteral
    j = self.node(argstr, i, res)
  File "/home/fkovacev/anaconda3/envs/baseDSEnv/lib/python3.8/site-packages/rdflib/plugins/parsers/notation3.py", line 1025, in node
    self.BadSyntax(argstr, i, "expected item in list or ')'")
  File "/home/fkovacev/anaconda3/envs/baseDSEnv/lib/python3.8/site-packages/rdflib/plugins/parsers/notation3.py", line 1653, in BadSyntax
    raise BadSyntax(self._thisDoc, self.lines, argstr, i, msg)
rdflib.plugins.parsers.notation3.BadSyntax: at line 3 of <>:
Bad syntax (expected item in list or ')') at ^ in:
"...b'> .\n\n<http://dbpedia.org/resource/Maine_Mendoza> ns1:spouse('^b's)_ "Nate Winchester 2013-Present"@en .\n\n\n'"
aucampia commented 2 years ago

would be interesting to know if this works with RDF4J and Jena, will take a look when I can, possibly only next week somewhere.

niklasl commented 2 years ago

AFAICS this is not valid Turtle. The parens need to be escaped, like ns2:spouse\(s\)_, otherwise this cannot be disambiguated from a list, like: <> ns2:spouse(ns2:s). (which is valid Turtle).

The nt cannot be the same as NTriples cannot contain PNames.

(Also, the ns2 prefix appears undeclared in the example; typo of ns1 I presume?)

GreenfishK commented 2 years ago

AFAICS this is not valid Turtle. The parens need to be escaped, like ns2:spouse\(s\)_, otherwise this cannot be disambiguated from a list, like: <> ns2:spouse(ns2:s). (which is valid Turtle).

The nt cannot be the same as NTriples cannot contain PNames.

(Also, the ns2 prefix appears undeclared in the example; typo of ns1 I presume?)

Ok, so it is not a valid turtle, but is it a valid nt?: <http://dbpedia.org/resource/Maine_Mendoza> <http://dbpedia.org/property/spouse(s)_> "Nate Winchester 2013-Present"@en . The nt parser has no problem with it.

Yes, the ns2 was a typo.

niklasl commented 2 years ago

Yes, that line is valid NT and valid Turtle. It's the PName (which is only in Turtle) which is invalid, as the parens must be escaped in a PName.

I think RDFLib outputs invalid PNames here though, so it would fail on round-tripping.

GreenfishK commented 2 years ago

Yes, that line is valid NT and valid Turtle. It's the PName (which is only in Turtle) which is invalid, as the parens must be escaped in a PName.

I think RDFLib outputs invalid PNames here though, so it would fail on round-tripping.

I created the triple by parsing the nt triple and serializing it as ttl. So the problem is actually with the serializer.

nicholascar commented 2 years ago

@GreenfishK can you please create a new issue to indicate the serialization error and close this one? Just so we can keep on top of all the issues. Thanks.

niklasl commented 2 years ago

Is it OK to just rephrase the title? The issue is with serializing (and thus round-tripping) the given data. I took a stab at it, and a fix in rdflib/plugins/serializers/turtle.py should be along the lines of:


import re
...
# From: https://www.w3.org/TR/turtle/#grammar-production-PN_LOCAL_ESC
# Note that '_.-' are OK within, but need escaping at start/end.
PNAME_LOCAL_ESCAPES = re.compile(r"([~!$&'()*+,;=/?#@%])")
...

    def getQName(self, uri, gen_prefix=True): # should be renamed to get_pname ...
        ....
        local = PNAME_LOCAL_ESCAPES.sub(r"\\\1", local)
        ....

(Sorry for just suggesting the fix in a comment. If I find more time I can make a PR, albeit I suspect I'd like to overhaul the Turtle handling thoroughly (using PNames consistently), and then I'd end up rewriting this and the parser.)

nicholascar commented 2 years ago

I've just updated the title.

I'd end up rewriting this and the parser

The Turtle parser is ancient and a bit of a mess. The N-Triples parser/serializer also needs attention. It would be great to see an overhaul although I fear this would be a large task. I suspect that modern Python domain-specific language handling could greatly improve things here but I don't know the area much...

...and the current parsers/serializers to mostly work of course!

Recently I added the longturtle turtle variant serializer to allow for Git-friendly Turtle files, so I got into the serializer a bit there but just wrapped the underlying N3 serializer with some slightly different, still valid Turtle, conventions, rather than implementing an actual, new, serializer. I'm a coward!

@niklasl if you want to do a major rewrite of any of these things, let us know! It could be a great consider that a great rdflib 7.0.0 candidate update. However, it would also be great just to see a small PR for this PName fix by itself!

aucampia commented 2 years ago

I will see about integrating @niklasl 's suggestion tonight. I think we should have a discussion about the future of the parsers though, or at least discuss some options, but like @nicholascar said for now we will do our best to get the ones we have in the best shape and hopefully that leaves us with a good test suite to ensure future parsers behave well.

GreenfishK commented 2 years ago

Thank you all!

aucampia commented 2 years ago

I have been looking at this a bit, the only cases in which invalid PNames are returned are:

FAILED test/test_turtle_quoting.py::test_serialize_correctness[turtle-(] - rdflib.plugins.parsers.notation3.BadSyntax: <no detail available>
FAILED test/test_turtle_quoting.py::test_serialize_correctness[turtle-)] - rdflib.plugins.parsers.notation3.BadSyntax: <no detail available>
FAILED test/test_turtle_quoting.py::test_serialize_correctness[turtle-%] - rdflib.plugins.parsers.notation3.BadSyntax: <no detail available>

In all these cases there is no problem:

PASSED test/test_turtle_quoting.py::test_serialize_correctness[turtle-A]
PASSED test/test_turtle_quoting.py::test_serialize_correctness[turtle-_]
PASSED test/test_turtle_quoting.py::test_serialize_correctness[turtle-~]
PASSED test/test_turtle_quoting.py::test_serialize_correctness[turtle-.]
PASSED test/test_turtle_quoting.py::test_serialize_correctness[turtle--]
PASSED test/test_turtle_quoting.py::test_serialize_correctness[turtle-!]
PASSED test/test_turtle_quoting.py::test_serialize_correctness[turtle-$]
PASSED test/test_turtle_quoting.py::test_serialize_correctness[turtle-&]
PASSED test/test_turtle_quoting.py::test_serialize_correctness[turtle-']
PASSED test/test_turtle_quoting.py::test_serialize_correctness[turtle-*]
PASSED test/test_turtle_quoting.py::test_serialize_correctness[turtle-+]
PASSED test/test_turtle_quoting.py::test_serialize_correctness[turtle-,]
PASSED test/test_turtle_quoting.py::test_serialize_correctness[turtle-;]
PASSED test/test_turtle_quoting.py::test_serialize_correctness[turtle-=]
PASSED test/test_turtle_quoting.py::test_serialize_correctness[turtle-/]
PASSED test/test_turtle_quoting.py::test_serialize_correctness[turtle-?]
PASSED test/test_turtle_quoting.py::test_serialize_correctness[turtle-#]
PASSED test/test_turtle_quoting.py::test_serialize_correctness[turtle-@]

Test code: https://github.com/aucampia/rdflib/blob/092f228752a89e2c1eecac376a7c148e64a83241/test/test_turtle_quoting.py#L155-L176

And the reason it fails is actually because of is_ncname in rdflib.namespace:

https://github.com/RDFLib/rdflib/blob/c0bd5eaaa983461445b9469d731be4ae0e0cfc54/rdflib/namespace/__init__.py#L643-L669

Basically, if "%", "(" and ")" is removed from there all turtle test pass, of course that is not a fix for this, but rather this code is XML specific and should probably not run at all for Turtle based syntax.

@niklasl 's suggestion will fix it though, but will have to dig a bit into this at some point to try and rationalize what is going on here.

aucampia commented 2 years ago

Draft fix in https://github.com/RDFLib/rdflib/pull/1678 - however this causes one regression, in that it breaks percent encoded stuff:

$ .venv/bin/pytest  'test/test_issue801.py::TestIssue801::test_issue_801'
============================================================================ test session starts ============================================================================
platform linux -- Python 3.8.12, pytest-6.2.5, py-1.11.0, pluggy-1.0.0
rootdir: /home/iwana/sw/d/github.com/iafork/rdflib, configfile: tox.ini
plugins: subtests-0.5.0, cov-3.0.0
collected 1 item                                                                                                                                                            

test/test_issue801.py F                                                                                                                                               [100%]

================================================================================= FAILURES ==================================================================================
________________________________________________________________________ TestIssue801.test_issue_801 ________________________________________________________________________

self = <test.test_issue801.TestIssue801 testMethod=test_issue_801>

    def test_issue_801(self):
        g = Graph()
        example = Namespace("http://example.org/")
        g.bind("", example)
        node = BNode()
        g.add((node, example["first%20name"], Literal("John")))
>       self.assertEqual(
            g.serialize(format="turtle").split("\n")[-3], '[] :first%20name "John" .'
        )
E       AssertionError: '[] :first\\%20name "John" .' != '[] :first%20name "John" .'
E       - [] :first\%20name "John" .
E       ?          -
E       + [] :first%20name "John" .

test/test_issue801.py:15: AssertionError
========================================================================== short test summary info ==========================================================================
FAILED test/test_issue801.py::TestIssue801::test_issue_801 - AssertionError: '[] :first\\%20name "John" .' != '[] :first%20name "John" .'
============================================================================= 1 failed in 0.18s =============================================================================

Possibly we should just go very conservative and only deal with ( and ) as these are actually the main problems.

So something like what @niklasl suggested but with:

PNAME_LOCAL_ESCAPES = re.compile(r"([()])")

Potentially also with a different name.

I think there may be better places to fix this though, but will look at it more tomorrow. I think a targetted fix for only the brackets may make sense though, and then we can unravel the rest separately.

nicholascar commented 2 years ago

Possibly we should just go very conservative...

Sure, right down to solving the particular motivating Issue here and no more, unless there are some obvious other cases of things to handle. But don't put in much more effort than that!

aucampia commented 2 years ago

@GreenfishK https://github.com/RDFLib/rdflib/pull/1678 should address your issue, but there may still be some further issues with turtle pname serialization that need fixing.