Closed GreenfishK closed 2 years ago
would be interesting to know if this works with RDF4J and Jena, will take a look when I can, possibly only next week somewhere.
AFAICS this is not valid Turtle. The parens need to be escaped, like ns2:spouse\(s\)_
, otherwise this cannot be disambiguated from a list, like: <> ns2:spouse(ns2:s).
(which is valid Turtle).
The nt cannot be the same as NTriples cannot contain PNames.
(Also, the ns2
prefix appears undeclared in the example; typo of ns1
I presume?)
AFAICS this is not valid Turtle. The parens need to be escaped, like
ns2:spouse\(s\)_
, otherwise this cannot be disambiguated from a list, like:<> ns2:spouse(ns2:s).
(which is valid Turtle).The nt cannot be the same as NTriples cannot contain PNames.
(Also, the
ns2
prefix appears undeclared in the example; typo ofns1
I presume?)
Ok, so it is not a valid turtle, but is it a valid nt?:
<http://dbpedia.org/resource/Maine_Mendoza> <http://dbpedia.org/property/spouse(s)_> "Nate Winchester 2013-Present"@en .
The nt parser has no problem with it.
Yes, the ns2 was a typo.
Yes, that line is valid NT and valid Turtle. It's the PName (which is only in Turtle) which is invalid, as the parens must be escaped in a PName.
I think RDFLib outputs invalid PNames here though, so it would fail on round-tripping.
Yes, that line is valid NT and valid Turtle. It's the PName (which is only in Turtle) which is invalid, as the parens must be escaped in a PName.
I think RDFLib outputs invalid PNames here though, so it would fail on round-tripping.
I created the triple by parsing the nt triple and serializing it as ttl. So the problem is actually with the serializer.
@GreenfishK can you please create a new issue to indicate the serialization error and close this one? Just so we can keep on top of all the issues. Thanks.
Is it OK to just rephrase the title? The issue is with serializing (and thus round-tripping) the given data. I took a stab at it, and a fix in rdflib/plugins/serializers/turtle.py
should be along the lines of:
import re
...
# From: https://www.w3.org/TR/turtle/#grammar-production-PN_LOCAL_ESC
# Note that '_.-' are OK within, but need escaping at start/end.
PNAME_LOCAL_ESCAPES = re.compile(r"([~!$&'()*+,;=/?#@%])")
...
def getQName(self, uri, gen_prefix=True): # should be renamed to get_pname ...
....
local = PNAME_LOCAL_ESCAPES.sub(r"\\\1", local)
....
(Sorry for just suggesting the fix in a comment. If I find more time I can make a PR, albeit I suspect I'd like to overhaul the Turtle handling thoroughly (using PNames consistently), and then I'd end up rewriting this and the parser.)
I've just updated the title.
I'd end up rewriting this and the parser
The Turtle parser is ancient and a bit of a mess. The N-Triples parser/serializer also needs attention. It would be great to see an overhaul although I fear this would be a large task. I suspect that modern Python domain-specific language handling could greatly improve things here but I don't know the area much...
...and the current parsers/serializers to mostly work of course!
Recently I added the longturtle
turtle variant serializer to allow for Git-friendly Turtle files, so I got into the serializer a bit there but just wrapped the underlying N3 serializer with some slightly different, still valid Turtle, conventions, rather than implementing an actual, new, serializer. I'm a coward!
@niklasl if you want to do a major rewrite of any of these things, let us know! It could be a great consider that a great rdflib 7.0.0 candidate update. However, it would also be great just to see a small PR for this PName fix by itself!
I will see about integrating @niklasl 's suggestion tonight. I think we should have a discussion about the future of the parsers though, or at least discuss some options, but like @nicholascar said for now we will do our best to get the ones we have in the best shape and hopefully that leaves us with a good test suite to ensure future parsers behave well.
Thank you all!
I have been looking at this a bit, the only cases in which invalid PNames are returned are:
FAILED test/test_turtle_quoting.py::test_serialize_correctness[turtle-(] - rdflib.plugins.parsers.notation3.BadSyntax: <no detail available>
FAILED test/test_turtle_quoting.py::test_serialize_correctness[turtle-)] - rdflib.plugins.parsers.notation3.BadSyntax: <no detail available>
FAILED test/test_turtle_quoting.py::test_serialize_correctness[turtle-%] - rdflib.plugins.parsers.notation3.BadSyntax: <no detail available>
In all these cases there is no problem:
PASSED test/test_turtle_quoting.py::test_serialize_correctness[turtle-A]
PASSED test/test_turtle_quoting.py::test_serialize_correctness[turtle-_]
PASSED test/test_turtle_quoting.py::test_serialize_correctness[turtle-~]
PASSED test/test_turtle_quoting.py::test_serialize_correctness[turtle-.]
PASSED test/test_turtle_quoting.py::test_serialize_correctness[turtle--]
PASSED test/test_turtle_quoting.py::test_serialize_correctness[turtle-!]
PASSED test/test_turtle_quoting.py::test_serialize_correctness[turtle-$]
PASSED test/test_turtle_quoting.py::test_serialize_correctness[turtle-&]
PASSED test/test_turtle_quoting.py::test_serialize_correctness[turtle-']
PASSED test/test_turtle_quoting.py::test_serialize_correctness[turtle-*]
PASSED test/test_turtle_quoting.py::test_serialize_correctness[turtle-+]
PASSED test/test_turtle_quoting.py::test_serialize_correctness[turtle-,]
PASSED test/test_turtle_quoting.py::test_serialize_correctness[turtle-;]
PASSED test/test_turtle_quoting.py::test_serialize_correctness[turtle-=]
PASSED test/test_turtle_quoting.py::test_serialize_correctness[turtle-/]
PASSED test/test_turtle_quoting.py::test_serialize_correctness[turtle-?]
PASSED test/test_turtle_quoting.py::test_serialize_correctness[turtle-#]
PASSED test/test_turtle_quoting.py::test_serialize_correctness[turtle-@]
And the reason it fails is actually because of is_ncname in rdflib.namespace:
Basically, if "%", "(" and ")" is removed from there all turtle test pass, of course that is not a fix for this, but rather this code is XML specific and should probably not run at all for Turtle based syntax.
@niklasl 's suggestion will fix it though, but will have to dig a bit into this at some point to try and rationalize what is going on here.
Draft fix in https://github.com/RDFLib/rdflib/pull/1678 - however this causes one regression, in that it breaks percent encoded stuff:
$ .venv/bin/pytest 'test/test_issue801.py::TestIssue801::test_issue_801'
============================================================================ test session starts ============================================================================
platform linux -- Python 3.8.12, pytest-6.2.5, py-1.11.0, pluggy-1.0.0
rootdir: /home/iwana/sw/d/github.com/iafork/rdflib, configfile: tox.ini
plugins: subtests-0.5.0, cov-3.0.0
collected 1 item
test/test_issue801.py F [100%]
================================================================================= FAILURES ==================================================================================
________________________________________________________________________ TestIssue801.test_issue_801 ________________________________________________________________________
self = <test.test_issue801.TestIssue801 testMethod=test_issue_801>
def test_issue_801(self):
g = Graph()
example = Namespace("http://example.org/")
g.bind("", example)
node = BNode()
g.add((node, example["first%20name"], Literal("John")))
> self.assertEqual(
g.serialize(format="turtle").split("\n")[-3], '[] :first%20name "John" .'
)
E AssertionError: '[] :first\\%20name "John" .' != '[] :first%20name "John" .'
E - [] :first\%20name "John" .
E ? -
E + [] :first%20name "John" .
test/test_issue801.py:15: AssertionError
========================================================================== short test summary info ==========================================================================
FAILED test/test_issue801.py::TestIssue801::test_issue_801 - AssertionError: '[] :first\\%20name "John" .' != '[] :first%20name "John" .'
============================================================================= 1 failed in 0.18s =============================================================================
Possibly we should just go very conservative and only deal with (
and )
as these are actually the main problems.
So something like what @niklasl suggested but with:
PNAME_LOCAL_ESCAPES = re.compile(r"([()])")
Potentially also with a different name.
I think there may be better places to fix this though, but will look at it more tomorrow. I think a targetted fix for only the brackets may make sense though, and then we can unravel the rest separately.
Possibly we should just go very conservative...
Sure, right down to solving the particular motivating Issue here and no more, unless there are some obvious other cases of things to handle. But don't put in much more effort than that!
@GreenfishK https://github.com/RDFLib/rdflib/pull/1678 should address your issue, but there may still be some further issues with turtle pname serialization that need fixing.
Consider following triple from DBPedia serialized as .ttl:
test.ttl
It is not possible to parse this triple due to the brackets in ns2:spouses(s)_ in a .ttl file. However, there is no such problem for an .nt file.
Code to parse the test.ttl file with the triple inside.
Exception: