Invalid URI - Githubissues

petrux commented 10 years ago

I got this:

  File "ws23/ws23.py", line 35, in web_search_to_triples
    triples = g.serialize(format=RDFLIB_FORMAT).split("\n")
  File "/usr/local/lib/python2.7/dist-packages/rdflib/graph.py", line 936, in serialize
    serializer.serialize(stream, base=base, encoding=encoding, **args)
  File "/usr/local/lib/python2.7/dist-packages/rdflib/plugins/serializers/turtle.py", line 208, in serialize
    if self.statement(subject) and not firstTime:
  File "/usr/local/lib/python2.7/dist-packages/rdflib/plugins/serializers/n3.py", line 92, in statement
    or super(N3Serializer, self).statement(subject))
  File "/usr/local/lib/python2.7/dist-packages/rdflib/plugins/serializers/turtle.py", line 269, in statement
    return self.s_squared(subject) or self.s_default(subject)
  File "/usr/local/lib/python2.7/dist-packages/rdflib/plugins/serializers/turtle.py", line 282, in s_squared
    self.predicateList(subject)
  File "/usr/local/lib/python2.7/dist-packages/rdflib/plugins/serializers/turtle.py", line 373, in predicateList
    self.objectList(properties[propList[0]])
  File "/usr/local/lib/python2.7/dist-packages/rdflib/plugins/serializers/turtle.py", line 388, in objectList
    self.path(objects[0], OBJECT)
  File "/usr/local/lib/python2.7/dist-packages/rdflib/plugins/serializers/n3.py", line 96, in path
    super(N3Serializer, self).path(node, position, newline)
  File "/usr/local/lib/python2.7/dist-packages/rdflib/plugins/serializers/turtle.py", line 288, in path
    or self.p_default(node, position, newline)):
  File "/usr/local/lib/python2.7/dist-packages/rdflib/plugins/serializers/turtle.py", line 294, in p_default
    self.write(self.label(node, position))
  File "/usr/local/lib/python2.7/dist-packages/rdflib/plugins/serializers/turtle.py", line 310, in label
    return self.getQName(node, position == VERB) or node.n3()
  File "/usr/local/lib/python2.7/dist-packages/rdflib/term.py", line 224, in n3
    raise Exception('"%s" does not look like a valid URI, I cannot serialize this as N3/Turtle. Perhaps you wanted to urlencode it?'%self)
Exception: "http://www.imdb.com/title/tt0091369//search/title?locations=West Wycombe Park, West Wycombe, Buckinghamshire, England, UK&ref_=tt_dt_dt" does not look like a valid URI, I cannot serialize this as N3/Turtle. Perhaps you wanted to urlencode it?

This is the full triple:

_:node81b1978fce492c4b779bdd9d709f9e7f <http://schema.org/Movie/url> <http://www.imdb.com/title/tt0091369//search/title?locations=West Wycombe Park, West Wycombe, Buckinghamshire, England, UK&ref_=tt_dt_dt> .

I could observe that the triple is correctly parsed but cannot be serialized:

from rdflib import Graph
t = "_:node81b1978fce492c4b779bdd9d709f9e7f <http://schema.org/Movie/url> <http://www.imdb.com/title/tt0091369//search/title?locations=West Wycombe Park, West Wycombe, Buckinghamshire, England, UK&ref_=tt_dt_dt> ."
g = Graph()
g.parse(data=t, format="n3")
for s, p, o in g:
    print s, p, o
g.serialize(format="n3")

Please, tell me if you need more details.

joernhees commented 10 years ago

Where did you get the file from? It seems broken as it contains a URI with spaces which need to be escaped.

If you have logging enabled like this:

import logging
logging.basicConfig(level=logging.INFO)

you'll see a warning printed in your parsing step:

WARNING:rdflib.term:http://www.imdb.com/title/tt0091369//search/title?locations=West Wycombe Park, West Wycombe, Buckinghamshire, England, UK&ref_=tt_dt_dt does not look like a valid URI, trying to serialize this will break.

The error message already points you to urlencode...

@gromgull by default the warning is hidden (No handlers could be found for logger "rdflib.term"), is that intentional?

petrux commented 10 years ago

@joernhees great, thanks a lot. So do you think I need to url-encode all the triples while serializing?

petrux commented 10 years ago

Sorry for coming back but... really, I cannot figure out how to deal with this problem. RDFLib parses an invalid URI and then crashes while serializing it. So should I pre-parse it and urlencode all the URIs?

gromgull commented 10 years ago

Your input file is broken. rdflib is lenient in what it accepts, but it will refuse to write incorrect RDF.

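It's not trivial to fix, since you don't know what the URL should really look like. For the one example URL here, you could try to urllib.quote the part after the last slash.

You can parse the file, then run something like:

import urllib  # Python 2; on Python 3 use urllib.parse.quote instead
import rdflib

def fix(term):
    # Only repair URIRefs that contain a space; leave literals and blank nodes alone.
    if not (isinstance(term, rdflib.URIRef) and ' ' in term):
        return term
    i = term.rindex('/')
    # Percent-encode everything after the last slash and wrap it in a URIRef again.
    return rdflib.URIRef(term[:i] + urllib.quote(term[i:]))

graph = rdflib.Graph()
graph.parse('your_broken_input.xxx', format='xxx') # fill in xxx
fixedgraph = rdflib.Graph()

fixedgraph += [(fix(s), p, fix(o)) for s, p, o in graph]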

(untested, but almost right :)
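For anyone reading this on Python 3, here is a minimal sketch of a gentler variant that works on plain strings. It assumes that percent-encoding only the characters that are not URI delimiters is an acceptable repair for your data, which it may well not be. The names percent_encode_unsafe and RESERVED_AND_PERCENT are made up for this example, and the result would still have to be wrapped in an rdflib URIRef before it goes into a graph.

```python
from urllib.parse import quote

# RFC 3986 reserved characters, plus "%" so that existing escapes such as %20 are kept.
# (Assumption: every "%" in the input already starts a valid escape.)
RESERVED_AND_PERCENT = ":/?#[]@!$&'()*+,;=%"

def percent_encode_unsafe(uri):
    """Percent-encode characters such as spaces, leaving URI delimiters untouched."""
    return quote(uri, safe=RESERVED_AND_PERCENT)

broken = ("http://www.imdb.com/title/tt0091369//search/title?locations="
          "West Wycombe Park, West Wycombe, Buckinghamshire, England, UK&ref_=tt_dt_dt")
fixed = percent_encode_unsafe(broken)
print(fixed)
```

Unlike quoting everything after the last slash, this leaves "?", "=" and "&" alone, so the query string keeps its structure. Whether that is what the publisher of the URL meant is still a guess.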

petrux commented 10 years ago

Hi @gromgull and thanks for your help. Let me explain: I'm trying to turn web search results into triples with a silly script which is just for didactic purposes. To retrieve semantic tags from web pages I use Any23. The broken triple is returned by this service when scanning the url http://www.imdb.com/title/tt0091369/. For the sake of precision, the query returning the broken triple is this one.

So, your suggestion is to parse all the triples, fix them and then create another graph. It sounds a bit expensive but I think it could work (as my application has no real ambition). Anyway, I think that the best solution would be to pre-parse the input and feed the Graph instance only with validated triples, do you agree?

Finally: I'll report to Any23 maintainers.

gromgull commented 10 years ago

@joernhees : "intentional" - by now you should know that nothing in rdflib is intentional - development is an organic unguided process, like evolution :)

Seriously though - RDFLib is a library, we probably have no business configuring loggers, tools like rdfpipe etc. should do though. We could probably reconsider and document WHAT loggers we actually log to though - rdflib.term seems to be a bit too specific. I'll make a new issue for this though!

petrux commented 10 years ago

@gromgull following this StackOverflow answer, I'm trying to put the Werkzeug url fix function in (and, contextually, trying to learn writing werkzeug).

Just to ask and for the sake of curiosity: would it be a good idea to "embed" a validation-and-fix mechanism in rdflib parsing?

gromgull commented 10 years ago

The fix really doesn't belong in rdflib - your input file is broken - I think by parsing it we've already been lenient enough :) Also we already have enough dependencies :)

Once #411 is merged, you could write a rdffixer that parses/fixes/serializes a stream of nquads or similar - which would be a neat separate tool!

(But don't hold your breath for #411 - it's years in the making already)

joernhees commented 10 years ago

@petrux the problem with automagically fixing broken encodings is that it's far from trivial and often hides the actual problem ;-/

petrux commented 10 years ago

@gromgull and @joernhees: thanks for your replies. Just to ask (again), which piece of code actually implements the N3 parsing? So that I can take inspiration from it to fix the broken input.

gromgull commented 10 years ago

The parser is here: https://github.com/RDFLib/rdflib/blob/master/rdflib/plugins/parsers/notation3.py but it's fine and needs no fixes.

The problem is that the IRI spec (https://tools.ietf.org/html/rfc3987) disallows spaces and a handful of other characters in IRIs; see also the Turtle grammar: http://www.w3.org/TR/turtle/#grammar-production-IRIREF
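The IRIREF production linked above excludes the characters U+0000 through U+0020 (which includes the space) as well as < > " { } | ^ ` and the backslash. A rough pre-check along those lines can be sketched with the standard library. Note that looks_like_valid_iri is a hypothetical helper for illustration, not rdflib's own check, and it only looks for forbidden characters; it does not verify the scheme or any other part of the IRI.

```python
import re

# Characters the Turtle IRIREF production does not allow between "<" and ">":
# U+0000 through U+0020 (so also the space) and < > " { } | ^ ` \
# Simplification: \uXXXX escapes are legal in Turtle, but any backslash is rejected here.
_FORBIDDEN = re.compile(r'[\x00-\x20<>"{}|^`\\]')

def looks_like_valid_iri(iri):
    """Return True if the string contains none of the forbidden characters."""
    return _FORBIDDEN.search(iri) is None

print(looks_like_valid_iri("https://allevents.in/santa rita"))
print(looks_like_valid_iri("https://allevents.in/santa%20rita"))
```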

petrux commented 10 years ago

@gromgull I don't want to fix it in any way, indeed. My idea is to rip off the parser logic, pre-parse the input, sanitize it and feed the Graph only with well-formed URIs. Thanks.

gromgull commented 10 years ago

OK! The n3 parser is not the cleanest code we have - so good luck! :)

petrux commented 10 years ago

D'oh!!! :-)

petrux commented 10 years ago

@gromgull I tried (kind of) your snippet

g = Graph()
tmp_g = Graph()
tmp_g.parse(data=..., format=...)
g +=  [(url_fix(s), url_fix(p), url_fix(o)) for s, p, o in tmp_g]

and had:

  File "/usr/local/lib/python2.7/dist-packages/rdflib/graph.py", line 559, in __iadd__
    self.addN((s, p, o, self) for s, p, o in other)
  File "/usr/local/lib/python2.7/dist-packages/rdflib/graph.py", line 402, in addN
    self.__store.addN((s, p, o, c) for s, p, o, c in quads
  File "/usr/local/lib/python2.7/dist-packages/rdflib/store.py", line 221, in addN
    for s, p, o, c in quads:
  File "/usr/local/lib/python2.7/dist-packages/rdflib/graph.py", line 405, in <genexpr>
    and _assertnode(s,p,o)
  File "/usr/local/lib/python2.7/dist-packages/rdflib/graph.py", line 1903, in _assertnode
    'Term %s must be an rdflib term' % (t,)
AssertionError: Term ff1e485ad8d8c4f4f9d092a9af2a3369cb1 must be an rdflib term

Any hint?

gromgull commented 10 years ago

Just as it says - the term passed in has to be a subclass of rdflib node - try creating a URIRef from the string before returning from the fix function.


petrux commented 10 years ago

I ended up with:

from rdflib import term

def url_fix(u):
    ...

def sanitize_triple(t):

    def sanitize_triple_item(item):
        if isinstance(item, term.URIRef):
            return term.URIRef(url_fix(str(item)))
        return item

    return (sanitize_triple_item(t[0]),
            sanitize_triple_item(t[1]),
            sanitize_triple_item(t[2]))

and everything seems to be OK. Anyway, I noticed that if I serialize the very same graph to XML (instead of N3), everything works fine. Why? Is this expected behavior? Thanks.

jpmccu commented 8 years ago

Here's a legit URL that isn't being recognized as one by RDFlib:

http://fonts.googleapis.com/css?family=Nunito|Open%20Sans:300,400,600,700&amp;subset=latin

From the BTC 2014 dataset.

legel commented 7 years ago

Just another "user error" to report: WARNING:rdflib.term:https://allevents.in/santa rita does not look like a valid URI, trying to serialize this will break.

Of course I have no control over the way developers encode their URIs (correctly or not), and the scope of correcting errors in semantic encoding is so vast as to warrant academic study (e.g. http://dl.acm.org/citation.cfm?id=2950981). But why not help everyone move in the right direction by at least attempting to auto-correct the simple and extremely common stuff like " " -> "%20"?

Or, if you agree this is out of the scope of this library, then how about we agree to silence this warning altogether, by moving this to the 'INFO' category of logging? Indeed, RDFLib is not meant for resolving and requesting data from URLs. That job is better suited for Requests, which indeed handles the following use case admirably:

import requests
html = requests.get(url="https://allevents.in/santa rita").content
print html

So I'm requesting we "put up or shut up". In the meantime people can silence this warning by just changing term.py line 208 from logger.warning(...) to logger.info(...).

joernhees commented 7 years ago

I object to auto-correcting such things, as in the long term it will introduce more errors than it solves.

Let's extend your example a bit...:

https://allevents.in/santa rita
https://allevents.in/santa%20rita?query= foo bar&bla  # only query part unescaped? is the & part of the query value or a new param?
https://allevents.in/santa+rita?query= foo bar  # other common " " replacement, should query part do it similar?
https://allevents.in/santa_rita?query= foo bar  # wikipedia " " replacement
http://example.com/jörn  # did you actually mean the IRI (UTF-8 'ö') or URI ('%C3%B6')?
...

I see that it is tempting to say "auto-correct the simple and extremely common stuff". However, we have to weigh this against providing a consistent, deterministic lib. I'm quite convinced that the way we handle this, namely expecting the developer to give us correct URIs, is the least problematic in the end.
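The ambiguity is easy to reproduce with the standard library alone. The sketch below (Python 3, and not something rdflib does) encodes the same input with two common conventions and ends up with two different, equally plausible URIs.

```python
from urllib.parse import quote, quote_plus

path = "santa rita"

# Two widespread conventions for the same space, giving two distinct URIs.
as_percent = "https://allevents.in/" + quote(path)
as_plus = "https://allevents.in/" + quote_plus(path)
print(as_percent)  # https://allevents.in/santa%20rita
print(as_plus)     # https://allevents.in/santa+rita

# A non-ASCII character can stay as it is (IRI) or become UTF-8 escapes (URI).
print(quote("jörn"))  # j%C3%B6rn
```

An auto-correcting parser would have to pick one of these silently, and two tools picking differently would no longer agree on the identity of the resource.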

Given that: warnings in early development are a good way to make a developer aware. If some developer uses invalid URIs as URIs, then they should definitely know about this. In production code without a configured logger that warning isn't shown, if I'm not mistaken. The other cases are:

- you have a logger configured (then please configure it as you like, the rdflib logging messages are in the rdflib namespace, if you want to silence them, do so)
- you don't have a logger configured:
  - if you're in interactive mode, then you're probably developing and should know
  - if you're not, then the warnings aren't shown
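For the configured-logger case, a minimal sketch using only the standard logging module: because the messages are emitted under the rdflib namespace (the warning quoted earlier comes from rdflib.term), raising the level on the parent logger hides them while leaving the rest of the application's logging alone. One caveat worth checking against your own setup: on Python 3 the logging module has a last-resort handler that prints warnings to stderr even when nothing is configured, so the "not shown" case may not hold there.

```python
import logging

# Show everything from the application itself...
logging.basicConfig(level=logging.INFO)

# ...but only errors from rdflib and its child loggers, e.g. rdflib.term.
logging.getLogger("rdflib").setLevel(logging.ERROR)

# The child logger inherits the level, so its warnings are now dropped.
term_logger = logging.getLogger("rdflib.term")
print(term_logger.isEnabledFor(logging.WARNING))
```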

jpmccu commented 7 years ago

If we're going to object to auto-correct, I'd also object to the absolutization of URIs. Relative URIs are actually very useful when processing and managing data, and should not be automatically changed.

niklasl commented 7 years ago

Since auto-correct and URI resolution are quite different (and applying the latter is in the specs), that ought to be a distinct issue/feature request?

(IIRC, RDFLib allows you to leave off a base URI and then won't tamper with relative URIs. Not sure if that's the case for all syntax processors though, and does go beyond what the RDF specs say.)

ReshmaDangol commented 7 years ago

Is there a way to skip the triples (while parsing) with such invalid URIs?

jpmccu commented 7 years ago

@niklasl leaving the base off isn't honored at all. It will almost always add on the current working directory.

RDFLib / rdflib

Invalid URI #412
