Closed petrux closed 10 years ago
Where did you get the file from? It seems broken as it contains a URI with spaces which need to be escaped.
if you have logging enabled like this:
import logging
logging.basicConfig(level=logging.INFO)
you'll see a warning printed in your parsing step:
WARNING:rdflib.term:http://www.imdb.com/title/tt0091369//search/title?locations=West Wycombe Park, West Wycombe, Buckinghamshire, England, UK&ref_=tt_dt_dt does not look like a valid URI, trying to serialize this will break.
The error message already points you to urlencode
...
@gromgull by default the warning is hidden (No handlers could be found for logger "rdflib.term"), is that intentional?
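For readers who want the warning visible without configuring the root logger, here is a minimal sketch using only the standard library. It assumes the logger name "rdflib.term" from the warning prefix above; the parent name "rdflib", the handler and the format string are illustrative choices, not anything rdflib sets up itself.

```python
import logging
import sys

# Handler that writes to stderr in the same LEVEL:name:message shape
# as the warning quoted above (the format string is an assumption).
handler = logging.StreamHandler(sys.stderr)
handler.setFormatter(logging.Formatter("%(levelname)s:%(name)s:%(message)s"))

# Attach it to the parent "rdflib" logger; records logged to
# "rdflib.term" propagate up to it by default.
rdflib_logger = logging.getLogger("rdflib")
rdflib_logger.addHandler(handler)
rdflib_logger.setLevel(logging.WARNING)
```

Attaching to the parent rather than to "rdflib.term" itself means any other logger under the same prefix would be shown too.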
@joernhees great, thanks a lot. So do you think I need to url-encode all the triples while serializing?
Sorry for coming back but... really, I cannot figure out how to deal with this problem. RDFLib parses an invalid URI and then crashes while serializing it. So should I pre-parse it and urlencode all the URIs?
Your input file is broken, rdflib is being lenient in what it accepts, but will refuse to write incorrect RDF.
It's not trivial to fix, since you don't know what the URL should really look like. For the one example URL here you could try to urllib.quote the part after the last slash, i.e. you can parse the file, then run something like:
import urllib
import rdflib

def fix(s):
    # percent-encode everything from the last slash onwards (Python 2 urllib.quote)
    i = s.rindex('/')
    return s[:i]+urllib.quote(s[i:])

graph = rdflib.Graph()
graph.parse('your_broken_input.xxx', format='xxx') # fill in xxx
fixedgraph = rdflib.Graph()
fixedgraph += [ (fix(s) if ' ' in s else s, p, fix(o) if ' ' in o else o) for s,p,o in graph ]
(untested, but almost right :)
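On Python 3, urllib.quote lives at urllib.parse.quote. The string-fixing step might look roughly like the sketch below, shown on the IMDb URL from the warning. The function name and the default set of `safe` characters are assumptions made for illustration. Whether this yields the URL the publisher intended is a guess.

```python
from urllib.parse import quote

def quote_after_last_slash(uri, safe="/?&=,"):
    """Percent-encode the part of `uri` from its last slash onwards.

    `safe` lists characters left untouched; this default is an
    illustrative assumption, not something prescribed in the thread.
    """
    i = uri.rindex("/")
    return uri[:i] + quote(uri[i:], safe=safe)

broken = ("http://www.imdb.com/title/tt0091369//search/title?locations="
          "West Wycombe Park, West Wycombe, Buckinghamshire, England, UK"
          "&ref_=tt_dt_dt")
fixed = quote_after_last_slash(broken)
print(fixed)
```

The result is a plain string, so it would presumably still need to be wrapped in rdflib.URIRef before being added to a graph.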
Hi @gromgull and thanks for your help. Let me explain: I'm trying to turn web search results into triples with a stupid script which is just for didactic purposes. To retrieve semantic tags from web pages I use Any23. The broken triple is returned by this service when scanning the URL http://www.imdb.com/title/tt0091369/. For the sake of precision, the query returning the broken triple is this one.
So, your suggestion is to parse all the triples, fix them and then create another graph. It sounds a bit expensive but I think it could work (as my application has no real ambition). Anyway, I think that the best solution would be to pre-parse the input and feed the Graph instance only with validated triples, do you agree?
Finally: I'll report to Any23 maintainers.
@joernhees : "intentional" - by now you should know that nothing in rdflib is intentional - development is an organic unguided process, like evolution :)
Seriously though - RDFLib is a library, we probably have no business configuring loggers, tools like rdfpipe etc. should do though. We could probably reconsider and document WHAT loggers we actually log to though - rdflib.term seems to be a bit too specific. I'll make a new issue for this though!
The fix really doesn't belong in rdflib - your input file is broken - I think by parsing it we've already been lenient enough :) Also we already have enough dependencies :)
Once #411 is merged, you could write a rdffixer that parses/fixes/serializes a stream of nquads or similar - which would be a neat separate tool!
(But don't hold your breath for #411 - it's years in the making already)
@petrux the problem with automagically fixing broken encodings is that it's far from trivial and often hiding the actual problem ;-/
@gromgull and @joernhees: thanks for your replies. Just to ask (again), which piece of code actually implements the N3 parsing? So that I can take inspiration from it to fix the broken input.
The parser is here: https://github.com/RDFLib/rdflib/blob/master/rdflib/plugins/parsers/notation3.py but it's fine and needs no fixes.
The problem is that the IRI spec (https://tools.ietf.org/html/rfc3987) disallows spaces and a handful of other characters in IRIs; see the Turtle grammar: http://www.w3.org/TR/turtle/#grammar-production-IRIREF
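To see which characters trip the check, here is a small sketch based on the character class excluded by the Turtle IRIREF production linked above. It is a simplification: it ignores the \u escapes that the grammar also permits, and the function names are made up for this example rather than taken from rdflib.

```python
import re

# Characters excluded by the Turtle IRIREF production:
# U+0000 through U+0020 (control characters and space), and < > " { } | ^ ` \
_FORBIDDEN = re.compile(r'[\x00-\x20<>"{}|^`\\]')

def forbidden_chars(iri):
    """Return the sorted, de-duplicated forbidden characters found in `iri`."""
    return sorted(set(_FORBIDDEN.findall(iri)))

def looks_like_valid_iri(iri):
    """True if `iri` contains none of the excluded characters."""
    return not forbidden_chars(iri)

print(forbidden_chars(
    "http://www.imdb.com/title/tt0091369//search/title?locations=West Wycombe Park"))
print(looks_like_valid_iri("http://www.imdb.com/title/tt0091369/"))
```

Note that this only flags characters; it does not validate the overall IRI syntax defined in RFC 3987.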
@gromgull I don't want to fix it in any way, indeed. My idea is to rip off the parser logic, pre-parse the input, sanitize it and feed the Graph only with well-formed URIs. Thanks.
OK! The n3 parser is not the cleanest code we have - so good luck! :)
@gromgull I tried (kind of) your snippet
g = Graph()
tmp_g = Graph()
tmp_g.parse(data=..., format=...)
g += [(url_fix(s), url_fix(p), url_fix(o)) for s, p, o in tmp_g]
and had:
  File "/usr/local/lib/python2.7/dist-packages/rdflib/graph.py", line 559, in __iadd__
    self.addN((s, p, o, self) for s, p, o in other)
  File "/usr/local/lib/python2.7/dist-packages/rdflib/graph.py", line 402, in addN
    self.__store.addN((s, p, o, c) for s, p, o, c in quads
  File "/usr/local/lib/python2.7/dist-packages/rdflib/store.py", line 221, in addN
    for s, p, o, c in quads:
  File "/usr/local/lib/python2.7/dist-packages/rdflib/graph.py", line 405, in <genexpr>
    and _assertnode(s,p,o)
  File "/usr/local/lib/python2.7/dist-packages/rdflib/graph.py", line 1903, in _assertnode
    'Term %s must be an rdflib term' % (t,)
AssertionError: Term ff1e485ad8d8c4f4f9d092a9af2a3369cb1 must be an rdflib term
Any hint?
Just as it says - the term passed in has to be an instance of an rdflib node class - try creating a URIRef from the string before returning from the fix
I ended up with:
from rdflib import term

def url_fix(u):
    ...

def sanitize_triple(t):
    def sanitize_triple_item(item):
        # only URIRefs are rewritten; literals and blank nodes pass through
        if isinstance(item, term.URIRef):
            return term.URIRef(url_fix(str(item)))
        return item
    return (sanitize_triple_item(t[0]),
            sanitize_triple_item(t[1]),
            sanitize_triple_item(t[2]))
and everything seems to be OK. Anyway, I noticed that if I serialize the very same graph to XML (instead of N3), everything works fine. Why? Is this expected behavior? Thanks.
Here's a legit URL that isn't being recognized as one by RDFlib:
http://fonts.googleapis.com/css?family=Nunito|Open%20Sans:300,400,600,700&subset=latin
From the BTC 2014 dataset.
Just another "user error" to report:
WARNING:rdflib.term:https://allevents.in/santa rita does not look like a valid URI, trying to serialize this will break.
Of course I have no control over the way developers encode their URIs (correctly or not), and the scope of correcting errors in semantic encoding is so vast as to warrant academic study (e.g. http://dl.acm.org/citation.cfm?id=2950981). But why not help everyone move in the right direction by at least attempting to auto-correct the simple and extremely common stuff like " " -> "%20"?
Or, if you agree this is out of the scope of this library, then how about we agree to silence this warning altogether, by moving this to the 'INFO' category of logging? Indeed, RDFLib is not meant for resolving and requesting data from URLs. That job is better suited for Requests, which indeed handles the following use case admirably:
import requests
html = requests.get(url="https://allevents.in/santa rita").content
print html
So I'm requesting we "put up or shut up". In the meantime people can silence this warning by just changing term.py line 208 from logger.warning(...) to logger.info(...).
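An alternative that avoids editing the installed library is to raise the level on that one logger from application code. This is a minimal sketch, assuming the logger name "rdflib.term" shown in the warning prefix; the variable name is arbitrary.

```python
import logging

# "rdflib.term" is the logger name shown in the warning's prefix.
term_logger = logging.getLogger("rdflib.term")

# Only ERROR and above pass; the WARNING about invalid URIs is dropped.
term_logger.setLevel(logging.ERROR)
```

This hides every warning from that logger, not only the invalid-URI one, so it is a blunt instrument.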
I object to auto-correcting such things, as in the long term it will introduce more errors than it solves.
Let's extend your example a bit...:
https://allevents.in/santa rita
https://allevents.in/santa%20rita?query= foo bar&bla # only query part unescaped? is the & part of the query value or a new param?
https://allevents.in/santa+rita?query= foo bar # other common " " replacement, should query part do it similar?
https://allevents.in/santa_rita?query= foo bar # wikipedia " " replacement
http://example.com/jörn # did you actually mean the IRI (UTF-8 'ö') or URI ('%C3%B6')?
...
I see that it is tempting to say "auto-correct the simple and extremely common stuff". However, we have to weigh this against providing a consistent, deterministic lib. I'm quite convinced that the way we handle this, namely expecting the developer to give us correct URIs, is the least problematic in the end.
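The ambiguity can be made concrete with the standard library alone. The sketch below produces three plausible "corrections" of the same input, plus the percent-encoded form of a non-ASCII path segment; the dictionary keys are labels invented for this example.

```python
from urllib.parse import quote, quote_plus

raw = "santa rita"

# Three common conventions for replacing a space, all different.
candidates = {
    "percent-encoded": quote(raw),        # santa%20rita
    "plus-encoded": quote_plus(raw),      # santa+rita
    "underscore": raw.replace(" ", "_"),  # santa_rita
}
for label, value in candidates.items():
    print(label, "https://allevents.in/" + value)

# A non-ASCII segment: keep it as an IRI, or percent-encode its UTF-8 bytes?
print(quote("jörn"))  # j%C3%B6rn
```

Each candidate names a different resource, and nothing in the broken input says which one was meant.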
Given that: warnings in early development are a good way to make a developer aware. If some developer uses invalid URIs as URIs, then they should definitely know about this. In production code without a configured logger that warning isn't shown, if I'm not mistaken. The other cases are:
If we're going to object to auto-correct, I'd also object to the absolutization of URIs. Relative URIs are actually very useful when processing and managing data, and should not be automatically changed.
Since auto-correct and URI resolution are quite different (and applying the latter is in the specs), shouldn't that be a distinct issue/feature request?
(IIRC, RDFLib allows you to leave off a base URI and then won't tamper with relative URIs. Not sure if that's the case for all syntax processors though, and does go beyond what the RDF specs say.)
Is there a way to skip the triples (while parsing) with such invalid URIs?
@niklasl leaving the base off isn't honored at all. It will almost always add on the current working directory.
I got this:
This is the full triple:
I could observe that the triple is correctly parsed but cannot be serialized:
Please, tell me if you need more details.