bbcarchdev / anansi

A Linked Open Data Web crawler
https://bbcarchdev.github.io/anansi/
Apache License 2.0

URI parsing failures #62

Open cgueret opened 7 years ago

cgueret commented 7 years ago

The URIs below all fail to parse. These are the errors generated when adding dbpedia:Cardiff to the queue:

crawld[1]: %ANANSI-E-2001: failed to parse URI <http://dbpedia.org/resource/Cardiff_Blues_vs_Leicester_Tigers_(2008–09_Heineken_Cup)>
crawld[1]: %ANANSI-E-2001: failed to parse URI <http://ja.dbpedia.org/resource/カーディフ>
crawld[1]: %ANANSI-E-2001: failed to parse URI <http://ko.dbpedia.org/resource/카디프>
crawld[1]: %ANANSI-E-2001: failed to parse URI <http://el.dbpedia.org/resource/Κάρντιφ>
crawld[1]: %ANANSI-E-2001: failed to parse URI <http://dbpedia.org/resource/Tŷ_Pont_Haearn>
crawld[1]: %ANANSI-E-2001: failed to parse URI <http://dbpedia.org/resource/6/6/00_–_Cardiff,_Wales>

Tracked as RESDATA-1269

cgueret commented 7 years ago

One way to fix this would be to switch to a library that already supports Unicode strings, for example uriparser (http://uriparser.sourceforge.net/#features). The other would be to extend the current custom code to add UTF support.

nevali commented 7 years ago

We already use uriparser (it's incorporated into liburi). And indeed the first and last ones of those are pure-ASCII:

$ ./util/uriparse 'http://dbpedia.org/resource/Cardiff_Blues_vs_Leicester_Tigers_(2008–09_Heineken_Cup)'
scheme="http"
auth=''
host="dbpedia.org"
port=''
path="/resource/Cardiff_Blues_vs_Leicester_Tigers_(2008%E2%80%9309_Heineken_Cup)"
query=''
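
The percent-encoded path above can be reproduced without liburi. The following is a minimal Python sketch, assuming the URI is UTF-8 and that "/", "(", ")" and "," are left unescaped as in the output above. The encode_path helper is hypothetical and not part of Anansi or liburi.

```python
from urllib.parse import quote, urlsplit


def encode_path(uri: str) -> str:
    """Return the path of `uri` with non-ASCII characters percent-encoded as UTF-8."""
    path = urlsplit(uri).path
    # Leave "/", "(", ")" and "," literal; every other character outside the
    # unreserved set is replaced by %XX escapes of its UTF-8 bytes.
    return quote(path, safe="/(),", encoding="utf-8")


# \u2013 is the en dash that appears in "2008–09" in the failing URI.
uri = (
    "http://dbpedia.org/resource/"
    "Cardiff_Blues_vs_Leicester_Tigers_(2008\u201309_Heineken_Cup)"
)
print(encode_path(uri))
```

The %E2%80%93 sequence in the result is the UTF-8 encoding of the en dash (U+2013) in "2008–09".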

nevali commented 7 years ago

Identify the source of error 2001 in Anansi and trace it back to see why those URIs aren't being parsed.

nevali commented 6 years ago

This appears to be triggered by locale-dependent conversions. bbcarchdev/liburi#2 adds uri_create_ustr(), which we should use in place of uri_create_str().
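
As a rough illustration of why a locale-dependent conversion could reject these URIs, the Python sketch below uses a strict ASCII decode as a stand-in for a conversion performed under a C/POSIX locale. This is an analogy only, under that assumption. It does not call liburi, and the survives_ascii_conversion helper is hypothetical. The actual behaviour of uri_create_str() and uri_create_ustr() should be checked against the liburi source.

```python
# The six URIs from the error log; \u2013 is the en dash in the first and last.
uris = [
    "http://dbpedia.org/resource/Cardiff_Blues_vs_Leicester_Tigers_(2008\u201309_Heineken_Cup)",
    "http://ja.dbpedia.org/resource/カーディフ",
    "http://ko.dbpedia.org/resource/카디프",
    "http://el.dbpedia.org/resource/Κάρντιφ",
    "http://dbpedia.org/resource/Tŷ_Pont_Haearn",
    "http://dbpedia.org/resource/6/6/00_\u2013_Cardiff,_Wales",
]


def survives_ascii_conversion(uri: str) -> bool:
    """Return True if the UTF-8 bytes of `uri` decode cleanly as ASCII.

    The ASCII decode stands in for a conversion whose result depends on the
    process locale (an assumption made for illustration).
    """
    raw = uri.encode("utf-8")
    try:
        raw.decode("ascii")
    except UnicodeDecodeError:
        return False
    return True


failing = [u for u in uris if not survives_ascii_conversion(u)]
print(f"{len(failing)} of {len(uris)} URIs contain bytes outside ASCII")
```

Every URI in the log contains at least one non-ASCII character, including the en dash in the first and last entries, so a conversion that only accepts ASCII would reject all six. Handling the input as raw bytes avoids any dependence on the locale.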