Open cgueret opened 7 years ago
One way to fix it would be to switch to a library that already supports Unicode strings, for example http://uriparser.sourceforge.net/#features. The other way is to extend the current custom code to add support for UTF-8.
We already use uriparser (it's incorporated into liburi). And indeed the first and last ones of those are pure-ASCII:
$ ./util/uriparse 'http://dbpedia.org/resource/Cardiff_Blues_vs_Leicester_Tigers_(2008–09_Heineken_Cup)'
scheme="http"
auth=''
host="dbpedia.org"
port=''
path="/resource/Cardiff_Blues_vs_Leicester_Tigers_(2008%E2%80%9309_Heineken_Cup)"
query=''
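For reference, the en dash (U+2013) in that path is encoded in UTF-8 as the three bytes E2 80 93, which is where the `%E2%80%93` in the output comes from. A minimal sketch of that byte-wise escaping follows, assuming the input is already valid UTF-8. The helper name `percent_encode_non_ascii` is hypothetical and is not part of liburi or uriparser; it only escapes bytes outside the ASCII range and leaves everything else (including the parentheses) untouched. Because it works on raw bytes, it does not depend on the process locale.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical helper for illustration only; NOT part of liburi or uriparser.
 * Copies src to dst, replacing every byte >= 0x80 with %XX (uppercase hex).
 * ASCII bytes are copied unchanged, so reserved characters are not escaped.
 * Returns 0 on success, or -1 if dst (of size dstlen) is too small.
 */
static int
percent_encode_non_ascii(const char *src, char *dst, size_t dstlen)
{
	static const char hex[] = "0123456789ABCDEF";
	size_t o = 0;

	if(dstlen == 0)
	{
		return -1;
	}
	for(; *src; src++)
	{
		/* Work on unsigned bytes so the result never depends on
		 * whether plain char is signed on this platform.
		 */
		unsigned char c = (unsigned char) *src;

		if(c < 0x80)
		{
			/* One output byte, plus room for the terminator */
			if(o + 1 >= dstlen)
			{
				return -1;
			}
			dst[o++] = (char) c;
		}
		else
		{
			/* Three output bytes, plus room for the terminator */
			if(o + 3 >= dstlen)
			{
				return -1;
			}
			dst[o++] = '%';
			dst[o++] = hex[c >> 4];
			dst[o++] = hex[c & 0x0F];
		}
	}
	dst[o] = '\0';
	return 0;
}
```

Passing the bytes `2008`, E2 80 93, `09` through this helper yields `2008%E2%80%9309`, matching the path component shown above.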
Identify the source of error 2001 in Anansi and trace it back to see why those URIs aren't being parsed.
This appears to be triggered by locale-dependent conversions; bbcarchdev/liburi#2 adds uri_create_ustr(), which we should use in place of uri_create_str().
They all result in a parsing error. These are the errors generated when adding dbpedia:Cardiff to the queue:
Tracked as RESDATA-1269