Closed ninniuz closed 10 years ago
In WikiApi.scala, we build the query to the Wikipedia API and insert the titles in the following way:
titleGroup.map(_.encodedWithNamespace).mkString("|")
encodedWithNamespace calls WikiUtil.wikiEncode, whose JavaDoc says:
The result is usable in most parts of a IRI. The ampersand '&' is not escaped though. Should only be used for canonical MediaWiki page names. Not for fragments, not for queries.
Notice the "not for queries". :-)
A simple fix would be to replace the '&' when we're building the query in WikiApi.scala:
titleGroup.map(_.encodedWithNamespace.replace("&", "%26")).mkString("|")
I think this should work, but I'm not sure.
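To make the failure mode concrete, here is a minimal sketch (the title strings are made-up examples, not actual output of `encodedWithNamespace`) of how a literal `&` truncates the `titles` parameter, and how the proposed replace avoids it:

```scala
// A title containing a literal ampersand (example value, not the
// exact output of encodedWithNamespace):
val titleGroup = Seq("AT&T", "Berlin")

// Unescaped: the '&' ends the titles parameter early, so the API only
// sees titles=AT and treats "T|Berlin" as part of a bogus next parameter.
val broken = "titles=" + titleGroup.mkString("|")

// With the suggested fix the ampersand survives as %26:
val fixed = "titles=" + titleGroup.map(_.replace("&", "%26")).mkString("|")
```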
Sorry for any formatting errors, I'm on my phone.
Thanks JC! As usual a very thorough analysis :) Gonna try and fix it.
Maybe we should just URL-encode the titles string, as we did in org.dbpedia.extraction.util.WikiApi#retrievePagesByNamespace? The normalization is correct, but the value of the titles query parameter is not, since special characters are not escaped.
I think this will lead to double escaping. For example, there are Wikipedia pages for the percent sign % and the quotation mark ". Their encodedWithNamespace titles are "%25" and "%22" respectively. When we URLEncoder.encode these strings, we get "%2525" and "%2522", so our request to the MediaWiki API looks like this and fails:
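The double escaping can be reproduced in two lines (a sketch, using the already-encoded title of the percent-sign page mentioned above):

```scala
import java.net.URLEncoder

// encodedWithNamespace has already percent-encoded the page title "%" as "%25".
val alreadyEncoded = "%25"

// Encoding it again escapes the leading '%' itself, yielding "%2525",
// which the server decodes back to "%25" - the wrong title.
val twice = URLEncoder.encode(alreadyEncoded, "UTF-8")
```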
I think we should only escape the ampersand '&'.
In retrievePagesByNamespace, it's OK to use URLEncoder. The fromPage parameter is only really used in the recursive call inside the method:
for(continuePage <- response \ "query-continue" \ "allpages" \ "@apfrom" headOption)
{
retrievePagesByNamespace(namespace, f, continuePage.text)
}
We get the continuePage from the XML that the API sends us, and that's not URL-escaped. For example the response of the API call
contains the line
<allpages apcontinue="%_operator"/>
The page title "%_operator" is not percent-escaped, only spaces are replaced by underscores.
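So passing the continue value through URLEncoder once is safe here (a sketch, using the apcontinue value from the response above):

```scala
import java.net.URLEncoder

// The API delivers the raw page title, not a percent-escaped one:
val continuePage = "%_operator"

// Encoding it once gives a valid query-parameter value; the server
// decodes "%25_operator" back to exactly "%_operator".
val apfromParam = URLEncoder.encode(continuePage, "UTF-8")
```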
http://mappings.dbpedia.org/server/extraction/en/extract?title=%22Weird_Al%22_Yankovic doesn't work anymore.
A bit more analysis... (I'm spending way too much time on this little problem...)
Old way:
t.encodedWithNamespace
Broken - didn't encode '&'. (Also, non-ASCII characters were not encoded, which seems to work but is against the HTTP specification.)
Current solution:
URLEncoder.encode(t.encodedWithNamespace, "UTF-8")
Broken - titles that contain percent-encoded characters don't work anymore because they are encoded twice.
Solution I suggested earlier:
t.encodedWithNamespace.replace("&", "%26")
Probably works, but non-ASCII characters are not encoded, which seems to work but is against the HTTP specification.
Maybe the following would be the best solution:
URLEncoder.encode(t.decodedWithNamespace.replace(' ', '_'), "UTF-8")
This first replaces spaces with underscores and then encodes most special ASCII characters and all non-ASCII characters. That's a bit more encoding than strictly necessary, but the Wikipedia web server will decode them anyway. Replacing spaces first is necessary because URLEncoder would otherwise replace them with '+', which is deprecated and would probably not be decoded properly by the Wikipedia server.
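A sketch of that approach on a few sample titles (the decoded title values below are assumptions for illustration, standing in for decodedWithNamespace output):

```scala
import java.net.URLEncoder

// Hypothetical decodedWithNamespace values: titles with spaces and no
// percent-encoding applied yet.
val decoded = Seq("Moët & Chandon", "%", "\"Weird Al\" Yankovic")

// Replace spaces first (URLEncoder would otherwise turn them into '+'),
// then percent-encode each title individually, and only then join with
// '|' so the separator itself stays literal, as the API expects:
val titlesParam = decoded
  .map(t => URLEncoder.encode(t.replace(' ', '_'), "UTF-8"))
  .mkString("|")
```

Encoding each title before joining is the important ordering: encoding the joined string would also escape the '|' separators.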
Thanks JC, I agree with you re
URLEncoder.encode(t.decodedWithNamespace.replace(' ', '_'), "UTF-8")
Please let me know if you want to submit a patch or I can change that.
Could you change it? Thanks!
Only one (invalid) triple is extracted from http://mappings.dbpedia.org/server/extraction/en/extract?title=Mo%C3%ABt+%26+Chandon&revid=&format=trix