dbpedia / extraction-framework

The software used to extract structured data from Wikipedia

DBpedia mappings server cannot extract triples when topic name contains "&" #151

Closed ninniuz closed 10 years ago

ninniuz commented 10 years ago

Only one (invalid) triple is extracted from http://mappings.dbpedia.org/server/extraction/en/extract?title=Mo%C3%ABt+%26+Chandon&revid=&format=trix

jcsahnwaldt commented 10 years ago

In WikiApi.scala, we build the query to the Wikipedia API and insert the titles in the following way:

https://github.com/dbpedia/extraction-framework/blob/master/core/src/main/scala/org/dbpedia/extraction/util/WikiApi.scala

titleGroup.map(_.encodedWithNamespace).mkString("|")

encodedWithNamespace calls WikiUtil.wikiEncode, whose JavaDoc says:

https://github.com/dbpedia/extraction-framework/blob/master/core/src/main/scala/org/dbpedia/extraction/util/WikiUtil.scala

The result is usable in most parts of an IRI. The ampersand '&' is not escaped though. Should only be used for canonical MediaWiki page names. Not for fragments, not for queries.

Notice the "not for queries". :-)

A simple fix would be to replace the '&' when we're building the query in WikiApi.scala:

titleGroup.map(_.encodedWithNamespace.replace("&", "%26")).mkString("|")

I think this should work, but I'm not sure.

Sorry for any formatting errors, I'm on my phone.
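To make the effect of that replace concrete, here is a minimal sketch in plain Java. It uses literal title strings in place of the framework's titleGroup / encodedWithNamespace values (per the JavaDoc above, wikiEncode leaves '&' and non-ASCII characters unescaped), so the strings here are assumptions, not actual framework output:

```java
import java.util.List;
import java.util.stream.Collectors;

public class TitlesParam {
    public static void main(String[] args) {
        // Titles as encodedWithNamespace would produce them:
        // spaces already replaced by underscores, '&' left unescaped
        List<String> titles = List.of("Moët_&_Chandon", "AT&T");

        // Old, broken behavior: the raw '&' ends up in the query string,
        // where the MediaWiki API reads it as a parameter separator
        String broken = String.join("|", titles);

        // Suggested fix: escape only the ampersand before joining
        String fixed = titles.stream()
                .map(t -> t.replace("&", "%26"))
                .collect(Collectors.joining("|"));

        System.out.println(broken); // Moët_&_Chandon|AT&T
        System.out.println(fixed);  // Moët_%26_Chandon|AT%26T
    }
}
```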

ninniuz commented 10 years ago

Thanks JC! As usual a very thorough analysis :) Gonna try and fix it.

ninniuz commented 10 years ago

Maybe we should just URL-encode the titles string, as we did in org.dbpedia.extraction.util.WikiApi#retrievePagesByNamespace? The normalization is correct, but the value of the titles query parameter is not, since special characters are not escaped.

jcsahnwaldt commented 10 years ago

I think this will lead to double escaping. For example, there are Wikipedia pages for the percent sign % and the quotation mark ". Their encodedWithNamespace titles are "%25" and "%22" respectively. When we URLEncoder.encode these strings, we get "%2525" and "%2522", so our request to the MediaWiki API looks like this and fails:

https://en.wikipedia.org/w/api.php?action=query&format=xml&prop=revisions&titles=%2525&rvprop=ids|content|timestamp|user|userid

I think we should only escape the ampersand '&'.
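The double escaping is easy to check directly with java.net.URLEncoder, the same class behind the proposed fix; a quick sketch:

```java
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;

public class DoubleEscape {
    public static void main(String[] args) {
        // encodedWithNamespace titles for the pages "%" and "\""
        String percent = "%25";
        String quote = "%22";

        // URL-encoding an already percent-encoded title
        // encodes the leading '%' a second time
        System.out.println(URLEncoder.encode(percent, StandardCharsets.UTF_8)); // %2525
        System.out.println(URLEncoder.encode(quote, StandardCharsets.UTF_8));   // %2522
    }
}
```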

jcsahnwaldt commented 10 years ago

In retrievePagesByNamespace, it's OK to use URLEncoder. The fromPage parameter is only really used in the recursive call inside the method:

for(continuePage <- response \ "query-continue" \ "allpages" \ "@apfrom" headOption)
{
  retrievePagesByNamespace(namespace, f, continuePage.text)
}

We get the continuePage from the XML that the API sends us, and that's not URL-escaped. For example the response of the API call

https://en.wikipedia.org/w/api.php?action=query&format=xml&list=allpages&apfrom=%25&aplimit=10&apnamespace=0

contains the line

<allpages apcontinue="%_operator"/>

The page title "%_operator" is not percent-escaped, only spaces are replaced by underscores.
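Because the value coming out of the XML is unescaped, a single round of URLEncoder is exactly right there; a sketch of what it produces for that continuation title:

```java
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;

public class ContinueParam {
    public static void main(String[] args) {
        // Title as it appears in the API's XML response:
        // not percent-escaped, spaces already underscores
        String continuePage = "%_operator";

        // One round of URL encoding yields a valid parameter value
        // for the next allpages request
        System.out.println(URLEncoder.encode(continuePage, StandardCharsets.UTF_8)); // %25_operator
    }
}
```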

jcsahnwaldt commented 10 years ago

http://mappings.dbpedia.org/server/extraction/en/extract?title=%22Weird_Al%22_Yankovic doesn't work anymore.

jcsahnwaldt commented 10 years ago

A bit more analysis... (I'm spending way too much time on this little problem...)

Old way:

t.encodedWithNamespace

Broken - didn't encode '&'. (Also, non-ASCII characters were not encoded, which seems to work but is against the HTTP specification.)

Current solution:

URLEncoder.encode(t.encodedWithNamespace, "UTF-8")

Broken - titles that contain percent-encoded characters don't work anymore because they are encoded twice.

Solution I suggested earlier:

t.encodedWithNamespace.replace("&", "%26")

Probably works, but non-ASCII characters are not encoded, which seems to work but is against the HTTP specification.

Maybe the following would be the best solution:

URLEncoder.encode(t.decodedWithNamespace.replace(' ', '_'), "UTF-8")

This first replaces spaces by underscores and then encodes most ASCII characters and all non-ASCII characters. That's a bit more encoding than strictly necessary, but the Wikipedia web server will decode them anyway. Replacing the spaces first is necessary because URLEncoder would replace them with '+', which is deprecated and would probably not be decoded properly by the Wikipedia server.
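A sketch of what that expression produces for the titles from this thread. The helper below is a hypothetical stand-in for the suggested one-liner; decodedWithNamespace is the framework's unescaped title (with spaces), so plain literals are used here:

```java
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;

public class EncodeTitle {
    // Hypothetical stand-in for the suggested expression in WikiApi.scala
    static String encodeTitle(String decodedWithNamespace) {
        // Replace spaces first: URLEncoder would otherwise turn them into '+',
        // which the Wikipedia server may not decode as intended
        return URLEncoder.encode(decodedWithNamespace.replace(' ', '_'),
                StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // '&' and the non-ASCII 'ë' are both percent-encoded
        System.out.println(encodeTitle("Moët & Chandon"));        // Mo%C3%ABt_%26_Chandon
        // quotation marks are encoded once, with no double escaping
        System.out.println(encodeTitle("\"Weird Al\" Yankovic")); // %22Weird_Al%22_Yankovic
    }
}
```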

ninniuz commented 10 years ago

Thanks JC, I agree with you re

URLEncoder.encode(t.decodedWithNamespace.replace(' ', '_'), "UTF-8")

Please let me know if you want to submit a patch or I can change that.

jcsahnwaldt commented 10 years ago

Could you change it? Thanks!