dbpedia / extraction-framework

The software used to extract structured data from Wikipedia

Some homepages are not extracted by the MappingExtractor #123

Open ninniuz opened 10 years ago

ninniuz commented 10 years ago

There are cases in which the MappingExtractor cannot extract a homepage for a resource, even though the infobox property is mapped to foaf:homepage.

This happens because some Infoboxes require editors to insert URLs as plain strings, e.g. http://en.wikipedia.org/wiki/Template%3AInfobox_writer

Most of the time, foaf:homepage is extracted by the HomepageExtractor. There are cases, though, in which it is not (e.g. 350 Writer resources on Live DBpedia):

http://live.dbpedia.org/sparql?default-graph-uri=http%3A%2F%2Fdbpedia.org&query=SELECT+distinct+%3Fresource+%3Fprop_website+%3Fhomepage+WHERE+%7B+%3Fresource+a+%3Chttp%3A%2F%2Fdbpedia.org%2Fontology%2FWriter%3E+.+%3Fresource+%3Chttp%3A%2F%2Fdbpedia.org%2Fproperty%2Fwebsite%3E+%3Fprop_website+.+OPTIONAL+%7B+%3Fresource+foaf%3Ahomepage+%3Fhomepage+%7D+FILTER+%28%21BOUND%28%3Fhomepage%29%29+%7D&format=text%2Fhtml&timeout=0&debug=on
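For readability, the SPARQL query packed into the URL above can be recovered with Python's standard library. This is only a small decoding sketch; the variable names are illustrative, and the URL is the one quoted above, split across several string literals.

```python
from urllib.parse import urlsplit, parse_qs

# The Live DBpedia URL from the issue, split into pieces for readability.
url = (
    "http://live.dbpedia.org/sparql?default-graph-uri=http%3A%2F%2Fdbpedia.org"
    "&query=SELECT+distinct+%3Fresource+%3Fprop_website+%3Fhomepage+WHERE+%7B+"
    "%3Fresource+a+%3Chttp%3A%2F%2Fdbpedia.org%2Fontology%2FWriter%3E+.+"
    "%3Fresource+%3Chttp%3A%2F%2Fdbpedia.org%2Fproperty%2Fwebsite%3E+%3Fprop_website+.+"
    "OPTIONAL+%7B+%3Fresource+foaf%3Ahomepage+%3Fhomepage+%7D+"
    "FILTER+%28%21BOUND%28%3Fhomepage%29%29+%7D"
    "&format=text%2Fhtml&timeout=0&debug=on"
)

# parse_qs percent-decodes values and turns '+' into spaces.
params = parse_qs(urlsplit(url).query)
query = params["query"][0]
print(query)
```

The decoded query selects every dbpedia.org/ontology/Writer resource that has a dbpedia.org/property/website value but no foaf:homepage (the OPTIONAL pattern combined with FILTER (!BOUND(?homepage))).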

It should be possible to specify a transformation in the Infobox_writer mapping, e.g.

{{PropertyMapping | ontologyProperty = foaf:homepage | templateProperty = {{ExternalLink|website}} }}

which could retrieve the text content of the template property and try to cast it to an ExternalLink.

ninniuz commented 10 years ago

Could be implemented with #112

ninniuz commented 10 years ago

The explanation above is not correct: the homepages are not extracted because editors did not insert a strictly valid URL, i.e. the protocol is missing. The SimpleWikiParser treats text starting with http as an ExternalLink (as MediaWiki does). Since the values in these cases lack the protocol prefix, they are wrapped in TextNode(s), which the LinkParser then discards.
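The behaviour described in this comment can be modelled with a few lines of Python. This is a simplified sketch, not the framework's actual Scala code: `ExternalLinkNode`, `parse_value` and `link_parser` are hypothetical stand-ins for what the SimpleWikiParser and LinkParser are described as doing, and `www.example.org` is a made-up value.

```python
from dataclasses import dataclass
from typing import Optional, Union


@dataclass
class ExternalLinkNode:
    """Hypothetical stand-in for a parsed external link."""
    destination: str


@dataclass
class TextNode:
    """Hypothetical stand-in for plain text in the parse tree."""
    text: str


Node = Union[ExternalLinkNode, TextNode]


def parse_value(raw: str) -> Node:
    """Model of the parser rule: only text starting with 'http' becomes a link."""
    value = raw.strip()
    if value.startswith("http"):
        return ExternalLinkNode(value)
    return TextNode(value)


def link_parser(node: Node) -> Optional[str]:
    """Model of the link parser: accepts link nodes, discards everything else."""
    if isinstance(node, ExternalLinkNode):
        return node.destination
    return None


for raw in ["http://www.example.org", "www.example.org"]:
    print(raw, "->", link_parser(parse_value(raw)))
```

In this model the value without a protocol ends up as a `TextNode` and yields no link, which matches the missing foaf:homepage triples reported above.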

Solutions: