gaurav / extraction-framework

The software used to extract structured data from Wikipedia

Add an extractor for ImageAnnotations #31

Closed gaurav closed 10 years ago

gaurav commented 10 years ago

ImageAnnotator gadget help page: https://commons.wikimedia.org/wiki/Help:Gadget-ImageAnnotator

Example image: https://commons.wikimedia.org/wiki/File:Spelterini_Bl%C3%BCemlisalp.jpg

We can use a PageNodeExtractor to find the appropriate templates, then take the nodes in between them and convert those nodes to text.

We can provide both a StringParser (https://github.com/dbpedia/extraction-framework/blob/ce8339360355c9b3fe7c8f803e38ebb016fcd79b/core/src/main/scala/org/dbpedia/extraction/dataparser/StringParser.scala) representation and a raw WikiText representation. The raw WikiText could be run through Commons' MediaWiki API if somebody needs it translated into HTML.
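A rough sketch of the idea, to make the "nodes in between" part concrete. It does not use the framework's real classes: `Node`, `TextNode`, `LinkNode`, `TemplateNode`, `Annotation` and `extract` are all hypothetical stand-ins made up for illustration, and the template names `ImageNote` / `ImageNoteEnd` are an assumption about what the gadget writes into the page. The plain-text conversion here is a naive placeholder, not StringParser.

```scala
object ImageAnnotationSketch {
  // Hypothetical, simplified stand-ins for parsed page nodes (NOT the framework's classes).
  sealed trait Node {
    def toWikiText: String  // raw markup, suitable for sending to a renderer
    def toPlainText: String // markup stripped, naive stand-in for a string parser
  }

  final case class TextNode(text: String) extends Node {
    def toWikiText: String = text
    def toPlainText: String = text
  }

  final case class LinkNode(target: String, label: String) extends Node {
    def toWikiText: String = s"[[$target|$label]]"
    def toPlainText: String = label
  }

  final case class TemplateNode(name: String, args: Map[String, String] = Map.empty) extends Node {
    def toWikiText: String =
      (name +: args.toSeq.sortBy(_._1).map { case (k, v) => s"$k=$v" }).mkString("{{", "|", "}}")
    def toPlainText: String = "" // templates contribute nothing to the plain text
  }

  // One annotation, kept in both representations.
  final case class Annotation(id: Option[String], plainText: String, wikiText: String)

  // Assumed names of the opening and closing templates.
  val StartTemplate = "ImageNote"
  val EndTemplate = "ImageNoteEnd"

  // Walk the node list; for each start template, collect nodes up to the next end template.
  def extract(nodes: List[Node]): List[Annotation] = nodes match {
    case TemplateNode(StartTemplate, args) :: rest =>
      val (body, remainder) = rest.span {
        case TemplateNode(EndTemplate, _) => false
        case _                            => true
      }
      remainder match {
        case _ :: afterEnd =>
          val annotation = Annotation(
            args.get("id"),
            body.map(_.toPlainText).mkString.trim,
            body.map(_.toWikiText).mkString.trim
          )
          annotation :: extract(afterEnd)
        case Nil => Nil // start template without a matching end: skipped in this sketch
      }
    case _ :: rest => extract(rest)
    case Nil       => Nil
  }
}
```

Calling `ImageAnnotationSketch.extract` on a node list that contains one start template, some text and a link, and one end template yields a single `Annotation` holding the stripped text and the untouched markup side by side. Whitespace around the note body is handled with a plain `trim` on the collected nodes, with no pattern matching over the page source.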

gaurav commented 10 years ago

We should use the PageNode. A WikiPage could be used instead, but the regex would become very complicated once it has to handle spacing and similar variations in the markup. Working from the PageNode should be fairly straightforward.

gaurav commented 10 years ago