Add an extractor for ImageAnnotations

gaurav commented 10 years ago

Gadget help page: https://commons.wikimedia.org/wiki/Help:Gadget-ImageAnnotator

Example image: https://commons.wikimedia.org/wiki/File:Spelterini_Bl%C3%BCemlisalp.jpg

We can use a PageNodeExtractor to find the templates that open and close each annotation, then take the nodes in between and convert them to text.
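A minimal sketch of that idea, not tied to the framework yet. It assumes the annotations are delimited by `{{ImageNote}}` and `{{ImageNoteEnd}}` templates (my reading of the gadget's help page, worth double-checking). It uses simplified stand-in node classes instead of the real ones in `org.dbpedia.extraction.wikiparser`. `Annotation` and `extractAnnotations` are made-up names for illustration:

```scala
// Simplified stand-ins for the framework's node types (hypothetical; the real
// classes carry more fields, such as line numbers and child nodes).
sealed trait Node
final case class TextNode(text: String) extends Node
final case class TemplateNode(title: String, args: Map[String, String]) extends Node

// One extracted annotation: the opening template's arguments plus the enclosed text.
final case class Annotation(params: Map[String, String], text: String)

object ImageAnnotationSketch {
  // Assumed template names; verify against the ImageAnnotator help page.
  val StartTemplate = "ImageNote"
  val EndTemplate = "ImageNoteEnd"

  /** Walks a flat list of page children and collects the text found
    * between each start template and the following end template. */
  def extractAnnotations(children: List[Node]): List[Annotation] = {
    @scala.annotation.tailrec
    def loop(
        rest: List[Node],
        open: Option[(Map[String, String], List[String])],
        acc: List[Annotation]
    ): List[Annotation] = rest match {
      // End of page: an annotation that was never closed is dropped.
      case Nil => acc.reverse
      // Start template: begin collecting text for a new annotation.
      case TemplateNode(StartTemplate, args) :: tail =>
        loop(tail, Some((args, Nil)), acc)
      // End template: close the open annotation, if there is one.
      case TemplateNode(EndTemplate, _) :: tail =>
        val closed = open.map { case (args, parts) =>
          Annotation(args, parts.reverse.mkString.trim)
        }
        loop(tail, None, closed.toList ::: acc)
      // Text is kept only while an annotation is open.
      case TextNode(text) :: tail =>
        loop(tail, open.map { case (args, parts) => (args, text :: parts) }, acc)
      // Any other template is ignored in this sketch.
      case _ :: tail =>
        loop(tail, open, acc)
    }
    loop(children, None, Nil)
  }
}
```

Text outside an open annotation is skipped, and templates nested inside an annotation are ignored here; a real extractor would probably want to render those somehow.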

We can provide both a StringParser (https://github.com/dbpedia/extraction-framework/blob/ce8339360355c9b3fe7c8f803e38ebb016fcd79b/core/src/main/scala/org/dbpedia/extraction/dataparser/StringParser.scala) representation and a raw wikitext representation. The raw wikitext could be run through Commons' MediaWiki API if somebody needs it translated into HTML.
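For the HTML case, the raw wikitext could be sent to the `action=parse` module of the MediaWiki API on Commons. A rough sketch that only builds the request URL and makes no HTTP call; `CommonsParseSketch` and `parseUrl` are made-up names, and the choice of parameters is my assumption about what a consumer would want:

```scala
import java.net.URLEncoder

object CommonsParseSketch {
  // Standard api.php entry point for Wikimedia Commons.
  val ApiEndpoint = "https://commons.wikimedia.org/w/api.php"

  /** Builds an action=parse URL asking the API to render `wikitext` as HTML. */
  def parseUrl(wikitext: String): String = {
    val params = Seq(
      "action" -> "parse",
      "format" -> "json",
      "prop" -> "text",            // only return the rendered HTML
      "contentmodel" -> "wikitext",
      "text" -> wikitext
    )
    // URL-encode each value so that brackets, pipes and spaces survive.
    val query = params
      .map { case (k, v) => s"$k=${URLEncoder.encode(v, "UTF-8")}" }
      .mkString("&")
    s"$ApiEndpoint?$query"
  }
}
```

Long annotations would be better sent as a POST body than in the query string, but the parameters would stay the same.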

gaurav commented 10 years ago

We should use the PageNode. A WikiPage could be used instead, but the regex would get very complicated once it has to deal with variable spacing and similar quirks in the raw wikitext. Working from the PageNode should be pretty straightforward.

gaurav commented 10 years ago

[x] E-mail public-lod and see if they have any suggestions on how to model this in RDF

gaurav / extraction-framework

Add an extractor for ImageAnnotations #31