dbpedia / extraction-framework

The software used to extract structured data from Wikipedia

specific sources like MLCC; titles of books vs royal titles #519

Open VladimirAlexiev opened 7 years ago

VladimirAlexiev commented 7 years ago

(Split from #516) http://dbpedia.org/page/Ivan_Asen_II_of_Bulgaria has dbp:title and dbp:titleDate extracted from an MLCC template far down in the References section (but not from the cite book templates):

* {{MLCC 
 | warning = 1
 | title-date=30 January 2011
 | title = Bulgaria: Ivan Asen II 1218–1241, Koloman I 1241–1246, Mihail II Asen 1246–1257

This is correctly not mapped to dbo:title (which holds the royal title, Tsar of Bulgaria), but on its own the extracted value is useless.

MLCC is in the category https://en.wikipedia.org/wiki/Category:Specific-source_templates and transcludes https://en.wikipedia.org/wiki/Template:Citation.

https://en.wikipedia.org/wiki/Category:Specific-source_templates lists about 600 templates that should be harvested. Each of them prefills some fields (e.g. MLCC prefills author "Charles_Cawley").
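
To make the "prefilled fields" idea concrete, here is a minimal Python sketch of how a harvester might merge a specific-source template's defaults with the parameters given on the page. It is not the extraction framework's code: `PREFILLED`, `parse_template` and `expand_citation` are hypothetical names, the parser only handles flat templates with no nested `{{...}}` or `[[...|...]]`, and the only default taken from this issue is the MLCC author.

```python
# Hypothetical defaults table, keyed by template name. Only the MLCC author
# is taken from the issue text; a real table would cover ~600 templates.
PREFILLED = {
    "MLCC": {"author": "Charles_Cawley"},
}


def parse_template(wikitext):
    """Split a flat '{{Name | key = value | ...' call into (name, params).

    Assumption: no nested templates or piped links inside the values.
    """
    body = wikitext.strip().lstrip("{").rstrip("}")
    parts = [part.strip() for part in body.split("|")]
    name = parts[0]
    params = {}
    for part in parts[1:]:
        if "=" in part:
            key, value = part.split("=", 1)
            params[key.strip()] = value.strip()
    return name, params


def expand_citation(wikitext):
    """Merge the template's prefilled fields with the page's explicit ones."""
    name, params = parse_template(wikitext)
    merged = dict(PREFILLED.get(name, {}))  # defaults first
    merged.update(params)                   # explicit parameters win
    merged["template"] = name               # keep provenance of the citation
    return merged


# The fragment quoted above, as it appears in this issue (it is cut off
# before the closing braces, which the parser tolerates).
sample = """{{MLCC
 | warning = 1
 | title-date=30 January 2011
 | title = Bulgaria: Ivan Asen II 1218–1241, Koloman I 1241–1246, Mihail II Asen 1246–1257
"""

record = expand_citation(sample)
print(record["author"], "|", record["title-date"])
```

With a record like this, the title and title date stay attached to a citation rather than to the article's subject, so they are not confused with a royal title.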

chile12 commented 7 years ago

I know that citations need extra mappings to be handled correctly, and those do not exist yet. Still, I'm not sure what to do with this issue, since it has no clear problem statement and seems to have no direct relation to an extraction-framework issue. I would suggest bringing this up as a mapping task after RML is introduced. I am very interested in mapping-based extraction of citations.