dbpedia / extraction-framework

The software used to extract structured data from Wikipedia

dataprop extractor: language doesn't handle lang tag sr-Cyrl #303

Open VladimirAlexiev opened 9 years ago

VladimirAlexiev commented 9 years ago

template: http://mappings.dbpedia.org/index.php/Template:PropertyMapping says:

property: http://mappings.dbpedia.org/index.php/OntologyProperty:Foaf:name

mapping: http://mappings.dbpedia.org/index.php?title=Mapping_fr:Infobox_Ville_de_Serbie&action=edit has

{{PropertyMapping | templateProperty = nom | ontologyProperty = foaf:name | language = fr }}
{{PropertyMapping | templateProperty = nom_cyrillique | ontologyProperty = foaf:name | language = sr-Cyrl }}

wiki page: https://fr.wikipedia.org/w/index.php?title=Požega_(Serbie)&action=edit has

| nom_cyrillique           = Пожега

result: http://mappings.dbpedia.org/server/extraction/fr/extract?title=Požega_(Serbie)&revid=&format=turtle-triples&extractors=custom

Maybe the dataprop extractor has the wrong idea of what a lang tag can be? The tag above is a valid lang tag meaning "lang=Serbian, script=Cyrillic".
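To make the shape of such a tag concrete, here is a minimal Python sketch that splits a tag into language, script, and region subtags. It is only an illustration: the regex covers a simplified subset of BCP 47 (no variants, extensions, or private-use subtags), it checks shape rather than registry membership, and the name `parse_lang_tag` is made up for this example and is not part of the extraction framework.

```python
import re

# Simplified BCP 47 shape: language[-Script][-REGION].
# Variants, extensions, and private-use subtags are deliberately left out.
LANG_TAG = re.compile(
    r"^(?P<language>[A-Za-z]{2,3})"
    r"(?:-(?P<script>[A-Za-z]{4}))?"
    r"(?:-(?P<region>[A-Za-z]{2}|[0-9]{3}))?$"
)


def parse_lang_tag(tag):
    """Return the subtags of a simple language tag, or None if it does not match.

    Hypothetical helper for illustration only; it checks the shape of the tag,
    not whether the subtags are registered with IANA.
    """
    match = LANG_TAG.match(tag)
    if match is None:
        return None
    # Keep only the subtags that are actually present in the tag.
    return {name: value for name, value in match.groupdict().items() if value is not None}


if __name__ == "__main__":
    for tag in ("fr", "sr-Cyrl", "sr_Cyrl"):
        print(tag, "->", parse_lang_tag(tag))
```

Under this sketch, "sr-Cyrl" splits into a language subtag and a script subtag, while a plain wikipedia-style code such as "fr" is just the special case with a language subtag only.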

VladimirAlexiev commented 9 years ago

This is critical, because we want to convert 10-15 language-specific properties to foaf:name with a lang tag: http://mappings.dbpedia.org/index.php/What%27s_in_a_Name#Language-specific_Names

VladimirAlexiev commented 9 years ago

Another interesting lang tag is "qqq-DZ" (meaning "language used in specific region: Algeria") in http://mappings.dbpedia.org/index.php?title=Mapping_fr:Infobox_Commune_d'Algérie&action=edit

VladimirAlexiev commented 9 years ago

I now see http://mappings.dbpedia.org/index.php/Template:PropertyMapping says: "we can define the language tag using the wikipedia language code".

But you should accept IANA lang tags, not wikipedia codes, since the language of a wikipedia does not limit the languages of the strings it can contain. E.g. frwiki talks about names in Serbian Cyrillic (sr-Cyrl), Gagauz (gag), Algerian (which is not a single language, ergo qqq-DZ), etc.
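As a rough sketch of the output I would expect, the Python below formats a foaf:name triple whose literal carries the tag from the mapping unchanged. This is not the framework's code: `turtle_literal` and `name_triple` are hypothetical helpers, and the subject IRI is a placeholder under example.org, not the IRI the extraction server actually produces.

```python
FOAF_NAME = "http://xmlns.com/foaf/0.1/name"


def turtle_literal(value, lang_tag):
    """Format a language-tagged string as a Turtle literal (hypothetical helper)."""
    # Escape the characters that cannot appear raw in a short Turtle string.
    escaped = (
        value.replace("\\", "\\\\")
        .replace('"', '\\"')
        .replace("\n", "\\n")
        .replace("\r", "\\r")
    )
    # The tag is passed through as given, whatever the wiki's own language is.
    return f'"{escaped}"@{lang_tag}'


def name_triple(subject_iri, value, lang_tag):
    """Build one foaf:name triple in Turtle syntax (hypothetical helper)."""
    return f"<{subject_iri}> <{FOAF_NAME}> {turtle_literal(value, lang_tag)} ."


if __name__ == "__main__":
    # Placeholder subject IRI, assumed for illustration only.
    subject = "http://example.org/resource/Pozega"
    print(name_triple(subject, "Пожега", "sr-Cyrl"))
    print(name_triple(subject, "Požega", "fr"))
```

The point of the sketch is only that the literal's tag comes from the mapping's language parameter and not from the language of the wikipedia being extracted.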

jimkont commented 9 years ago

This is a nice addition, but I'm not sure what it might break in the framework. @jcsahnwaldt any ideas? There are some comments in the file [1], probably by you.

[1] https://github.com/dbpedia/extraction-framework/blob/master/core/src/main/scala/org/dbpedia/extraction/util/Language.scala

VladimirAlexiev commented 9 years ago

@jimregan: at first glance, we need to add to nonIsoCodes at https://github.com/dbpedia/extraction-framework/blob/master/core/src/main/scala/org/dbpedia/extraction/util/Language.scala#L100 each of the language codes we dealt with at https://github.com/dbpedia/mappings-tracker/issues/15

But I'm not sure what these codes are used for:

jimregan commented 9 years ago

Ok, well that mapping needs to go. And never be mentioned again!

Nono314 commented 9 years ago

There are at least two problems with the current system:

So, even if you set language="gag" in the mapping, it will end up in the triples with an @tr language tag, which may not be what you expected...