Open VladimirAlexiev opened 9 years ago
@jimkont said: One problem is that we do not process embedded templates (Infobox musical artist), which is mainly a design issue. I am not aware who made it in the past; it is quite easy to change, but I am not sure of the implications of such a change. (Currently it extracts neither MilitaryPerson nor MusicalArtist: both are nested.)
Sometimes it helps to look at the state of the article at the time of extraction: http://en.wikipedia.org/w/index.php?title=Elvis_Presley&action=edit&oldid=606258011 DBpedia assigns a single type to each resource and creates separate resources for subsequent mapped templates if they are not direct subclasses/superclasses of the first mapped template. In this case we had an infobox Person followed by an infobox military person (not nested).
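To make the rule described above concrete, here is a minimal Python sketch of it. It is not the extraction framework's actual code: the toy class hierarchy, the function names, the `__1` suffix for extra resources, and the choice to keep the first template's class for the main resource are all assumptions made for illustration.

```python
# Hypothetical toy hierarchy (child -> parent); not loaded from the real DBpedia ontology.
PARENT = {
    "Person": None,
    "Artist": "Person",
    "MusicalArtist": "Artist",
    "MilitaryPerson": "Person",
}


def ancestors(cls):
    """Return cls followed by all of its superclasses, nearest first."""
    chain = []
    while cls is not None:
        chain.append(cls)
        cls = PARENT.get(cls)
    return chain


def related(a, b):
    """True if a and b are the same class or one is a superclass of the other."""
    return a in ancestors(b) or b in ancestors(a)


def assign_types(page, template_classes):
    """Assign one class per resource, given the mapped classes of a page's
    templates in page order.

    Assumption: the first mapped template types the main resource; a later
    template whose class is unrelated to it gets a separate resource with a
    hypothetical "<page>__<n>" name; a related one adds no new resource.
    """
    resources = {}
    extra = 0
    for cls in template_classes:
        if page not in resources:
            resources[page] = cls
        elif not related(cls, resources[page]):
            extra += 1
            resources[f"{page}__{extra}"] = cls
    return resources


print(assign_types("Elvis_Presley", ["Person", "MilitaryPerson"]))
print(assign_types("Example", ["MusicalArtist", "MilitaryPerson"]))
```

In the first call MilitaryPerson is a subclass of Person, so no second resource appears; in the second call the two classes sit on different branches, so the sketch creates a separate resource for the second template.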
The same problem "do not process embedded templates" causes https://github.com/dbpedia/mappings-tracker/issues/46
http://sourceforge.net/p/dbpedia/mailman/message/32867924/
@jimkont said: it is very trivial to change but needs testing... any volunteers from the community? I can provide an adapted version of the code and also dumps but someone needs to look at the data
Sure: Boyan can deploy it locally, and I’ll look at the data. Gimme test cases, so far I got:
Is the logic "pick one out of several disjoint classes" documented precisely somewhere? And use cases/test cases? @jcsahnwaldt?
I don't know but I have an uneasy feeling about such logic. If templateA says classA and templateB says classB, seems to me the extractor itself can't make an intelligent decision to drop one of them.
I don't think this feature is properly documented or tested, but the comments are pretty good:
It might seem reasonable to ascribe the types of all infoboxes to the main resource, but one prominent counterexample given in the comments is https://en.wikipedia.org/wiki/Volkswagen_Golf - lots of infoboxes describing specific Golf models. Attaching all their data to the main resource wouldn't be useful.
The code that extracts all templates from a page is in this branch https://github.com/jimkont/extraction-framework/tree/multi-template-mapping
I did some experiments in Dutch diffs: https://www.dropbox.com/sh/3gjfrou29lmgxad/AAB41KYHJyTSCu9jnbu4LmjVa?dl=0
triple stats:
8361422 nlwiki-20141209-instance-types.ttl.all
8469742 nlwiki-20141209-instance-types.ttl.top
12694261 nlwiki-20141209-mappingbased-properties.ttl.all
12717493 nlwiki-20141209-mappingbased-properties.ttl.top
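For scale, the gap between the `.top` and `.all` files can be computed directly. This is only a small arithmetic sketch: the four counts are copied from the listing above, while the dictionary layout and function name are made up here, and nothing is assumed about what the two suffixes mean.

```python
# Triple counts copied from the listing above.
COUNTS = {
    "instance-types": {"all": 8361422, "top": 8469742},
    "mappingbased-properties": {"all": 12694261, "top": 12717493},
}


def top_minus_all(counts):
    """Return, per dataset, how many more triples the .top file has than the .all file."""
    return {name: c["top"] - c["all"] for name, c in counts.items()}


for name, diff in top_minus_all(COUNTS).items():
    share = 100.0 * diff / COUNTS[name]["top"]
    print(f"{name}: {diff} triples ({share:.2f}% of .top)")
```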
From a superficial look, it mostly adds types to untyped resources due to the following mapping: http://mappings.dbpedia.org/index.php/Mapping_nl:Bronvermelding_anderstalige_Wikipedia These types look wrong, but it needs further investigation to see whether removing this mapping improves things.
It would be nice to test this in other languages and English
I see, that uses correspondingProperty & correspondingClass.
@jcsahnwaldt When there's explicit correspondingClass, the "pick one out of several disjoint classes" logic does not apply. But I'll read those source comments...
@jimkont "Bronvermelding anderstalige Wikipedia" means "Sources in other-language Wikipedias", eg
* {{Bronvermelding anderstalige Wikipedia|taal=de|titel=Archimedes|datum=20140414}}
* {{Bronvermelding anderstalige Wikipedia|taal=en|titel=Archimedes|datum=20140414}}
So these are stale (non-Wikidata) Interlanguage links. Quick killing is recommended.
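To see what such a template call actually carries, here is a small Python sketch that splits a flat template invocation into its name and named parameters. It is a hypothetical helper, not the framework's wikitext parser, and it assumes no nested templates or links inside the call.

```python
def parse_template(wikitext):
    """Split a flat {{Name|key=value|...}} call into (name, params).

    Assumption: the call contains no nested templates or piped links,
    so splitting on "|" is safe.
    """
    text = wikitext.strip()
    if not (text.startswith("{{") and text.endswith("}}")):
        raise ValueError("not a template invocation")
    parts = text[2:-2].split("|")
    name = parts[0].strip()
    params = {}
    for part in parts[1:]:
        key, sep, value = part.partition("=")
        if sep:  # keep named parameters only; positional ones are skipped
            params[key.strip()] = value.strip()
    return name, params


name, params = parse_template(
    "{{Bronvermelding anderstalige Wikipedia|taal=de|titel=Archimedes|datum=20140414}}"
)
print(name, params)
```

For the example above this yields the language (`taal`), the source article title (`titel`) and a date (`datum`).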
@boyan-simeonov: can you please install https://github.com/jimkont/extraction-framework/tree/multi-template-mapping locally so I can test it?
@roland-c can you check the Dutch diffs for possible errors?
The {{Bronvermelding anderstalige Wikipedia}} template should not be read as being an interlanguage link. It is there to comply with the CC-BY-SA license of the source material. It's probably more appropriate to compare it to, say, {{Cite web}}. I don't see any mappings for that, so it's probably inappropriate to have one for Bronvermelding_anderstalige_Wikipedia.
The diffs are full of errors (dbpo:Article) because of http://mappings.dbpedia.org/index.php/Mapping_nl:Bronvermelding_anderstalige_Wikipedia, which is now removed. A new diff including Elvis would really help to verify correct results.
Wikipedia has this: https://en.wikipedia.org/w/index.php?title=Elvis_Presley&action=edit
The newest extraction is here: http://mappings.dbpedia.org/server/extraction/en/extract?title=Elvis_Presley&revid=&format=turtle-triples&extractors=custom
Unfortunately, DBpedia processes only the first two infoboxes (Person and Military person) but not Musical artist. It even skips the instrument, background and genre fields from the third infobox (Musical artist). Gerard Kuys has remarked that DBpedia picks only one leaf class "to avoid contradictions". I can understand that various infoboxes scattered throughout the article could contribute nonsensical classes, especially if they have nonsense mappings like Mapping_el:Quote_box
However: