dbpedia / extraction-framework

The software used to extract structured data from Wikipedia

enable several classes per entity #341

Open VladimirAlexiev opened 9 years ago

VladimirAlexiev commented 9 years ago

Hamid Ghofrani [ghofham@gmail.com]: For Elvis_Presley, the DBpedia types are just:

* http://dbpedia.org/ontology/Agent
* http://dbpedia.org/ontology/MilitaryPerson
* http://dbpedia.org/ontology/Person

Wikipedia has this: https://en.wikipedia.org/w/index.php?title=Elvis_Presley&action=edit

{{Infobox person
| occupation   = Singer, actor
| module = {{Infobox military person
| module2 = {{Infobox musical artist
  | instrument   = Vocals, guitar, piano
   | background   = solo_singer
   | genre        = {{flat list|
*[[Rock and roll]]
*[[Pop music|pop]]
  ...

The newest extraction is here: http://mappings.dbpedia.org/server/extraction/en/extract?title=Elvis_Presley&revid=&format=turtle-triples&extractors=custom

Unfortunately, DBpedia processes only the first two infoboxes (Person and Military person) but not Musical artist. It even skips the instrument, background and genre fields from the third infobox (Musical artist). Gerard Kuys has remarked that DBpedia picks only one leaf class "to avoid contradictions". I can understand that various infoboxes scattered throughout the article could contribute nonsensical classes, especially if they have nonsense mappings like Mapping_el:Quote_box.

However:

VladimirAlexiev commented 9 years ago

@jimkont said: One problem is that we do not process embedded templates (Infobox musical artist), which is mainly a design issue. I don't know who made that choice in the past; it is quite easy to change, but I am not sure of the implications of such a change. (Currently it extracts neither MilitaryPerson nor MusicalArtist: both are nested.)
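To make "embedded templates" concrete, here is a minimal Python sketch (not the extraction framework's parser, which is written in Scala) that scans wikitext and reports each template with its nesting depth. The function name `find_templates` and the `SAMPLE` text are illustrative assumptions; the sample is a simplified, closed-off version of the Elvis_Presley infobox quoted above. The scan is naive: it does not handle `{{{parameter}}}` triple braces or `<nowiki>` sections.

```python
# Simplified wikitext, closed off for illustration (the quoted original is truncated).
SAMPLE = """{{Infobox person
| occupation   = Singer, actor
| module = {{Infobox military person}}
| module2 = {{Infobox musical artist
  | instrument   = Vocals, guitar, piano
  | background   = solo_singer
  | genre        = {{flat list|
*[[Rock and roll]]
*[[Pop music|pop]]
}}
}}
}}"""


def find_templates(wikitext):
    """Return a list of (depth, name) for every {{...}} template, nested ones included."""
    found = []
    depth = 0
    i = 0
    while i < len(wikitext):
        if wikitext.startswith("{{", i):
            depth += 1
            # The template name runs up to the first '|', '}' or newline.
            j = i + 2
            while j < len(wikitext) and wikitext[j] not in "|}\n":
                j += 1
            found.append((depth, wikitext[i + 2:j].strip()))
            i += 2
        elif wikitext.startswith("}}", i):
            depth = max(depth - 1, 0)
            i += 2
        else:
            i += 1
    return found


# Templates at depth > 1 are the embedded ones that a top-level-only pass would miss.
embedded = [name for depth, name in find_templates(SAMPLE) if depth > 1]
print(embedded)
```

An extractor that only looks at depth 1 sees `Infobox person` and nothing else, which is consistent with the nested infoboxes being skipped.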

Sometimes it helps to look at the state of the article at the time of extraction: http://en.wikipedia.org/w/index.php?title=Elvis_Presley&action=edit&oldid=606258011. DBpedia assigns a single type to each resource and creates separate resources for subsequent mapped templates if they are not direct subclasses/superclasses of the first mapped template. In this case we had an infobox Person followed by an infobox military person (not nested).
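The rule described above can be sketched in Python as follows. This is one possible reading of the comment, not the actual code in TemplateMapping.scala: "related" is interpreted here as one class being an ancestor of the other, the tiny `SUPERCLASS` hierarchy is a simplified assumption, and the `__1` suffix for separate resources is a hypothetical naming scheme used only for illustration.

```python
# Assumed, simplified class hierarchy: class -> superclass (None for a root).
SUPERCLASS = {
    "Agent": None,
    "Person": "Agent",
    "MilitaryPerson": "Person",
    "Artist": "Person",
    "MusicalArtist": "Artist",
    "Place": None,
}


def ancestors(cls, superclass):
    """Return cls followed by all of its superclasses, most specific first."""
    chain = []
    while cls is not None:
        chain.append(cls)
        cls = superclass[cls]
    return chain


def related(a, b, superclass):
    """True if one class is the other or an ancestor of the other."""
    return a in ancestors(b, superclass) or b in ancestors(a, superclass)


def assign(page, template_classes, superclass):
    """Map resource name -> set of types for the mapped templates of one page."""
    resources = {}
    first = None
    extra = 0
    for cls in template_classes:
        if first is None or related(cls, first, superclass):
            # First template, or a class related to it: type the main resource.
            first = first or cls
            resources.setdefault(page, set()).update(ancestors(cls, superclass))
        else:
            # Unrelated class: create a separate resource (hypothetical naming).
            extra += 1
            resources[f"{page}__{extra}"] = set(ancestors(cls, superclass))
    return resources


print(assign("Elvis_Presley", ["Person", "MilitaryPerson"], SUPERCLASS))
```

With Person followed by MilitaryPerson, the main resource ends up with Agent, Person and MilitaryPerson, which matches the types reported at the top of this issue. An unrelated class such as `Place` would go to a separate resource instead.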

VladimirAlexiev commented 9 years ago

The same problem "do not process embedded templates" causes https://github.com/dbpedia/mappings-tracker/issues/46

VladimirAlexiev commented 9 years ago

http://sourceforge.net/p/dbpedia/mailman/message/32867924/

@jimkont said: it is very trivial to change but needs testing... any volunteers from the community? I can provide an adapted version of the code and also dumps but someone needs to look at the data

Sure: Boyan can deploy it locally, and I’ll look at the data. Gimme test cases, so far I got:

Is the logic "pick one out of several disjoint classes" documented precisely somewhere? And use cases/test cases? @jcsahnwaldt?

I don't know but I have an uneasy feeling about such logic. If templateA says classA and templateB says classB, seems to me the extractor itself can't make an intelligent decision to drop one of them.

jcsahnwaldt commented 9 years ago

I don't think this feature is properly documented or tested, but the comments are pretty good:

https://github.com/dbpedia/extraction-framework/blob/master/core/src/main/scala/org/dbpedia/extraction/mappings/TemplateMapping.scala#L43

jcsahnwaldt commented 9 years ago

It might seem reasonable to ascribe the types of all infoboxes to the main resource, but one prominent counterexample given in the comments is https://en.wikipedia.org/wiki/Volkswagen_Golf - lots of infoboxes describing specific Golf models. Attaching all their data to the main resource wouldn't be useful.

jimkont commented 9 years ago

The code that extracts all templates from a page is in this branch https://github.com/jimkont/extraction-framework/tree/multi-template-mapping

I did some experiments in Dutch diffs: https://www.dropbox.com/sh/3gjfrou29lmgxad/AAB41KYHJyTSCu9jnbu4LmjVa?dl=0

triple stats:

   8361422 nlwiki-20141209-instance-types.ttl.all
   8469742 nlwiki-20141209-instance-types.ttl.top
  12694261 nlwiki-20141209-mappingbased-properties.ttl.all
  12717493 nlwiki-20141209-mappingbased-properties.ttl.top

From a superficial look, it mostly adds types to untyped resources because of the following mapping: http://mappings.dbpedia.org/index.php/Mapping_nl:Bronvermelding_anderstalige_Wikipedia. These types look wrong, but it needs further investigation to see whether removing this mapping improves things.

It would be nice to test this in other languages and English.
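For a quick sense of scale, the differences between the counts in the triple stats above can be computed with a few lines of Python. The thread does not spell out what the `.all` and `.top` suffixes stand for, so this sketch only subtracts the numbers as given; the `COUNTS` layout and the `deltas` helper are illustrative.

```python
# Line counts copied from the triple stats above.
COUNTS = {
    "instance-types": {"all": 8361422, "top": 8469742},
    "mappingbased-properties": {"all": 12694261, "top": 12717493},
}


def deltas(counts):
    """Return {dataset: top - all} for each dataset."""
    return {name: c["top"] - c["all"] for name, c in counts.items()}


for name, delta in deltas(COUNTS).items():
    print(f"{name}: top - all = {delta}")
```

The two files differ by 108320 lines for instance types and 23232 lines for mapping-based properties.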

VladimirAlexiev commented 9 years ago

I see, that uses correspondingProperty & correspondingClass.

@jcsahnwaldt When there's explicit correspondingClass, the "pick one out of several disjoint classes" logic does not apply. But I'll read those source comments...

VladimirAlexiev commented 9 years ago

@jimkont "Bronvermelding anderstalige Wikipedia" means "Sources in other-language Wikipedias", e.g.

* {{Bronvermelding anderstalige Wikipedia|taal=de|titel=Archimedes|datum=20140414}}
* {{Bronvermelding anderstalige Wikipedia|taal=en|titel=Archimedes|datum=20140414}}

So these are stale (non-Wikidata) Interlanguage links. Quick killing is recommended.

@boyan-simeonov: can you please install https://github.com/jimkont/extraction-framework/tree/multi-template-mapping locally so I can test it?

VladimirAlexiev commented 9 years ago
jimkont commented 9 years ago

@roland-c can you check the Dutch diffs for possible errors?

frankgeerlings commented 9 years ago

The {{Bronvermelding anderstalige Wikipedia}} template should not be read as being an interlanguage link. It is there to comply with the CC-BY-SA license of the source material. It's probably more appropriate to compare it to, say, {{Cite web}}. I don't see any mappings for that, so it's probably inappropriate to have one for Bronvermelding_anderstalige_Wikipedia.

roland-c commented 9 years ago

The diffs are full of errors (dbpo:Article) because of http://mappings.dbpedia.org/index.php/Mapping_nl:Bronvermelding_anderstalige_Wikipedia, which is now removed. A new diff including Elvis would really help to verify correct results.