dbpedia / extraction-framework

The software used to extract structured data from Wikipedia

subClassOf is not expanded for all Places #414

Closed: VladimirAlexiev closed this issue 8 years ago

VladimirAlexiev commented 9 years ago

Querying http://live.dbpedia.org/sparql.

  1. A bunch of CelestialBodies don't have type dbo:Place. But http://live.dbpedia.org/ontology/CelestialBody is a subclass of dbo:Place.
prefix dbo: <http://dbpedia.org/ontology/>
select * {
  ?x a ?type. ?type rdfs:subClassOf+ dbo:Place
  filter not exists {?x a dbo:Place}
} limit 1000
  2. Apart from instances of dbo:CelestialBody, there are 3 more places that don't have type dbo:Place:
prefix dbo: <http://dbpedia.org/ontology/>
select * {
  ?x a ?type. ?type rdfs:subClassOf+ dbo:Place
  filter (?type != dbo:CelestialBody)
  filter not exists {?x a dbo:Place}
} limit 1000

x   type
http://dbpedia.org/resource/Beth_Israel_Medical_Center  http://dbpedia.org/ontology/ArchitecturalStructure
http://dbpedia.org/resource/Holy_Trinity_Orthodox_Seminary  http://dbpedia.org/ontology/ArchitecturalStructure
http://dbpedia.org/resource/The_Cable_Building_(New_York_City)  http://dbpedia.org/ontology/HistoricPlace
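
What the queries above check can also be sketched offline. The following is a minimal Python sketch, assuming a toy subclass map and toy instance data (the `ex:` resources and the contents of `SUBCLASS_OF` are illustrative, not taken from the DBpedia dumps): it computes the rdfs:subClassOf+ closure and reports the supertypes each instance is missing.

```python
# Toy subclass edges (child -> direct parents). Illustrative only,
# not the full DBpedia ontology.
SUBCLASS_OF = {
    "dbo:CelestialBody": {"dbo:Place"},
    "dbo:Planet": {"dbo:CelestialBody"},
    "dbo:ArchitecturalStructure": {"dbo:Place"},
    "dbo:HistoricPlace": {"dbo:Place"},
}


def ancestors(cls, subclass_of):
    """Return all transitive superclasses of cls (the rdfs:subClassOf+ closure)."""
    seen = set()
    stack = [cls]
    while stack:
        current = stack.pop()
        for parent in subclass_of.get(current, ()):
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen


def missing_supertypes(instance_types, subclass_of):
    """Map each instance to the inferred supertypes it does not assert."""
    missing = {}
    for instance, types in instance_types.items():
        inferred = set()
        for t in types:
            inferred |= ancestors(t, subclass_of)
        gap = inferred - set(types)
        if gap:
            missing[instance] = gap
    return missing
```

An instance typed only as `dbo:Planet` would be reported as missing both `dbo:CelestialBody` and `dbo:Place`, which is the kind of gap the first query surfaces.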

jimkont commented 9 years ago

The ontology snapshot for the 2015-04 release does not contain this relation (http://dbpedia.org/ontology/CelestialBody), so the static release is fine.

For DBpedia Live, the update should finish rolling out in a few days or weeks. Subclass relations are hard to feed into Live, so we rely on the unmodified feeder (which re-checks articles not extracted for, iirc, 60 days) to complete the change.

VladimirAlexiev commented 9 years ago

If you run the last query on http://dbpedia.org/sparql, it's worse: over 1000 places, including River, Stream, BodyOfWater, Country, PopulatedPlace, Settlement.

HeikoPaulheim commented 9 years ago

The latter two come from the heuristic typing method SDType, which uses ingoing properties to type instances. Both examples mentioned have ingoing relations of type almaMater, which tell the typing algorithm that they should be universities.

VladimirAlexiev commented 9 years ago

@HeikoPaulheim Then Heuristic Typing (partially?) implements rdfs:range, which is detrimental. Why: because Wikipedia editors put all kinds of shit in template fields, so most DBO ranges are wishful thinking. Before adding a type, Heuristic Typing must check whether the target already has some types, and assume that sibling types are disjoint.

If you don't do this, you'll wreak havoc by inferring that various countries are persons and vice versa. You'll also infer that these are people: Archbishop, Corfu, All My Children, Adoption, Kajang, Prehistory. See http://vladimiralexiev.github.io/pres/20150209-dbpedia/dbpedia-problems.html#/sec-7-3.
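
The check proposed above could look roughly like the following Python sketch. It is an assumption-laden illustration, not anything the extraction framework implements: `PARENT_OF` is a hypothetical single-parent toy hierarchy, and "sibling types are disjoint" is generalised here to "classes on different branches are disjoint".

```python
# Hypothetical toy hierarchy (child -> single parent); not the real DBpedia ontology.
PARENT_OF = {
    "dbo:Agent": "owl:Thing",
    "dbo:Person": "dbo:Agent",
    "dbo:Place": "owl:Thing",
    "dbo:PopulatedPlace": "dbo:Place",
    "dbo:Country": "dbo:PopulatedPlace",
}


def ancestors(cls, parent_of):
    """Walk a single-parent class tree upward and return all superclasses of cls."""
    result = []
    while cls in parent_of:
        cls = parent_of[cls]
        result.append(cls)
    return result


def conflicts(candidate, existing_types, parent_of):
    """Return the existing types that are assumed disjoint with candidate.

    Two classes are compatible only if they are equal or one is an ancestor
    of the other; otherwise they sit on different branches and are treated
    as disjoint.
    """
    candidate_line = set(ancestors(candidate, parent_of)) | {candidate}
    clashing = set()
    for t in existing_types:
        line = set(ancestors(t, parent_of)) | {t}
        if candidate not in line and t not in candidate_line:
            clashing.add(t)
    return clashing
```

With this toy tree, proposing `dbo:Person` for a resource already typed `dbo:Country` returns a non-empty conflict set, so the inferred type would be rejected rather than emitted.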

HeikoPaulheim commented 9 years ago

@VladimirAlexiev SDType learns type distributions as they are actually deployed in DBpedia, not as they are defined as rdfs:range in the ontology. In the almaMater example I gave, university is assumed because the vast majority of objects of that property have that type, not because it's defined in the ontology.
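
To make the idea concrete, here is a deliberately simplified Python sketch of learning type distributions from the data and voting on an untyped resource. It is not the actual SDType implementation: the unweighted averaging, the function names, the `ex:` resources, and the `threshold=0.5` default are all assumptions for illustration.

```python
from collections import Counter, defaultdict


def learn_type_distributions(triples, types):
    """For each property, estimate P(type | resource is the object of that property)."""
    counts = defaultdict(Counter)
    totals = Counter()
    for _subject, prop, obj in triples:
        if obj in types:  # learn only from objects that already have types
            totals[prop] += 1
            for t in types[obj]:
                counts[prop][t] += 1
    return {p: {t: n / totals[p] for t, n in c.items()} for p, c in counts.items()}


def suggest_types(resource, triples, types, distributions, threshold=0.5):
    """Average the per-property distributions over the incoming links of an untyped resource."""
    if types.get(resource):  # skip resources that already have types
        return {}
    incoming = [p for _s, p, o in triples if o == resource and p in distributions]
    if not incoming:
        return {}
    scores = Counter()
    for prop in incoming:
        for t, prob in distributions[prop].items():
            scores[t] += prob / len(incoming)
    # Keep only the types whose averaged score reaches the confidence threshold.
    return {t: s for t, s in scores.items() if s >= threshold}
```

If two of three typed objects of `dbo:almaMater` are universities, an untyped resource that is only ever the object of `dbo:almaMater` scores 2/3 for `dbo:University` in this sketch, regardless of what rdfs:range says in the ontology.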

Furthermore, on DBpedia, we only apply it to instances that do not have any types before, so no inconsistencies are introduced for those entities.

VladimirAlexiev commented 9 years ago

@HeikoPaulheim

  1. To infer type X for resource Y simply because Y is the target of a property P that is often used to target X, seems an awfully risky strategy to me.
  2. "only apply it to instances that do not have any types before": then who inferred dbo:PopulatedPlace for http://dbpedia.org/resource/FC_Minsk? It already has dbo:Agent, dbo:Organisation, dbo:SoccerClub, dbo:SportsTeam and a bunch of other types. (see this query for a lot more bad examples)
  3. If SDType infers several types for a resource, which ones do you emit?

HeikoPaulheim commented 9 years ago

@VladimirAlexiev

re 1: It has been shown empirically that this approach, combined with some post-processing, gives reasonable results for many use cases. The approach deployed for DBpedia is configured to achieve 95% precision.

re 2: The instance at hand actually did not have types before, see [1].

re 3: We apply a confidence threshold, but no consistency checking of the solution is used.

[1] http://downloads.dbpedia.org/preview.php?file=2015-04_sl_core-i18n_sl_en_sl_instance-types_en.nt.bz2

VladimirAlexiev commented 9 years ago

re 2. It uses "Football club infobox", which is mapped at http://mappings.dbpedia.org/index.php/Mapping_en:Infobox_football_club, so it does have types. See http://mappings.dbpedia.org/server/extraction/en/extract?title=FC_Minsk&format=turtle-triples. The URL you gave is a very small sampling.

HeikoPaulheim commented 9 years ago

Sorry, it was meant to be [1]. Still, in that file, there are no types for the instance, which is why SDType attempts to type it.

[1] http://downloads.dbpedia.org/2015-04/core-i18n/en/instance-types_en.nt.bz2
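
A claim like this can be checked against a locally downloaded copy of the dump. The sketch below is a minimal Python example under stated assumptions: the helper names are made up for illustration, the regular expression only handles simple IRI-only triples (no literals or blank nodes), and the file path is whatever you saved the dump as.

```python
import bz2
import re

RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"
# Matches simple IRI-only N-Triples lines: <s> <p> <o> .
TRIPLE = re.compile(r"^<([^>]*)>\s+<([^>]*)>\s+<([^>]*)>\s*\.\s*$")


def types_of(lines, subject):
    """Collect rdf:type objects asserted for exactly this subject IRI."""
    found = set()
    for line in lines:
        match = TRIPLE.match(line)
        if match and match.group(1) == subject and match.group(2) == RDF_TYPE:
            found.add(match.group(3))
    return found


def types_in_dump(path, subject):
    """Stream a local .nt.bz2 file line by line without unpacking it to disk."""
    with bz2.open(path, "rt", encoding="utf-8") as handle:
        return types_of(handle, subject)
```

Because the subject is compared exactly, a lookup for http://dbpedia.org/resource/FC_Minsk will not pick up types asserted on a differently named resource that merely shares the prefix.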

VladimirAlexiev commented 9 years ago

re 2: Right, instance-types_en.ttl doesn't define any types for FC_Minsk. It does, however, have a type for an IntermediateNode related to it:

<http://dbpedia.org/resource/FC_Minsk-2> <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://dbpedia.org/ontology/SoccerClub>

@jimkont: this is strange, because even the oldid page version that the current dbpedia.org data was extracted from has the Football club infobox template. That template has been mapped to the appropriate class practically forever.

SDTyped defines these for FC_Minsk:

<http://dbpedia.org/ontology/Organisation> .                           
<http://dbpedia.org/ontology/SoccerClub> .                             
<http://www.wikidata.org/entity/Q486972> .                             
<http://www.w3.org/2002/07/owl#Thing> .                                
<http://www.ontologydesignpatterns.org/ont/dul/DUL.owl#Agent> .        
<http://dbpedia.org/ontology/Agent> .                                  
<http://www.wikidata.org/entity/Q43229> .                              
<http://dbpedia.org/ontology/PopulatedPlace> .                         
<http://schema.org/SportsTeam> .                                       
<http://dbpedia.org/ontology/SportsTeam> .                             
<http://www.ontologydesignpatterns.org/ont/dul/DUL.owl#SocialPerson> . 
<http://schema.org/Organization> .                                     

re 3. It would be really nice to implement some disjointness.

re 1. I evaluated the first 11 items in SDTyped

1 wrong, 3 somewhat incomplete, 8 perfect. Cheers!