dbpedia / mappings-tracker

This project is used for tracking mapping issues in mappings.dbpedia.org

dbo:Country is full of non-country entities #72

Closed svick closed 8 years ago

svick commented 8 years ago

If you look at the list of entities of type dbo:Country, it contains many entries that are not countries and should not have this type. Among them:

For several of those, I believe the reason they are included there is that they are dbo:Country of some other entity (e.g. dbr:Jonathan_October's dbo:Country is dbr:Finland_national_cricket_team), but I'm not sure what is the cause of that.

jimkont commented 8 years ago

Looking at this, it looks like all the additional dbo:Country definitions originate from the SDTypes dataset by @HeikoPaulheim. It contains 372 Country statements, e.g.

<http://dbpedia.org/resource/Cinema_of_Israel> <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://dbpedia.org/ontology/Country> .
<http://dbpedia.org/resource/LEN> <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://dbpedia.org/ontology/Country> .
<http://dbpedia.org/resource/Ava_Kingdom> <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://dbpedia.org/ontology/Country> .
<http://dbpedia.org/resource/Venezuela_nationality_law> <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://dbpedia.org/ontology/Country> .
<http://dbpedia.org/resource/Argentine_nationality_law> <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://dbpedia.org/ontology/Country> .
<http://dbpedia.org/resource/Veigue_(Santa_Comba)> <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://dbpedia.org/ontology/Country> .
<http://dbpedia.org/resource/Cinema_of_South_Africa> <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://dbpedia.org/ontology/Country> .
<http://dbpedia.org/resource/Fengtian_clique> <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://dbpedia.org/ontology/Country> .
<http://dbpedia.org/resource/Musqueam_Indian_Band> <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://dbpedia.org/ontology/Country> .
<http://dbpedia.org/resource/Continental_Congress> <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://dbpedia.org/ontology/Country> .
<http://dbpedia.org/resource/Seleucid_Empire> <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://dbpedia.org/ontology/Country> .
<http://dbpedia.org/resource/United_States_Lighthouse_Board> <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://dbpedia.org/ontology/Country> .
<http://dbpedia.org/resource/Television_in_Australia> <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://dbpedia.org/ontology/Country> .
<http://dbpedia.org/resource/Serbian_nationality_law> <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://dbpedia.org/ontology/Country> .
<http://dbpedia.org/resource/List_of_British_light-heavyweight_boxing_champions> <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://dbpedia.org/ontology/Country> .
<http://dbpedia.org/resource/Italian_nationality_law> <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://dbpedia.org/ontology/Country> .
<http://dbpedia.org/resource/United_States_Ambassador_to_Lebanon> <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://dbpedia.org/ontology/Country> .
<http://dbpedia.org/resource/Norwegian_nationality_law> <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://dbpedia.org/ontology/Country> .
<http://dbpedia.org/resource/Voortrekkers> <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://dbpedia.org/ontology/Country> .
<http://dbpedia.org/resource/British_subject> <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://dbpedia.org/ontology/Country> .
<http://dbpedia.org/resource/Nordic_peoples> <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://dbpedia.org/ontology/Country> .
<http://dbpedia.org/resource/Nigeria_national_cricket_team> <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://dbpedia.org/ontology/Country> .
<http://dbpedia.org/resource/European_Volleyball_Confederation> <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://dbpedia.org/ontology/Country> .
<http://dbpedia.org/resource/Frankfurt_Parliament> <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://dbpedia.org/ontology/Country> .
<http://dbpedia.org/resource/Greater_S%C3%A3o_Paulo> <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://dbpedia.org/ontology/Country> .
<http://dbpedia.org/resource/Federales_(Argentina)> <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://dbpedia.org/ontology/Country> .
<http://dbpedia.org/resource/Cinema_of_Switzerland> <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://dbpedia.org/ontology/Country> .

@HeikoPaulheim, any comments?

HeikoPaulheim commented 8 years ago

Hi Dimitris & all,

Just to explain what is happening: we use heuristics to find missing types, and as always with heuristics, they may or may not do the right thing.

Just to pick the first example: there are two statements involved, i.e.,

dbr:Strawberry_Fields_(2006_film) dbo:country dbr:Cinema_of_Israel .
dbr:Trembling_Before_G-d dbo:country dbr:Cinema_of_Israel .

Now, dbo:country has rdfs:range dbo:Country, so it is even a correct entailment. We do not use RDFS reasoning but a softer kind of inference; although it is more tolerant to noise, it still follows the Garbage-In-Garbage-Out principle.
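To make the entailment concrete, here is a minimal sketch of plain rdfs:range inference over tuples. It is not the SDTypes heuristic (which, as stated above, is softer than RDFS reasoning); the function name `entail_range_types` and the tuple representation are assumptions made for illustration only.

```python
from collections import defaultdict

RDF_TYPE = "rdf:type"
RDFS_RANGE = "rdfs:range"


def entail_range_types(triples):
    """Return the rdf:type triples implied by rdfs:range declarations."""
    # Collect the declared range classes for each property.
    ranges = defaultdict(set)
    for s, p, o in triples:
        if p == RDFS_RANGE:
            ranges[s].add(o)

    # Every object of a property with a declared range gets that range as type.
    inferred = set()
    for s, p, o in triples:
        for range_class in ranges.get(p, ()):
            inferred.add((o, RDF_TYPE, range_class))
    return inferred


triples = [
    ("dbo:country", RDFS_RANGE, "dbo:Country"),
    ("dbr:Strawberry_Fields_(2006_film)", "dbo:country", "dbr:Cinema_of_Israel"),
    ("dbr:Trembling_Before_G-d", "dbo:country", "dbr:Cinema_of_Israel"),
]
inferred = entail_range_types(triples)
print(inferred)
```

Both film statements lead to the same inferred triple, so the wrong type follows directly from the wrong object values.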

I just looked through the first examples in the list, and you always find a similar cause there.

Hth.

Cheers, Heiko

svick commented 8 years ago

Looking at the infobox of Strawberry Fields (2006 film), it contains this line:

| country        = [[Cinema of Israel|Israel]]

My guess is that a lot of the Garbage-In is caused by similar entries. What can be done to fix this?

HeikoPaulheim commented 8 years ago

Hi Petr,

Well, as I said, it's heuristics. We've tried to find a good trade-off between coverage and precision, so they are currently tuned towards finding as many statements as possible without falling below 95% precision across all classes.
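The coverage/precision trade-off can be illustrated with a toy threshold vote. This is a simplified sketch and not the algorithm from [1]: the probabilities, the uniform averaging, the function name `vote_types`, and the threshold values are all made up for illustration.

```python
from collections import defaultdict


def vote_types(incoming_properties, type_given_property, threshold):
    """Average P(type | property) over the properties pointing at a resource
    and keep the types whose averaged score reaches the threshold."""
    scores = defaultdict(float)
    for prop in incoming_properties:
        for rdf_type, prob in type_given_property.get(prop, {}).items():
            scores[rdf_type] += prob / len(incoming_properties)
    return {t for t, score in scores.items() if score >= threshold}


# Hypothetical distribution: how often the object of a property has a type.
type_given_property = {
    "dbo:country": {"dbo:Country": 0.9, "dbo:Organisation": 0.1},
    "dbo:team": {"dbo:SportsTeam": 0.8, "dbo:Country": 0.2},
}

# A resource that only ever appears as the object of dbo:country.
only_country = ["dbo:country", "dbo:country"]
# A resource that appears as the object of both properties.
mixed = ["dbo:country", "dbo:team"]

strict = vote_types(mixed, type_given_property, threshold=0.8)
lenient = vote_types(mixed, type_given_property, threshold=0.5)
confident = vote_types(only_country, type_given_property, threshold=0.8)
```

A higher threshold assigns fewer types (lower coverage) but drops the weakly supported ones. A resource such as a "Cinema of" article that is only ever used as the object of dbo:country still passes a strict threshold, which is why noisy input survives.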

I have some thoughts on how to reduce some of the noise in the first place, which I'll discuss with Dimitris.

The heuristics are described in [1], and I'm happy to take suggestions on how to improve them.

Cheers, Heiko

[1] http://www.heikopaulheim.com/docs/ijswis_2014.pdf

svick commented 8 years ago

I think that the heuristic that sees dbr:Strawberry_Fields_(2006_film) dbo:country dbr:Cinema_of_Israel and generates dbr:Cinema_of_Israel rdf:type dbo:Country is fine. It's the code that sees | country = [[Cinema of Israel|Israel]] and generates dbr:Strawberry_Fields_(2006_film) dbo:country dbr:Cinema_of_Israel that should be fixed.

But I don't know enough about DBpedia to know what the right way to fix that is.
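As a rough illustration of that step, the sketch below turns one infobox line into a triple by taking the wikilink target rather than its display label. It is not the DBpedia extraction framework's code; the regular expression, the function `infobox_line_to_triple`, and the `mapping` dictionary are assumptions for this example.

```python
import re

# Matches [[Target]], [[Target|label]] and [[Target#section|label]];
# group 1 is the link target (the article title).
WIKILINK = re.compile(r"\[\[([^\]|#]+)(?:#[^\]|]*)?(?:\|[^\]]*)?\]\]")


def infobox_line_to_triple(subject, line, mapping):
    """Map one '| key = [[Target|label]]' infobox line to a triple, or None."""
    key, sep, value = line.lstrip("| ").partition("=")
    prop = mapping.get(key.strip())
    match = WIKILINK.search(value)
    if not sep or prop is None or match is None:
        return None
    # The object is built from the link target, not from the visible label.
    target = match.group(1).strip().replace(" ", "_")
    return (subject, prop, "dbr:" + target)


mapping = {"country": "dbo:country"}  # hypothetical stand-in for a template mapping
triple = infobox_line_to_triple(
    "dbr:Strawberry_Fields_(2006_film)",
    "| country        = [[Cinema of Israel|Israel]]",
    mapping,
)
print(triple)
```

The reader of the article sees "Israel", but the extracted object is the "Cinema of Israel" page, which is how the non-country resource ends up as a dbo:country value.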

jimkont commented 8 years ago

The code takes what it can from Wikipedia https://en.wikipedia.org/w/index.php?title=Strawberry_Fields_(2006_film)&action=edit

If you look at the definition of the template (https://en.wikipedia.org/wiki/Template:Infobox_film), editors are supposed to put the country in that field. Here the Wikipedia editors were careless and linked a "Cinema of" article instead.

What DBpedia does is create mappings according to the template definitions; here the mapping generates dbo:country from the country field: http://mappings.dbpedia.org/index.php/Mapping_en:Infobox_film

So this error originates from Wikipedia, and Heiko's algorithm should be probably tuned for a higher precision to avoid such errors in following releases.

VladimirAlexiev commented 8 years ago

@HeikoPaulheim: it's a known problem that the object extractor doesn't respect ranges: http://vladimiralexiev.github.io/pres/20150209-dbpedia/dbpedia-problems-long.html#sec-7-3. This problem is very hard to fix.
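For illustration only, a range check could look like the sketch below, assuming the types of the objects were already known from some trusted source. That assumption is exactly what is hard to satisfy in practice; the function `split_by_range` and the example type assignments are hypothetical.

```python
def split_by_range(triples, ranges, known_types):
    """Split object triples into those whose object carries the declared
    range class and those that look suspect."""
    ok, suspect = [], []
    for s, p, o in triples:
        expected = ranges.get(p)
        # Properties without a declared range are accepted as-is;
        # objects with no known type count as suspect.
        if expected is None or expected in known_types.get(o, set()):
            ok.append((s, p, o))
        else:
            suspect.append((s, p, o))
    return ok, suspect


ranges = {"dbo:country": "dbo:Country"}
known_types = {  # hypothetical, trusted type assignments
    "dbr:Israel": {"dbo:Country"},
    "dbr:Finland_national_cricket_team": {"dbo:SportsTeam"},
}
triples = [
    ("dbr:Jonathan_October", "dbo:country", "dbr:Finland_national_cricket_team"),
    ("dbr:Strawberry_Fields_(2006_film)", "dbo:country", "dbr:Israel"),
]
ok, suspect = split_by_range(triples, ranges, known_types)
```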

It's also a known fact that RDFS reasoning over DBpedia is disastrous, since domains/ranges at present are wishful thinking.

"GIGO" is not an excuse to create more garbage. Wise people don't use RDFS with DBpedia. As jimkont says, "Heiko's algorithm should be probably tuned for a higher precision": otherwise wise people will stop loading the Heuristic Types as well.

@kidehen, I vote that until this is fixed, dbpedia.org should stop loading the Heuristic Types.

@jimkont, please reopen the issue.

kidehen commented 8 years ago

@VladimirAlexiev -- Problems like this are solved by loading questionable data into dedicated named graphs rather than http://dbpedia.org. That separation has no impact on the default browser pages, but it does enable clients to exclude those named graphs in queries that perform reasoning requiring high precision.

Going forward, there can be dedicated named graphs for TBox triples. Example:

http://dbpedia.org/fct/rdfdesc/usage.vsp?g=http%3A%2F%2Fdbpedia.org%2Fontology%2FLocation&tp=2&sid=20517&graph=http%3A%2F%2Fdbpedia.org%2Fresource%2Fclasses%23 -- which is scoped to http://dbpedia.org/resource/classes#
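A client-side query that skips such a graph might be built as in the sketch below. The graph IRI `http://example.org/graph/sdtypes` is a placeholder, since the thread does not name the graph that would hold the heuristic types, and `country_query` is a hypothetical helper that only builds the query string.

```python
def country_query(excluded_graphs):
    """Build a SPARQL query for dbo:Country instances that ignores
    type statements stored in the given named graphs."""
    lines = [
        "PREFIX dbo: <http://dbpedia.org/ontology/>",
        "SELECT DISTINCT ?s WHERE {",
        "  GRAPH ?g { ?s a dbo:Country . }",
    ]
    # One FILTER per excluded graph IRI.
    lines += [f"  FILTER (?g != <{iri}>)" for iri in excluded_graphs]
    lines.append("}")
    return "\n".join(lines)


SDTYPES_GRAPH = "http://example.org/graph/sdtypes"  # placeholder IRI
query = country_query([SDTYPES_GRAPH])
print(query)
```

A resource typed as dbo:Country in both the excluded graph and another graph is still returned, because the remaining graph asserts the type on its own.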

VladimirAlexiev commented 8 years ago

@kidehen I vote that the default pages and queries do not show dbo:Country types for thousands of entities that are not countries. I bet that 99.99% of dbpedia clients don't use named graphs in their queries.

I find it kind of ironic that ref [1] cited above is titled "Improving the Quality of Linked Data Using Statistical Distributions" :-)

jimkont commented 8 years ago

I think @kidehen is saying to put sdtypes on a separate graph that will have to be explicitly specified like with the pagerank statements, right Kingsley?

I am in favor of this approach if you all agree, and in that case we can also include the DBTax statements by @marfox the same way.

kidehen commented 8 years ago

Yep!

Regards,

Kingsley Idehen

marfox commented 8 years ago

Totally agree, that's exactly what I'm doing with the fact extractor datasets in the Italian endpoint.