dbpedia / extraction-framework

The software used to extract structured data from Wikipedia

some cats are not skos:Concept #385

Open VladimirAlexiev opened 9 years ago

VladimirAlexiev commented 9 years ago

Some categories on dbpedia.org don't have rdf:type skos:Concept. I discovered this while investigating skos:related. E.g., try this query:

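PREFIX skos: <http://www.w3.org/2004/02/skos/core#>
# Find skos:related pairs whose target ?y is not typed as skos:Concept.
select * {
  ?x skos:related ?y.
  # Uncomment to also require that the source ?x is not a skos:Concept:
  #filter not exists {?x a skos:Concept}.
  filter not exists {?y a skos:Concept}
} limit 200

jimkont commented 9 years ago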

We need a closer look at a few examples, but ?y is extracted from ?x, and the ?y page might not exist at extraction time.

VladimirAlexiev commented 9 years ago

Guess you're right. I added more restrictions, and it now returns only a few (50ish):

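PREFIX skos: <http://www.w3.org/2004/02/skos/core#>
# Keep only pairs where a resource lacks skos:Concept but still has
# at least one outgoing triple of its own (so it is not an empty stub).
select * {
  ?x skos:related ?y.
  filter ((not exists {?x a skos:Concept} && exists {?x ?px ?x1}) ||
          (not exists {?y a skos:Concept} && exists {?y ?py ?y1}))
}

jimkont commented 9 years ago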

For example, the following category is a deleted page in Wikipedia, so I guess this is due to Wikipedia: https://en.wikipedia.org/wiki/Category:Fictional close?

VladimirAlexiev commented 9 years ago

But how about, e.g., these results of my query: http://dbpedia.org/resource/Category:Wars_involving_the_Republic_of_China, http://dbpedia.org/resource/Category:Wars_involving_the_Ming_dynasty, http://dbpedia.org/resource/Category:Wars_involving_Qing_Dynasty

I checked the last one: it has only sameAs and skos:related.
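
That check can be repeated by listing every triple that touches the category, in either direction. This is only a sketch, assuming the public dbpedia.org endpoint; the variable names ?direction and ?other are illustrative, and the results may differ from what was seen when this comment was written.

```sparql
# List all triples where the category appears as subject or as object.
SELECT ?direction ?p ?other
WHERE {
  {
    <http://dbpedia.org/resource/Category:Wars_involving_Qing_Dynasty> ?p ?other .
    BIND("outgoing" AS ?direction)
  }
  UNION
  {
    ?other ?p <http://dbpedia.org/resource/Category:Wars_involving_Qing_Dynasty> .
    BIND("incoming" AS ?direction)
  }
}
```

If the only predicates returned are sameAs and skos:related, the resource was probably produced from references on other pages rather than from a category page of its own.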

I'd guess the reason for "page might not exist at the extraction time" is:

Dangling references (redlink categories) are not supposed to exist. There's an editorial policy: a category should first be created before being referenced. So these are editorial problems in Wikipedia. (That applies to both queries, not just the second one).

The extractor makes useful triples for some redlink articles (e.g. Person, Football_player, and name from a team roster, even for players in the roster whose pages do not exist).

But non-existing categories should be reported to Wikipedia, rather than producing triples (which are not useful). The question is whether you have a channel for reporting problems to Wikipedia...
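
As a starting point for such a report, the candidate categories could be collected with a query along these lines. It is a rough sketch that only looks at skos:related targets and treats "not typed as skos:Concept" as a hint of a redlink, which is an assumption and not a confirmed test; ?referrers is an illustrative name.

```sparql
PREFIX skos: <http://www.w3.org/2004/02/skos/core#>
# Candidate redlink categories: skos:related targets without skos:Concept,
# ranked by how many categories point at them.
SELECT ?y (COUNT(DISTINCT ?x) AS ?referrers)
WHERE {
  ?x skos:related ?y .
  FILTER NOT EXISTS { ?y a skos:Concept }
}
GROUP BY ?y
ORDER BY DESC(?referrers)
LIMIT 200
```

Each ?y would still need to be checked against Wikipedia before reporting, since a category may have been created or deleted after the extraction.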

jimkont commented 9 years ago

We don't, but we want to create one. This would be a good start, I guess. Does anyone have any ideas how to bootstrap this? Otherwise we can post this on the dbpedia-discussion / Wikidata list.

jimkont commented 9 years ago

@pigsonthewing can you suggest a way to report this to Wikipedia?