dbpedia / extraction-framework

The software used to extract structured data from Wikipedia

SKOS Category extracted produces some weird triples #711

Open kurzum opened 3 years ago

kurzum commented 3 years ago

Issue validity

Some explanation: DBpedia Snapshot is produced every three months, see Release Frequency & Schedule, which is loaded into http://dbpedia.org/sparql . During these three months, Wikipedia changes and also the DBpedia Information Extraction Framework receives patches. At http://dief.tools.dbpedia.org/server/extraction/en/ we host a daily updated extraction web service that can extract one Wikipedia page at a time. To check whether your issue is still valid, please enter the article name, e.g. Berlin or Joe_Biden here: http://dief.tools.dbpedia.org/server/extraction/en/ If the issue persists, please post the link from your browser here:

http://dief.tools.dbpedia.org/server/extraction/en/extract?title=Category%3APininfarina&revid=&format=trix&extractors=custom
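
For anyone who wants to reproduce the check for other pages, the link above can be assembled from its query parameters. This is only a minimal sketch: the helper name `extraction_url` is hypothetical, and the parameter names and defaults are simply copied from the URL in this issue, not taken from any documented API of the service.

```python
from urllib.parse import urlencode

# Endpoint as it appears in the link posted in this issue.
BASE = "http://dief.tools.dbpedia.org/server/extraction/en/extract"


def extraction_url(title, revid="", fmt="trix", extractors="custom"):
    """Build a single-page extraction link (hypothetical helper).

    The parameter names mirror the query string of the link in this issue;
    whether the service accepts other values is an assumption.
    """
    query = urlencode(
        {"title": title, "revid": revid, "format": fmt, "extractors": extractors}
    )
    return f"{BASE}?{query}"


print(extraction_url("Category:Pininfarina"))
```

Calling it with `Category:Pininfarina` yields the same link as above, with the colon percent-encoded as `%3A`.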

Error Description

Please state the nature of your technical emergency:

See title.

Pinpointing the source of the error

Where did you find the data issue? Non-exhaustive options are:

Should be one of these:

I also assume that the error is caused by these lines on Wikipedia (https://en.wikipedia.org/wiki/Category:Pininfarina):

{{commonscat|Pininfarina}}
{{Cat main|Pininfarina}}
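
To illustrate the assumption, here is a rough sketch of how a main article could be read from a `Cat main` template in category wikitext. It is not the code of the TopicalConceptsExtractor; the regular expression and the helper name `main_articles` are hypothetical and only cover the simple one-argument form shown above.

```python
import re

# Matches {{Cat main|Article}} and captures the first template argument.
# Simplified pattern for illustration; real wikitext has more variants.
CAT_MAIN = re.compile(
    r"\{\{\s*cat main\s*\|\s*([^|}]+?)\s*(?:\||\}\})", re.IGNORECASE
)


def main_articles(wikitext):
    """Return the article names referenced by Cat main templates."""
    return [name.replace(" ", "_") for name in CAT_MAIN.findall(wikitext)]


wikitext = "{{commonscat|Pininfarina}}\n{{Cat main|Pininfarina}}"
print(main_articles(wikitext))
```

Only the `Cat main` line contributes a result here; the `commonscat` template is ignored by this pattern.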

Details

please post the details

Wrong triples RDF snippet


| Subject | Predicate | Object |
| -- | -- | -- |
| http://dbpedia.org/resource/Category:Pininfarina | http://purl.org/dc/terms/subject | http://dbpedia.org/resource/Pininfarina |
| http://dbpedia.org/resource/Pininfarina | http://www.w3.org/1999/02/22-rdf-syntax-ns#type | http://www.w3.org/2004/02/skos/core#Concept |
> Expected / corrected RDF outcome snippet 
1. **Remove** the triple whose subject is http://dbpedia.org/resource/Pininfarina. It is easier if all extractors only produce triples with the extracted page as subject.

http://dbpedia.org/resource/Pininfarina | http://www.w3.org/1999/02/22-rdf-syntax-ns#type | http://www.w3.org/2004/02/skos/core#Concept

2. **Use** a custom property for linking a `Category:` page to its main article, because `dct:subject` is definitely the wrong one: it points in the wrong direction and its semantics are underspecified. I created `dbo:mainArticleForCategory` (http://mappings.dbpedia.org/index.php/OntologyProperty:MainArticleForCategory) for this:

http://dbpedia.org/resource/Category:Pininfarina dbo:mainArticleForCategory http://dbpedia.org/resource/Pininfarina
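
The two points above can be sketched as a small post-processing step over the extracted triples. This is only an illustration under assumptions: the function name `correct_triples` is hypothetical, triples are modelled as plain `(subject, predicate, object)` tuples, and every `dct:subject` triple of the category page is assumed to be a main-article link. The actual fix belongs inside the extractor itself.

```python
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"
SKOS_CONCEPT = "http://www.w3.org/2004/02/skos/core#Concept"
DCT_SUBJECT = "http://purl.org/dc/terms/subject"
# dbo: is the DBpedia ontology namespace, http://dbpedia.org/ontology/
MAIN_ARTICLE = "http://dbpedia.org/ontology/mainArticleForCategory"


def correct_triples(page_uri, triples):
    """Apply both proposed corrections to (subject, predicate, object) tuples."""
    corrected = []
    for subject, predicate, obj in triples:
        # Point 1: keep only triples whose subject is the extracted page.
        if subject != page_uri:
            continue
        # Point 2: replace dct:subject with the dedicated property.
        # Assumption: on a category page, dct:subject only links the main article.
        if predicate == DCT_SUBJECT:
            predicate = MAIN_ARTICLE
        corrected.append((subject, predicate, obj))
    return corrected


category = "http://dbpedia.org/resource/Category:Pininfarina"
article = "http://dbpedia.org/resource/Pininfarina"
wrong = [
    (category, DCT_SUBJECT, article),
    (article, RDF_TYPE, SKOS_CONCEPT),
]
for triple in correct_triples(category, wrong):
    print(" ".join(triple))
```

Applied to the two wrong triples from this issue, the sketch keeps a single triple that links the category to the article via `dbo:mainArticleForCategory`.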


>Example DBpedia resource URL(s)

> Other

jlareck commented 3 years ago

This data error was in the TopicalConceptsExtractor, so I removed the extraction of triples like:

http://dbpedia.org/resource/Pininfarina | http://www.w3.org/1999/02/22-rdf-syntax-ns#type | http://www.w3.org/2004/02/skos/core#Concept 

and changed `dct:subject` to `dbo:mainArticleForCategory` in one part of this extractor.

kurzum commented 2 years ago

TODO: