facebookresearch / KILT

Library for Knowledge Intensive Language Tasks
MIT License

mapping wikidata to wikipedia #43

Open PaulLerner opened 3 years ago

PaulLerner commented 3 years ago

the issue

Hi,

Several thousand (11020 exactly) Wikipedia articles in the KILT knowledge source share their Wikidata item with at least one other article (often one of the articles is a disambiguation page).
However, in all of the examples I’ve tried, the articles have distinct Wikidata items (when following the 'Wikidata item' link on their page).
For example, https://en.wikipedia.org/wiki/Ambale,_Chamarajanagar links to https://www.wikidata.org/wiki/Q4740999 and https://en.wikipedia.org/wiki/Ambale links to https://www.wikidata.org/wiki/Q48441930, but in KILT both link to Q4740999. Perhaps it is just a coincidence and these articles were updated after you collected the data, but I wondered how you did the mapping between Wikipedia and Wikidata?

Follow-up question: there is no indication in KILT of whether a Wikipedia article is a disambiguation page, is there?

disclaimer

I use the HF version of KILT, but I seriously doubt that it is the cause of the issue.

more examples of 1-many mappings in KILT

Each line gives a QID followed by the Wikipedia articles linked to it in KILT.

Q5584686 ['Gordi', 'Gordi (band)']      
Q2053115 ['Pat McGrath', 'Pat McGrath (make-up artist)']
Q48814857 ['Violence (Editors album)', 'Magazine (Editors song)', 'Hallelujah (So Low)']
Q7601873 ['Stargaze', 'StarGaze']
Q7976034 ['Wayne Bell', 'Wayne Bell (disambiguation)']
Q7345802 ['Robert Ironside (footballer)', 'Robert Ironside']
Q1227528 ['Directorate of Military Intelligence', 'Directorate of Military Intelligence (United Kingdom)']
Q195154 ['Rachel Corrie', 'Images of Rachel Corrie']
Q4409967 ['A modern fusion splicer', 'Fusion splicing']
Q2135463 ['Puhar, Nagapattinam', 'Puhar']
Q7207804 ['Pohádka', 'Pohádka (disambiguation)']
Q1660897 ['Inan', 'İnan']
Q7333686 ['John Boden', 'John Boden (cricketer)']
Q5552927 ['Gerry Mullan (footballer)', 'Gerry Mullan']
Q1335355 ['Brotula', 'Viviparous brotula']
Q12956809 ['Wild League (water polo)', 'Wild League']
Q19083 ['Kingdom of Iberia (antiquity)', 'Kingdom of Iberia']
Q933263 ['Ruby laser', 'A ruby laser']
Q7614510 ['Steven Bradbury (disambiguation)', 'Steven Bradbury']
Q5672371 ['Harry Simmons', 'Harry Simmons (baseball)']
Q6265672 ['John Westbury (MP)', 'John de Westbury']
Q7807635 ['Timpanogos', 'Timpanogos (disambiguation)']
Q42308 ['Occupation of Kharkiv', 'Kharkiv']
Q4740999 ['Ambale', 'Ambale, Chamarajanagar']
Q4770391 ['Another Weekend', 'Another Weekend (Five Star song)']
Q1890696 ['Mannerheim (family)', 'Mannerheim (disambiguation)']
Q2669498 ['Nacajuca', 'Nacajuca Municipality']
Q7680683 ['Tamanduateí (São Paulo Metro)', 'Tamanduateí (CPTM)']
Q1677353 ['Jackson Lake', 'Jackson Lake State Park']
Q26361 ['Podkamennaya Tunguska', 'Podkamennaya Tunguska River']
Q7286397 ['Rajpuri, Raigad', 'Rajpuri']
Q6575542 ['Gallery of United States Supreme Court composition templates', 'List of Justices of the Supreme Court of the United States by court composition']
Q3459340 ['Kaimri, Estonia', 'Kaimri']
Q7098625 ['Opposition (Malaysia)', 'Leader of the Opposition (Malaysia)']
Q1798855 ['La Hague', 'Cap de la Hague']
Q1361367 ['Erotikon', 'Erotikon (1920 film)']
Q16797896 ['San Carlos Bay (disambiguation)', 'San Carlos Bay']
Q18340038 ['Ishige (alga)', 'Ishige']
Q5237295 ['David McClure (footballer)', 'David McClure']
Q5537920 ['George Clancy (politician)', 'George Clancy']
Q5880780 ["Holland's Leaguer (play)", "Holland's Leaguer"]
Q2829213 ['Al-Fath ibn Khaqan (al-Andalus)', 'Al-Fath ibn Khaqan']
Q5432318 ['Falling Forward (Sandi Patty album)', 'Falling Forward']
Q5250281 ['Deep Roots (radio program)', 'Deep Roots']
Q5301815 ['Douglas Miller', 'Douglas Miller (Alberta politician)']
Q5503475 ['Friday (comics)', 'Friday (2000 AD)']
Q16931046 ['Red Oak Creek', 'Red Oak Creek (Trinity River)']
Q6833259 ['Michael Osborne (footballer)', 'Michael Osborne']
Q6396067 ['Kevin Corby', 'Kevin Corby (cricketer)']
Q2213258 ['Lode Wyns (athlete)', 'Lode Wyns']
Q7354485 ['Rock Is Dead', 'Rock Is Dead (The Doors song)']
Q2946994 ['Cesare Benedetti (disambiguation)', 'Cesare Benedetti']
Q6830341 ['Michael Flynn (disambiguation)', 'Michael Flynn']
Q7791556 ['Thomas Kirkpatrick', 'Thomas Kirkpatrick (Canadian politician)']
Q1459271 ['Similarity Matrix of Proteins', 'SIMAP']
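
A list like the one above can be produced by grouping article titles by QID. The following is only a minimal sketch: it assumes each knowledge-source record is a dict with a `wikipedia_title` key and a nested `wikidata_info["wikidata_id"]` key, which are assumed field names that may need adjusting to the copy of KILT being used. The `sample` records are hand-written for illustration.

```python
from collections import defaultdict


def one_to_many(records):
    """Return {QID: [titles]} for every QID shared by more than one article.

    `records` is any iterable of dicts. The keys `wikipedia_title` and
    `wikidata_info["wikidata_id"]` are assumptions about the record layout.
    """
    titles_by_qid = defaultdict(list)
    for record in records:
        info = record.get("wikidata_info") or {}
        qid = info.get("wikidata_id")
        if qid:  # skip articles that have no Wikidata item at all
            titles_by_qid[qid].append(record["wikipedia_title"])
    return {qid: titles for qid, titles in titles_by_qid.items() if len(titles) > 1}


# Hand-written records mirroring the Ambale case reported in this issue.
sample = [
    {"wikipedia_title": "Ambale", "wikidata_info": {"wikidata_id": "Q4740999"}},
    {"wikipedia_title": "Ambale, Chamarajanagar", "wikidata_info": {"wikidata_id": "Q4740999"}},
    {"wikipedia_title": "Kharkiv", "wikidata_info": {"wikidata_id": "Q42308"}},
    {"wikipedia_title": "No item", "wikidata_info": None},
]

for qid, titles in one_to_many(sample).items():
    print(qid, titles)
```

Because the function accepts any iterable, the records can be streamed rather than loaded into memory at once.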

fabiopetroni commented 3 years ago

Hey @PaulLerner,

Many thanks for the message. :) We actually never used the Wikidata mapping; it was a best-effort exercise, and it can definitely be improved! Any chance you could help in fixing it? :)

Thanks a lot, Fabio

PaulLerner commented 3 years ago

Thank you for your answer. I'm not sure I can be of much help, as I’ve never worked with a Wikipedia dump before :/ I guess the most straightforward way would be to follow the 'Wikidata item' link on each Wikipedia page, but again, maybe this was already done and the examples I found were fixed later by wiki contributors?

Also, it’d be nice to know whether a page is a disambiguation page, so one could apply a heuristic to discard them.
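
One possible way to get both pieces of information without a dump is the MediaWiki Action API: `prop=pageprops` exposes the `wikibase_item` property (the QID behind the 'Wikidata item' link) and a `disambiguation` property on disambiguation pages. The sketch below is not how KILT built its mapping. It reflects the current state of Wikipedia, which may differ from the snapshot KILT was built from. The helper names, the User-Agent string, and the `canned` response (hand-written, with the Ambale QIDs taken from this issue and a made-up disambiguation title) are all illustrative assumptions.

```python
import json
import urllib.parse
import urllib.request

API_URL = "https://en.wikipedia.org/w/api.php"


def build_query(titles):
    """Build the query string asking for the QID and disambiguation flag."""
    return urllib.parse.urlencode({
        "action": "query",
        "prop": "pageprops",
        "ppprop": "wikibase_item|disambiguation",
        "titles": "|".join(titles),  # the API limits how many titles fit in one request
        "redirects": 1,
        "format": "json",
        "formatversion": 2,
    })


def parse_pageprops(response):
    """Map each returned title to its QID (or None) and a disambiguation flag."""
    result = {}
    for page in response.get("query", {}).get("pages", []):
        props = page.get("pageprops", {})
        result[page["title"]] = {
            "qid": props.get("wikibase_item"),
            "is_disambiguation": "disambiguation" in props,
        }
    return result


def fetch(titles):
    """Query the live API (needs network access; not called in this sketch)."""
    request = urllib.request.Request(
        API_URL + "?" + build_query(titles),
        headers={"User-Agent": "kilt-qid-check/0.1 (example)"},  # hypothetical agent name
    )
    with urllib.request.urlopen(request) as reply:
        return parse_pageprops(json.load(reply))


# Hand-written response in the formatversion=2 layout, for illustration only.
canned = {"query": {"pages": [
    {"title": "Ambale", "pageprops": {"wikibase_item": "Q48441930"}},
    {"title": "Ambale, Chamarajanagar", "pageprops": {"wikibase_item": "Q4740999"}},
    {"title": "Example (disambiguation)", "pageprops": {"disambiguation": ""}},
    {"title": "Missing page", "missing": True},
]}}

for title, info in parse_pageprops(canned).items():
    print(title, info)
```

Results are keyed by the title the API returns, which may be a normalized or redirect-resolved form of the requested title.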

Finally, another heuristic is to consider only articles that are used as provenance in one of the datasets; this would give an accurate mapping at least for the relevant articles…
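
The provenance heuristic could be sketched as follows, assuming the task files are JSON Lines where each record has an `output` list whose entries may carry a `provenance` list with a `wikipedia_id`. The Wikipedia ids and the `mapping` dict below are made up for illustration. Filtering this way only narrows down which articles need a verified QID; it does not correct the QIDs by itself.

```python
import json


def provenance_ids(jsonl_lines):
    """Collect every wikipedia_id cited as provenance in a KILT-style JSONL file."""
    ids = set()
    for line in jsonl_lines:
        if not line.strip():
            continue  # tolerate blank lines
        record = json.loads(line)
        for output in record.get("output", []):
            for prov in output.get("provenance", []):
                if "wikipedia_id" in prov:
                    ids.add(str(prov["wikipedia_id"]))
    return ids


# Two hand-written records; the ids "123" and "456" are placeholders.
sample_lines = [
    json.dumps({"id": "q1", "input": "...", "output": [
        {"answer": "a", "provenance": [{"wikipedia_id": "123", "title": "Ambale, Chamarajanagar"}]},
    ]}),
    json.dumps({"id": "q2", "input": "...", "output": [{"answer": "b"}]}),
]

used = provenance_ids(sample_lines)

# Hypothetical wikipedia_id -> QID mapping taken from the knowledge source.
mapping = {"123": "Q4740999", "456": "Q4740999"}
filtered = {wid: qid for wid, qid in mapping.items() if wid in used}
print(filtered)
```

With a real file, `provenance_ids(open(path, encoding="utf-8"))` would work the same way, since the function only iterates over lines.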

Bests, Paul