dbpedia / extraction-framework

The software used to extract structured data from Wikipedia

Jandek's birthday in PersonData is xsd:gMonthDay, not xsd:date #163

Open twistedvisions opened 10 years ago

twistedvisions commented 10 years ago

The data seems a bit off here. In the live store, the birth date is shown as both a gMonthDay and a date: http://live.dbpedia.org/page/Jandek

But the persondata dump has only the gMonthDay:

<http://dbpedia.org/resource/Jandek> <http://dbpedia.org/ontology/birthDate> "--10-26"^^<http://www.w3.org/2001/XMLSchema#gMonthDay> <http://en.wikipedia.org/wiki/Jandek?oldid=545323015#section=External_links&relative-line=23&absolute-line=228> .
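
As a rough way to find other records like this one, the dump could be scanned for birth and death dates typed as gMonthDay. The sketch below is only an illustration: the regular expression, the `find_month_day_only` name, and the assumption that each statement sits on a single line with a typed literal as its object are mine, not part of the extraction framework.

```python
import re

# Matches the start of one dump line whose object is a typed literal.
# Anything after the datatype (such as the provenance URL) is ignored.
TRIPLE = re.compile(
    r'^<(?P<s>[^>]+)>\s+<(?P<p>[^>]+)>\s+"(?P<value>[^"]*)"\^\^<(?P<dt>[^>]+)>'
)

XSD = "http://www.w3.org/2001/XMLSchema#"
DATE_PROPS = {
    "http://dbpedia.org/ontology/birthDate",
    "http://dbpedia.org/ontology/deathDate",
}


def find_month_day_only(lines):
    """Yield (subject, property, value) for date properties typed gMonthDay."""
    for line in lines:
        m = TRIPLE.match(line)
        if m and m["p"] in DATE_PROPS and m["dt"] == XSD + "gMonthDay":
            yield m["s"], m["p"], m["value"]
```

Passing an open file handle for the dump as `lines` would list every affected resource without loading the whole file into memory.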

twistedvisions commented 10 years ago

Hope I'm not spamming you good folk with too many issues! Please let me know if I should let up...

twistedvisions commented 10 years ago

There is a similar issue with http://live.dbpedia.org/page/John_E._Wool

twistedvisions commented 10 years ago

http://live.dbpedia.org/page/Martin_Waldseem%C3%BCller also has an issue with his PersonData birthdate, though that could be because of the ", about" at the end?

| DATE OF BIRTH = 1470, about

twistedvisions commented 10 years ago

One last example I've looked at: http://live.dbpedia.org/page/Joe_Avezzano

has an infobox death date of:

| death_date ={{Dda|2012|4|5|1943|11|17|df=y}}

and a persondata death date of:

| DATE OF DEATH = April 5th 2012

The persondata parsed date is:

<http://dbpedia.org/resource/Joe_Avezzano> <http://dbpedia.org/ontology/deathDate> "--04-05"^^<http://www.w3.org/2001/XMLSchema#gMonthDay> <http://en.wikipedia.org/wiki/Joe_Avezzano?oldid=544283436#section=External_link&relative-line=21&absolute-line=185> .
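
My guess is that the ordinal suffix ("5th") is what trips the parser here, though I have not confirmed that in the DateTimeParser source. A minimal sketch of stripping such suffixes before parsing, assuming English month names and only two date layouts; `parse_full_date` is a hypothetical helper, not framework code:

```python
import re
from datetime import datetime

# An ordinal suffix directly after a digit: 1st, 2nd, 3rd, 5th, ...
ORDINAL = re.compile(r"(?<=\d)(st|nd|rd|th)\b", re.IGNORECASE)


def parse_full_date(text):
    """Return an ISO date string (xsd:date lexical form) or None."""
    cleaned = ORDINAL.sub("", text).replace(",", " ")
    cleaned = " ".join(cleaned.split())  # collapse repeated whitespace
    # Assumed layouts: "April 5 2012" and "5 April 2012".
    # %B matches English month names only under an English/C locale.
    for fmt in ("%B %d %Y", "%d %B %Y"):
        try:
            return datetime.strptime(cleaned, fmt).date().isoformat()
        except ValueError:
            continue
    return None
```

With this, "April 5th 2012" would parse as a full date, while a value such as "1470, about" would still return None and need separate handling.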

twistedvisions commented 10 years ago

Here is a CSV with all the parse issues (mostly date-related) for people from the start of the persondata file:

Event name, <internal id - ignore>, birth_date, death_date, link
'Jandek born', 293, '-10-26 00:00 BC', '-10-26 23:59 BC', 'http://en.wikipedia.org/wiki/Jandek'
'Isaac Newton died', 1199, '-03-31 00:00 BC', '-03-31 23:59 BC', 'http://en.wikipedia.org/wiki/Isaac_Newton'
'Isaac Newton born', 33776, '-01-04 00:00 BC', '-01-04 23:59 BC', 'http://en.wikipedia.org/wiki/Isaac_Newton'
'Edgar O''Ballance born', 194, '-07-17 00:00 BC', '-07-17 23:59 BC', 'http://en.wikipedia.org/wiki/Edgar_O''Ballance'
'Ray Panthaki born', 209, '-01-20 00:00 BC', '-01-20 23:59 BC', 'http://en.wikipedia.org/wiki/Ray_Panthaki'
'Mohamed Sissoko born', 50037, '-01-22 00:00 BC', '-01-22 23:59 BC', 'http://en.wikipedia.org/wiki/Mohamed_Sissoko'
'Ate Glow born', 539, '-03-10 00:00 BC', '-03-10 23:59 BC', 'http://en.wikipedia.org/wiki/Ate_Glow'
'Eduardo Cansino, Sr. died', 6509, '-12-24 00:00 BC', '-12-24 23:59 BC', 'http://en.wikipedia.org/wiki/Eduardo_Cansino,_Sr.'
'Herb Bernstein born', 1155, '-05-15 00:00 BC', '-05-15 23:59 BC', 'http://en.wikipedia.org/wiki/Herb_Bernstein'
'Urban Shocker died', 195, '-09-06 00:00 BC', '-09-06 23:59 BC', 'http://en.wikipedia.org/wiki/Urban_Shocker'
'The Scumfrog born', 1, '-10-03 00:00 BC', '-10-03 23:59 BC', 'http://en.wikipedia.org/wiki/The_Scumfrog'
'Jason Jones (singer) born', 1350, '-11-13 00:00 BC', '-11-13 23:59 BC', 'http://en.wikipedia.org/wiki/Jason_Jones_(singer)'
'Claudia Octavia died', 52926, '-06-08 00:00 BC', '-06-08 23:59 BC', 'http://en.wikipedia.org/wiki/Claudia_Octavia'
'Tom Underwood born', 8556, '-12-22 00:00 BC', '-12-22 23:59 BC', 'http://en.wikipedia.org/wiki/Tom_Underwood'
'Tom Underwood died', 7050, '-11-22 00:00 BC', '-11-22 23:59 BC', 'http://en.wikipedia.org/wiki/Tom_Underwood'
'K. P. Kesava Menon died', 1225, '-11-09 00:00 BC', '-11-09 23:59 BC', 'http://en.wikipedia.org/wiki/K._P._Kesava_Menon'
'Christopher Merret born', 38401, '-02-16 00:00 BC', '-02-16 23:59 BC', 'http://en.wikipedia.org/wiki/Christopher_Merret'
'John England (bishop) died', 1476, '-04-11 00:00 BC', '-04-11 23:59 BC', 'http://en.wikipedia.org/wiki/John_England_(bishop)'
'Hubert Taczanowski born', 720, '-10-01 00:00 BC', '-10-01 23:59 BC', 'http://en.wikipedia.org/wiki/Hubert_Taczanowski'
'Charles III of Naples died', 47265, '-02-24 00:00 BC', '-02-24 23:59 BC', 'http://en.wikipedia.org/wiki/Charles_III_of_Naples'
'Alagappa Chettiar died', 660, '-04-05 00:00 BC', '-04-05 23:59 BC', 'http://en.wikipedia.org/wiki/Alagappa_Chettiar'
'Yury Yershov born', 1527, '-05-01 00:00 BC', '-05-01 23:59 BC', 'http://en.wikipedia.org/wiki/Yury_Yershov'
'Jean-Frédéric Waldeck born', 545, '-03-16 00:00 BC', '-03-16 23:59 BC', 'http://en.wikipedia.org/wiki/Jean-Frédéric_Waldeck'
'Jason Smith (rugby league) born', 33594, '-03-14 00:00 BC', '-03-14 23:59 BC', 'http://en.wikipedia.org/wiki/Jason_Smith_(rugby_league)'
'Ronald Ferguson died', 37531, '-03-16 00:00 BC', '-03-16 23:59 BC', 'http://en.wikipedia.org/wiki/Ronald_Ferguson'
'Craig Lauzon born', 517, '-02-03 00:00 BC', '-02-03 23:59 BC', 'http://en.wikipedia.org/wiki/Craig_Lauzon'
'Barry Davies born', 384, '-10-24 00:00 BC', '-10-24 23:59 BC', 'http://en.wikipedia.org/wiki/Barry_Davies'
'John Chisum died', 4424, '-12-23 00:00 BC', '-12-23 23:59 BC', 'http://en.wikipedia.org/wiki/John_Chisum'
'Noah Lewis born', 26098, '-09-03 00:00 BC', '-09-03 23:59 BC', 'http://en.wikipedia.org/wiki/Noah_Lewis'
'Roy Hartsfield died', 7347, '-01-15 00:00 BC', '-01-15 23:59 BC', 'http://en.wikipedia.org/wiki/Roy_Hartsfield'
'Michael Kerr (lawyer) died', 384, '-04-14 00:00 BC', '-04-14 23:59 BC', 'http://en.wikipedia.org/wiki/Michael_Kerr_(lawyer)'
'Andrew Seow born', 616, '-01-01 00:00 BC', '-01-01 23:59 BC', 'http://en.wikipedia.org/wiki/Andrew_Seow'
'Princess Antoinette, Baroness of Massy died', 439, '-03-18 00:00 BC', '-03-18 23:59 BC', 'http://en.wikipedia.org/wiki/Princess_Antoinette,_Baroness_of_Massy'
'William Johnson (judge) born', 1476, '-12-17 00:00 BC', '-12-17 23:59 BC', 'http://en.wikipedia.org/wiki/William_Johnson_(judge)'
'Villaño IV born', 2931, '-04-09 00:00 BC', '-04-09 23:59 BC', 'http://en.wikipedia.org/wiki/Villaño_IV'
'Chris Oyakhilome born', 501, '-12-07 00:00 BC', '-12-07 23:59 BC', 'http://en.wikipedia.org/wiki/Chris_Oyakhilome'
'Ella Joyce born', 179, '-06-12 00:00 BC', '-06-12 23:59 BC', 'http://en.wikipedia.org/wiki/Ella_Joyce'
'Roma Ryan born', 134, '-01-20 00:00 BC', '-01-20 23:59 BC', 'http://en.wikipedia.org/wiki/Roma_Ryan'
'Marie Bigot born', 34086, '-03-03 00:00 BC', '-03-03 23:59 BC', 'http://en.wikipedia.org/wiki/Marie_Bigot'
'Derek Worlock died', 395, '-02-06 00:00 BC', '-02-06 23:59 BC', 'http://en.wikipedia.org/wiki/Derek_Worlock'
'Bob Eberly died', 11447, '-11-17 00:00 BC', '-11-17 23:59 BC', 'http://en.wikipedia.org/wiki/Bob_Eberly'
'Vasily Volsky died', 423, '-02-22 00:00 BC', '-02-22 23:59 BC', 'http://en.wikipedia.org/wiki/Vasily_Volsky'
'John T. McNicholas died', 250736, '-04-22 00:00 BC', '-04-22 23:59 BC', 'http://en.wikipedia.org/wiki/John_T._McNicholas'
'Jacqueline de Romilly died', 30794, '-12-18 00:00 BC', '-12-18 23:59 BC', 'http://en.wikipedia.org/wiki/Jacqueline_de_Romilly'
'Senya Fleshin died', 424, '-06-19 00:00 BC', '-06-19 23:59 BC', 'http://en.wikipedia.org/wiki/Senya_Fleshin'
'Miki Aihara born', 33375, '-06-10 00:00 BC', '-06-10 23:59 BC', 'http://en.wikipedia.org/wiki/Miki_Aihara'
'Simon Clifford born', 37363, '-11-27 00:00 BC', '-11-27 23:59 BC', 'http://en.wikipedia.org/wiki/Simon_Clifford'
'Lisle Nagel died', 43329, '-11-23 00:00 BC', '-11-23 23:59 BC', 'http://en.wikipedia.org/wiki/Lisle_Nagel'
'Rubén Valtierra born', 5278, '-12-26 00:00 BC', '-12-26 23:59 BC', 'http://en.wikipedia.org/wiki/Rubén_Valtierra'
'Patrick William Riordan died', 982, '-12-27 00:00 BC', '-12-27 23:59 BC', 'http://en.wikipedia.org/wiki/Patrick_William_Riordan'
'Mariah Carey born', 32485, '-03-27 00:00 BC', '-03-27 23:59 BC', 'http://en.wikipedia.org/wiki/Mariah_Carey'
'Henry Russell (musician) born', 40195, '-12-24 00:00 BC', '-12-24 23:59 BC', 'http://en.wikipedia.org/wiki/Henry_Russell_(musician)'
'Adamjee Peerbhoy died', 311, '-08-11 00:00 BC', '-08-11 23:59 BC', 'http://en.wikipedia.org/wiki/Adamjee_Peerbhoy'
'Adamjee Peerbhoy born', 311, '-08-13 00:00 BC', '-08-13 23:59 BC', 'http://en.wikipedia.org/wiki/Adamjee_Peerbhoy'
'John William McCormack died', 196, '-11-22 00:00 BC', '-11-22 23:59 BC', 'http://en.wikipedia.org/wiki/John_William_McCormack'
'Marcus Vipsanius Agrippa born', 1891, '-10-23 00:00 BC', '-10-23 23:59 BC', 'http://en.wikipedia.org/wiki/Marcus_Vipsanius_Agrippa'
'Philip VanBrugh born', 31864, '-01-31 00:00 BC', '-01-31 23:59 BC', 'http://en.wikipedia.org/wiki/Philip_VanBrugh'
'Joe Avezzano died', 754, '-04-05 00:00 BC', '-04-05 23:59 BC', 'http://en.wikipedia.org/wiki/Joe_Avezzano'
'Forbes Masson born', 37767, '-08-17 00:00 BC', '-08-17 23:59 BC', 'http://en.wikipedia.org/wiki/Forbes_Masson'
'Tom Russell born', 396, '-03-05 00:00 BC', '-03-05 23:59 BC', 'http://en.wikipedia.org/wiki/Tom_Russell'
'Karen Marie Moning born', 250736, '-11-01 00:00 BC', '-11-01 23:59 BC', 'http://en.wikipedia.org/wiki/Karen_Marie_Moning'
'Emma Calvé died', 1740, '-01-06 00:00 BC', '-01-06 23:59 BC', 'http://en.wikipedia.org/wiki/Emma_Calvé'
'Martin Waldseemüller died', 53656, '-03-16 00:00 BC', '-03-16 23:59 BC', 'http://en.wikipedia.org/wiki/Martin_Waldseemüller'
'Raadhika Sarathkumar born', 1249, '-08-21 00:00 BC', '-08-21 23:59 BC', 'http://en.wikipedia.org/wiki/Raadhika_Sarathkumar'
'George McElroy died', 112304, '-07-31 00:00 BC', '-07-31 23:59 BC', 'http://en.wikipedia.org/wiki/George_McElroy'
'Ijlal Haider Zaidi died', 535, '-03-23 00:00 BC', '-03-23 23:59 BC', 'http://en.wikipedia.org/wiki/Ijlal_Haider_Zaidi'
'Peter Lieberson died', 274401, '-04-23 00:00 BC', '-04-23 23:59 BC', 'http://en.wikipedia.org/wiki/Peter_Lieberson'
'Rowland V. Lee died', 5474, '-12-21 00:00 BC', '-12-21 23:59 BC', 'http://en.wikipedia.org/wiki/Rowland_V._Lee'
'Craig Northey born', 102665, '-02-09 00:00 BC', '-02-09 23:59 BC', 'http://en.wikipedia.org/wiki/Craig_Northey'
'Archie Primrose, Lord Dalmeny died', 518, '-11-11 00:00 BC', '-11-11 23:59 BC', 'http://en.wikipedia.org/wiki/Archie_Primrose,_Lord_Dalmeny'
'E. Elias Merhige born', 1155, '-06-14 00:00 BC', '-06-14 23:59 BC', 'http://en.wikipedia.org/wiki/E._Elias_Merhige'
'Ruggedman born', 264225, '-09-20 00:00 BC', '-09-20 23:59 BC', 'http://en.wikipedia.org/wiki/Ruggedman'
'Radie Harris died', 19157, '-02-22 00:00 BC', '-02-22 23:59 BC', 'http://en.wikipedia.org/wiki/Radie_Harris'
'Oliver Wilkes born', 2721, '-05-02 00:00 BC', '-05-02 23:59 BC', 'http://en.wikipedia.org/wiki/Oliver_Wilkes'
'Martin Denny died', 295, '-03-02 00:00 BC', '-03-02 23:59 BC', 'http://en.wikipedia.org/wiki/Martin_Denny'
'Ángel Maturino Reséndiz died', 27506, '-06-27 00:00 BC', '-06-27 23:59 BC', 'http://en.wikipedia.org/wiki/Ángel_Maturino_Reséndiz'
'Jeff Mayo died', 36408, '-04-17 00:00 BC', '-04-17 23:59 BC', 'http://en.wikipedia.org/wiki/Jeff_Mayo'
'Paul Coia born', 1643, '-06-19 00:00 BC', '-06-19 23:59 BC', 'http://en.wikipedia.org/wiki/Paul_Coia'
'Arnold Rice Rich born', 2016, '-03-28 00:00 BC', '-03-28 23:59 BC', 'http://en.wikipedia.org/wiki/Arnold_Rice_Rich'
'Karly Rothenberg born', 195, '-10-29 00:00 BC', '-10-29 23:59 BC', 'http://en.wikipedia.org/wiki/Karly_Rothenberg'
'Pompey died', 134843, '-09-28 00:00 BC', '-09-28 23:59 BC', 'http://en.wikipedia.org/wiki/Pompey'
'Cookie Gilchrist died', 560, '-01-10 00:00 BC', '-01-10 23:59 BC', 'http://en.wikipedia.org/wiki/Cookie_Gilchrist'
'Gennadi Gerasimov died', 423, '-09-14 00:00 BC', '-09-14 23:59 BC', 'http://en.wikipedia.org/wiki/Gennadi_Gerasimov'
'James Prince born', 293, '-03-12 00:00 BC', '-03-12 23:59 BC', 'http://en.wikipedia.org/wiki/James_Prince'
'Matthias Kleinheisterkamp died', 91978, '-04-29 00:00 BC', '-04-29 23:59 BC', 'http://en.wikipedia.org/wiki/Matthias_Kleinheisterkamp'
'J. Hyam Rubinstein born', 235769, '-03-07 00:00 BC', '-03-07 23:59 BC', 'http://en.wikipedia.org/wiki/J._Hyam_Rubinstein'
'Walter Payton died', 8223, '-11-01 00:00 BC', '-11-01 23:59 BC', 'http://en.wikipedia.org/wiki/Walter_Payton'
'Orkun Uşak born', 85465, '-11-05 00:00 BC', '-11-05 23:59 BC', 'http://en.wikipedia.org/wiki/Orkun_Uşak'
'María Alejandra Martín born', 1206, '-11-23 00:00 BC', '-11-23 23:59 BC', 'http://en.wikipedia.org/wiki/María_Alejandra_Martín'
'John E. Wool born', 35266, '-02-20 00:00 BC', '-02-20 23:59 BC', 'http://en.wikipedia.org/wiki/John_E._Wool'
'Daryl Hayott born', 38780, '-11-05 00:00 BC', '-11-05 23:59 BC', 'http://en.wikipedia.org/wiki/Daryl_Hayott'
'Giovanni Sostero died', 310, '-12-06 00:00 BC', '-12-06 23:59 BC', 'http://en.wikipedia.org/wiki/Giovanni_Sostero'
'Ravinder Pal Singh born', 311, '-06-07 00:00 BC', '-06-07 23:59 BC', 'http://en.wikipedia.org/wiki/Ravinder_Pal_Singh'
'Mark DeCarlo born', 179, '-06-23 00:00 BC', '-06-23 23:59 BC', 'http://en.wikipedia.org/wiki/Mark_DeCarlo'

ninniuz commented 10 years ago

Hi @twistedvisions, your contribution is more than welcome! The only problem is that I am finding it difficult to examine all those examples at the same time :'(

http://live.dbpedia.org/page/Jandek
http://live.dbpedia.org/page/John_E._Wool
http://live.dbpedia.org/page/Martin_Waldseem%C3%BCller
http://live.dbpedia.org/page/Joe_Avezzano

So to summarize, there are cases in which the DateTimeParser cannot cope with the messy data in the {{Persondata}} template, e.g.:

wbecker commented 10 years ago

@ninniuz Glad to hear it's not just spam. I'm uncovering a whole heap of these bad formats so I'll keep bombarding this issues list!

Regarding the gMonthDay issue: could you make it fall back to the year, instead of the month and day, when you have trouble parsing? The year alone describes a well-bounded period in time, whereas a month/day recurs every year, so it is not as useful!
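
To make the suggestion concrete, here is a rough sketch of the fallback order I have in mind, meant to run only after full-date parsing has already failed. The `fallback_literal` name, the regular expressions, and the assumption that a year is always a standalone four-digit number are mine; this is not how the DateTimeParser is actually structured.

```python
import re

XSD = "http://www.w3.org/2001/XMLSchema#"

MONTHS = {name: number for number, name in enumerate(
    ["january", "february", "march", "april", "may", "june", "july",
     "august", "september", "october", "november", "december"], start=1)}

# Assumption: a year is a standalone four-digit number.
YEAR = re.compile(r"\b(\d{4})\b")
# A word followed by a one- or two-digit day, optionally with an ordinal suffix.
MONTH_DAY = re.compile(r"\b([A-Za-z]+)\s+(\d{1,2})(?:st|nd|rd|th)?\b")


def fallback_literal(text):
    """Return (lexical value, datatype IRI), preferring gYear over gMonthDay."""
    year = YEAR.search(text)
    if year:
        return year.group(1), XSD + "gYear"
    month_day = MONTH_DAY.search(text)
    if month_day and month_day.group(1).lower() in MONTHS:
        month = MONTHS[month_day.group(1).lower()]
        day = int(month_day.group(2))
        return "--%02d-%02d" % (month, day), XSD + "gMonthDay"
    return None
```

Under these assumptions, "1470, about" would yield a gYear of 1470 and "April 5th 2012" a gYear of 2012, and only a value with no year at all, such as "October 26", would end up as a gMonthDay.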