MScientistCM / CMDuplicatesFinder

A tool to find duplicate entries in a cm0102 database.

Some duplicates missed by v0.5-beta, to decide where there is room for further improvement #10

Open MScientistCM opened 5 years ago

MScientistCM commented 5 years ago
David    Valenzuela    _NONE    14.12.1967    Spain    _NONE
David    Valenzuela Lozano    _NONE    14.12.1967    Spain    _NONE
Parviz    Garakhanov    _NONE    13.02.1979    Azerbaijan    _NONE
Pärviz    Karaxanov    _NONE    13.02.1979    Azerbaijan    _NONE
Craig    Malcolm    _NONE    31.12.1969    Scotland    _NONE
Craig    Malcolm    _NONE    30.12.1968    Scotland    _NONE
Antonio    Matas    _NONE    27.05.1971    Spain    _NONE
Antonio    Matas    _NONE    16.05.1972    Spain    _NONE
Danny    Stoney    _NONE    05.05.1979    Scotland    _NONE
Daniel    Stoney    _NONE    05.05.1979    Scotland    _NONE
Sandro    Keckel    _NONE    08.03.1984    Austria    _NONE
Sandro Julian    Keckel    _NONE    08.03.1984    Austria    _NONE
Khalifa Ayil    Al Naufli    _NONE    01.03.1967    Oman    _NONE
Khalifa Ayil Salim    Al-Naufali    Khalifa Ayil    01.03.1967    Oman    _NONE
Su-yong    Bae    Bae Su-yong    07.06.1981    South Korea    Gamba Osaka
Soo-Yong    Bae    Bae Soo-Yong    07.06.1981    South Korea    _NONE
José Manuel    Pane    _NONE    28.05.1976    Spain    _NONE
Jose Manuel    Pane Nieto    _NONE    28.05.1976    Spain    _NONE
Yacine    Bezzaz    _NONE    10.07.1964    Algeria    CS Constantine
Yassine    Bezzaz    _NONE    10.07.1964    Algeria    MC El Eulma
Jeff    Nzokira    _NONE    24.10.1970    Burundi    As Ali Sabieh
Jeff    Nzorika    _NONE    24.10.1970    Burundi    _NONE
Aleksandar    Lazarevic    _NONE    17.09.1974    Denmark    Hvidovre IF
Alexander    Lazarevic    _NONE    17.09.1974    Denmark    _NONE
Paco    Peña    _NONE    25.07.1961    Spain    _NONE
Francisco    Peña    _NONE    25.07.1961    Spain    _NONE
Vladislav    Levin    _NONE    28.03.1978    Russia    FC Bohemians Prag 1905
Vladislav    Lyovin    _NONE    28.03.1978    Russia    FC Vysocina Jihlava
Maksim    Kirsanov    _NONE    08.05.1970    Russia    FC Zugdidi
Maxim    Kirsanov    _NONE    08.05.1970    Russia    Vityaz Podolsk
Jonathan    Vervoort    _NONE    13.08.1976    Belgium    FCV Dender EH
Jonathan    Vervoort    _NONE    13.08.1971    Belgium    _NONE
Patrice    Feussi    _NONE    03.10.1969    Cameroon    Concordia Chiajna
Patrice    Feussi    _NONE    03.10.1967    Cameroon    _NONE
Thomas    Parada    _NONE    16.04.1979    France    Stade Laval
Thomas    Parada    _NONE    16.04.1980    France    _NONE
Máté    Czingráber    _NONE    13.06.1980    Hungary    Soproni VSE
Máté    Czingráber    _NONE    13.06.1979    Hungary    Vasas FC
Nicolò    Lini    _NONE    13.04.1976    Italy    SP Tre Fiori
Nicolò    Lini    _NONE    13.04.1977    Italy    USD Grumellese Calcio
Emmet    Friars    _NONE    14.09.1968    Northern Ireland    Limavady United
Emmet    Friars    _NONE    14.09.1972    Northern Ireland    _NONE
Corey    Wilson    _NONE    07.11.1976    Northern Ireland    Tasman United
Corey    Wilson    _NONE    07.11.1977    Northern Ireland    Institute FC
Ciaran    Summers    _NONE    16.04.1978    Scotland    Queen's Park FC
Ciaran    Summers    _NONE    16.04.1979    Scotland    _NONE
James    Creaney    _NONE    19.10.1971    Scotland    Annan Athletic FC
James    Creaney    _NONE    19.10.1972    Scotland    _NONE
Yefferson    Moreira    _NONE    07.03.1974    Uruguay    CA Rentistas
Yefferson    Moreira    _NONE    07.03.1973    Uruguay    El Tanque Sisley
Yeong-shin    Kim    Kim Yeong-shin    28.02.1969    South Korea    Gangwon FC
Young-Sin    Kim    Kim Young-Sin    28.02.1969    South Korea    _NONE
Il-su    Hwang    Hwang Il-su    08.08.1970    South Korea    Ulsan Hyundai
Il-Soo    Hwang    Hwang Il-Soo    08.08.1970    South Korea    _NONE
Seung-uh    Ryu    Ryu Seung-uh    17.12.1976    South Korea    Jeju United
Seung-Woo    Ryu    Ryu Seung-Woo    17.12.1976    South Korea    _NONE
Sang-un    Han    Han Sang-un    03.05.1969    South Korea    Busan IPark FC
Sang-Woon    Han    Han Sang-Woon    03.05.1969    South Korea    _NONE
Amit    Zenati    _NONE    02.04.1980    Israel    Maccabi Haifa
Amit    Zanetti    _NONE    02.07.1980    Israel    Maccabi Haifa
Nando    Quesada    _NONE    05.01.1977    Spain    Elche C.F.
Fernando    Quesada    _NONE    05.01.1977    Spain    UE Llagostera
Chakkit    Laptrakul    _NONE    02.12.1977    Thailand    Bangkok Glass FC
Jakkrit    Larbtrakool    _NONE    02.12.1977    Thailand    Bangkok Glass FC
Norman    Kloss    _NONE    19.06.1980    Germany    Bischofswerdaer FV 08
Norman    Kloß    _NONE    19.06.1980    Germany    Budissa Bautzen
Jin-uk    Jung    _NONE    28.05.1980    South Korea    _NONE
Jin-uk    Jeong    _NONE    28.05.1980    South Korea    FC Seoul
Mario    Vratovic    _NONE    01.02.1964    Croatia    _NONE
Mario    Vratovic    _NONE    09.12.1963    Croatia    _NONE
Dae-uk    Kim    Kim Dae-uk    23.11.1970    South Korea    FC Anyang
Dae-Wook    Kim    Kim Dae-Wook    23.11.1970    South Korea    Auckland City FC

Thanks Andrea for feedback.

Note: some of the pairs in the list above actually have two differences in their DOBs, although they appear with only one difference in the list. Maybe adjust the code to accept two differences in the DOB when the dates are otherwise very similar, as in these cases.
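A minimal sketch of what "two differences in the DOB" could mean, assuming dates are in the `DD.MM.YYYY` form used in the list above. The function names and the `max_diffs` parameter are hypothetical and are not taken from CMDuplicatesFinder; the idea is just to count how many of day, month and year differ.

```python
from datetime import date, datetime


def parse_dob(text: str) -> date:
    """Parse a DOB written as DD.MM.YYYY, as in the exported list."""
    return datetime.strptime(text, "%d.%m.%Y").date()


def dob_component_diffs(a: str, b: str) -> int:
    """Count how many of (day, month, year) differ between two DOBs."""
    da, db = parse_dob(a), parse_dob(b)
    pairs = ((da.day, db.day), (da.month, db.month), (da.year, db.year))
    return sum(x != y for x, y in pairs)


def dobs_similar(a: str, b: str, max_diffs: int = 2) -> bool:
    """Hypothetical rule: treat DOBs as similar if at most max_diffs parts differ."""
    return dob_component_diffs(a, b) <= max_diffs


# Craig Malcolm pair from the list: day and year differ.
print(dob_component_diffs("31.12.1969", "30.12.1968"))
# Jonathan Vervoort pair from the list: only the year differs.
print(dob_component_diffs("13.08.1976", "13.08.1971"))
```

A looser rule like this would probably only be safe when the names match exactly, since otherwise it is likely to raise the number of false positives.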

MScientistCM commented 5 years ago

Thanks Andrea for feedback below:

I have an idea about the two differences in DOB issue (but I didn't check it, so you may want to have a look).

I think the problem may be due to a known bug in the export function of both TransferTool and DBExporter, which leads to a one-day error in exported DOBs, here and there, with no evident pattern. In the past I tried to figure out why and how to correct it, with no success.

I think the problem is not present in the DBUpdater tool. I am actually using DB Updater export for all of my activities.

If that's the case, I think it is probably better for you to move to the CM0102 Updater tool export rather than altering the sensitivity of your algorithms to cope with the TransferTool bug, as that may increase false positives.

MScientistCM commented 5 years ago

Thanks Dermotron for feedback below:

If I remember correctly, archiebalduk figured out why this is. It was something along the lines of the numbers being stored as text and then being rounded up or down when converted to Excel. That's the complicated bit: although it's text, it behaves as a date, e.g. 01.06.1982 doesn't become 01.06.1983 but rather, "correctly", 31.05.1982.
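To illustrate the one-day shift described here, this is a rough sketch, assuming `DD.MM.YYYY` strings, of how a comparison could tolerate a DOB that is off by a single calendar day. The helper names are hypothetical and this is not how TransferTool, DBExporter or CMDuplicatesFinder actually handle dates.

```python
from datetime import date, datetime, timedelta


def parse_dob(text: str) -> date:
    """Parse a DOB written as DD.MM.YYYY."""
    return datetime.strptime(text, "%d.%m.%Y").date()


def within_one_day(a: str, b: str) -> bool:
    """Hypothetical check: True if the two DOBs are at most one calendar day apart."""
    return abs((parse_dob(a) - parse_dob(b)).days) <= 1


# Shifting 01.06.1982 back by one day rolls over the month boundary.
shifted = parse_dob("01.06.1982") - timedelta(days=1)
print(shifted.strftime("%d.%m.%Y"))  # 31.05.1982

print(within_one_day("01.06.1982", "31.05.1982"))  # True
print(within_one_day("01.06.1982", "01.06.1983"))  # False
```

Comparing real calendar distance rather than the day, month and year fields separately matters here, because a one-day error across a month or year boundary changes two or three fields at once.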