JeffSackmann / tennis_atp

ATP Tennis Rankings, Results, and Stats

ATP Rankings: Duplicate entries for some players #88

Closed lounerios closed 4 years ago

lounerios commented 5 years ago

I have found that some players have duplicate entries in the atp_rankings files for specific dates. The player has two entries with different points for a specific date.

The list of the player ids with duplicate entries: 101888 102336 103073 103149 103346 103382 103442 103504 103623 104252 104280 104377 104903 106274 109453 109462 109471 109488 109491 117758 124889 200713

Example of entries:
atp_rankings_90s.csv:19970915,1127,117758,4
atp_rankings_90s.csv:19970915,1127,117758,0
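For anyone who wants to reproduce this check straight from the CSV files, here is a minimal sketch. It assumes the column order seen in the example entries (date, ranking, player id, points); the function name `find_conflicting_points` and the third sample player (id 999999) are made up for illustration.

```python
import csv
import io
from collections import defaultdict

# First two rows are the example entries from this issue; the last two rows
# (player id 999999) are invented to show an exact duplicate.
SAMPLE = """\
19970915,1127,117758,4
19970915,1127,117758,0
19970915,1,999999,10
19970915,1,999999,10
"""

def find_conflicting_points(lines):
    """Return {(date, player_id): sorted points values} for every
    (date, player) pair that has more than one distinct points value."""
    seen = defaultdict(set)
    for row in csv.reader(lines):
        if len(row) < 4 or not row[0].isdigit():
            continue  # skip blank lines and a header row, if present
        date, _rank, player_id, points = row[:4]
        seen[(date, player_id)].add(points)
    return {key: sorted(vals) for key, vals in seen.items() if len(vals) > 1}

conflicts = find_conflicting_points(io.StringIO(SAMPLE))
for (date, player_id), points in sorted(conflicts.items()):
    print(date, player_id, points)
```

Exact duplicates (same points on both rows) are deliberately not reported here, since only one distinct points value is recorded for them.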

bazzaar commented 5 years ago

lounerios++, good spot :-) I see them too, and there are also quite a lot of exact duplicates (where date, id, ranking, and points are all the same).

Mostly they are scattered here and there, but there are two dates that stand out :

tennis=> select foo.date, count(*)
         from (select date, id, count(*)
               from atp_ranking
               group by date, id
               having count(*) > 1) foo
         group by foo.date
         having count(*) > 10
         order by foo.date;

date count
1990-01-01 17
2000-01-10 18

(2 rows)
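The same two-level count can be sketched without a database. This is only a rough Python equivalent of the query above, assuming the rows have already been read as (date, id) pairs; the function name `dates_with_many_duplicates` and the demo rows are invented for illustration.

```python
from collections import Counter

def dates_with_many_duplicates(rows, threshold=10):
    """rows: iterable of (date, player_id) tuples, one per ranking row.
    Returns {date: number of ids duplicated on that date} for dates where
    that number exceeds `threshold`."""
    per_pair = Counter(rows)  # how many rows each (date, id) pair has
    per_date = Counter(date for (date, _id), n in per_pair.items() if n > 1)
    return {date: n for date, n in sorted(per_date.items()) if n > threshold}

# Invented demo data: three ids duplicated on one date, one id on another.
rows = []
for pid in range(3):
    rows += [("1990-01-01", pid)] * 2
rows += [("2000-01-10", 7)] * 2
rows += [("2000-01-10", 8)]

result = dates_with_many_duplicates(rows, threshold=2)
print(result)
```

The inner `Counter` plays the role of the `group by date, id having count(*) > 1` subquery, and the second one counts the surviving ids per date.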

These are the duplicated rows for the date '1990-01-01':

date ranking id points
1990-01-01 1102 100830 1
1990-01-01 1102 100830 1
1990-01-01 1102 101146 1
1990-01-01 1102 101146 1
1990-01-01 1102 101435 1
1990-01-01 1102 101435 1
1990-01-01 1102 101450 1
1990-01-01 1102 101450 1
1990-01-01 1102 101821 1
1990-01-01 1102 101821 1
1990-01-01 1102 101850 1
1990-01-01 1102 101850 1
1990-01-01 1102 101850 1
1990-01-01 1102 101850 1
1990-01-01 1102 101865 1
1990-01-01 1102 101865 1
1990-01-01 1102 102080 1
1990-01-01 1102 102080 1
1990-01-01 1102 102257 1
1990-01-01 1102 102257 1
1990-01-01 1102 106696 1
1990-01-01 1102 106696 1
1990-01-01 1102 107868 1
1990-01-01 1102 107868 1
1990-01-01 1102 108290 1
1990-01-01 1102 108290 1
1990-01-01 1102 108290 1
1990-01-01 1102 108507 1
1990-01-01 1102 108507 1
1990-01-01 1102 109490 1
1990-01-01 1102 109490 1
1990-01-01 1102 117522 1
1990-01-01 1102 117522 1
1990-01-01 1102 124713 1
1990-01-01 1102 124713 1
1990-01-01 1102 124713 1

(36 rows)

I haven't found a pattern to their occurrence yet, if I do I'll post more here.

hope this helps, bazzaar

bazzaar commented 5 years ago

OK, some further info re. the atp rankings data duplication

I see a total of 443 separate ids that have duplication of some sort in the ATP ranking data, though for most of these it's just one or two occurrences. However, there are 26 ids that are duplicated in multiple ranking lists:

tennis=> select foo.id, bar.lname, bar.fname, bar.dob, bar.country, bar.hand, count(*) as dups
         from atp_player bar,
              (select date, id, count(*)
               from atp_ranking
               group by date, id
               having count(*) > 1) foo
         where bar.id = foo.id
         group by foo.id, bar.lname, bar.fname, bar.dob, bar.country, bar.hand
         having count(*) > 3
         order by bar.lname, bar.fname;

id lname fname dob country hand dups issue
109492 Beaskoetxea Etxabarr Jonathan 1985-05-31 ESP R 29 #98
109494 Guluzian Joseph 1981-11-15 SWE R 23 #100
103149 Hellstrom Mathias 1978-02-21 SWE R 5
200713 Iannaccone Federico 1999-03-26 ITA U 56 #89
109488 Johnson Chris 1987-08-29 USA U 35 #90
109493 Kurtovic Boris 1986-05-17 CRO U 29 #101
109462 Mcgregor Dane 1980-09-01 RSA U 37 #91
105686 Picco Francesco 1991-01-01 ITA U 4
105287 Regnat Phillip 1989-02-01 GER R 5
109453 Reza Amir IRI R 26 #92
121268 Ribeiro Frattini Felipe 1991-06-17 BRA R 11
103557 Rithiwattanapong Attapol 1980-05-04 THA U 18
102424 Sakamoto Masahide 1974-07-02 JPN R 4
105596 Satral Jan 1990-07-24 CZE R 17 #99
108858 Smith Jermaine 1979-01-31 JAM U 8
109586 Stark Philipp 1981-01-12 GER U 46 #96
109473 Stone Alex USA U 29 #94
103521 Stoppini Andrea 1980-02-29 ITA R 20 #97
102130 Sutter Jeremy 1972-11-08 USA R 5
102138 Tabares Alexander 1972-11-20 CUB R 4
101825 Tejada Manuel 1970-11-26 ESA R 17 #103
102721 Tombolini Alessandro 1976-02-03 ITA R 27 #95
101930 Troost Huib 1971-07-01 NED R 7
109491 Turini Andrea 1986-03-25 ITA U 52 #93
136381 Unknown Unknown USA U 20 delete
109471 Zlatnik Alexander 1979-08-23 AUT R 33 #102

(26 rows)

These may be cases in the data where an id is being shared by more than one player, or alternatively where the duplication might point towards a systematic error in the data collection.
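One rough way to tell the two cases apart is to split each id's duplicated dates into exact duplicates and conflicting ones. The sketch below does that; treating conflicting rows as a hint of a shared id and exact duplicates as a hint of a collection error is my assumption, not something the data confirms. The function name `classify_duplicates` is made up, and the sample rows are taken from the examples earlier in this thread.

```python
from collections import defaultdict

def classify_duplicates(rows):
    """rows: iterable of (date, ranking, player_id, points) tuples.
    Returns {player_id: {"exact": n, "conflicting": m}}, counting the dates
    on which that id appears more than once."""
    groups = defaultdict(list)
    for date, ranking, player_id, points in rows:
        groups[(date, player_id)].append((ranking, points))

    summary = defaultdict(lambda: {"exact": 0, "conflicting": 0})
    for (_date, player_id), values in groups.items():
        if len(values) < 2:
            continue  # not duplicated on this date
        # All copies identical -> exact duplicate; otherwise the rows disagree.
        kind = "exact" if len(set(values)) == 1 else "conflicting"
        summary[player_id][kind] += 1
    return dict(summary)

sample_rows = [
    ("1990-01-01", 1102, 100830, 1),
    ("1990-01-01", 1102, 100830, 1),
    ("1997-09-15", 1127, 117758, 4),
    ("1997-09-15", 1127, 117758, 0),
]
summary = classify_duplicates(sample_rows)
print(summary)
```

An id with many "conflicting" dates would be a candidate for a closer look as possible composite data; this is only a screening step, not proof either way.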

I'll investigate further and report back, here.

UPDATE : ok, so the ranking data records under the first id that I looked at :

id lname fname dob country hand count
200713 Iannaccone Federico 1999-03-26 ITA U 56

.. actually turns out to be the ranking data for two players all assigned to the one id, the other player being :

id lname fname dob country hand
? Forti Francesco 1999-07-26 ITA R

I suspect that a number of the 25 or so id's listed above will prove to have composite data like this, so I will create an individual issue for each one, and suggest the necessary changes to be made in order to fix the data.

Hope this helps, bazzaar

lounerios commented 5 years ago

Yes, you are right. I noticed that the wrong entries follow a pattern, so I agree that the rankings are for another player.

I started to remove the wrong entries, and I checked the ATP Tour's site in order to find the correct rankings.

Best Regards, Lucas

bazzaar commented 5 years ago

while it's tempting to see most of these as just simply duplicated rows, the picture is complicated by :

bazzaar commented 5 years ago

Upon investigating the ranking data under this id :

id lname fname dob country hand count_dupls
136381 Unknown Unknown USA U 20

It seems that the records collectively represent fragments of the ranking histories of 3 different 'unknown players'. It appears that the data for these 'unknown players' has been deleted in an incomplete pattern. I've tried to match the records to specific players, but to no avail. These data may have been dummy records, input for training purposes for example. I think the best course of action is to delete the ranking data under this id.

bazzaar commented 5 years ago

OK, as of commit fe99318, the 'duplicated' ranking records outlined above have been deleted, so I reckon we can close this issue. Refer to issue #109 for the players' ranking records that are now missing.