Closed: lounerios closed this issue 4 years ago
lounerios++, good spot :-) I see them too, and there are also quite a lot of exact duplicates (where date, id, ranking, and points are the same) as well.
Mostly they are scattered here and there, but there are two dates that stand out :
tennis=> select foo.date, count(*) from (select date, id, count(*) from atp_ranking group by date, id having count(*) > 1) foo group by foo.date having count(*) > 10 order by foo.date;
date | count |
---|---|
1990-01-01 | 17 |
2000-01-10 | 18 |
(2 rows)
These are the duplicated rows for date of '1990-01-01' :
date | ranking | id | points |
---|---|---|---|
1990-01-01 | 1102 | 100830 | 1 |
1990-01-01 | 1102 | 100830 | 1 |
1990-01-01 | 1102 | 101146 | 1 |
1990-01-01 | 1102 | 101146 | 1 |
1990-01-01 | 1102 | 101435 | 1 |
1990-01-01 | 1102 | 101435 | 1 |
1990-01-01 | 1102 | 101450 | 1 |
1990-01-01 | 1102 | 101450 | 1 |
1990-01-01 | 1102 | 101821 | 1 |
1990-01-01 | 1102 | 101821 | 1 |
1990-01-01 | 1102 | 101850 | 1 |
1990-01-01 | 1102 | 101850 | 1 |
1990-01-01 | 1102 | 101850 | 1 |
1990-01-01 | 1102 | 101850 | 1 |
1990-01-01 | 1102 | 101865 | 1 |
1990-01-01 | 1102 | 101865 | 1 |
1990-01-01 | 1102 | 102080 | 1 |
1990-01-01 | 1102 | 102080 | 1 |
1990-01-01 | 1102 | 102257 | 1 |
1990-01-01 | 1102 | 102257 | 1 |
1990-01-01 | 1102 | 106696 | 1 |
1990-01-01 | 1102 | 106696 | 1 |
1990-01-01 | 1102 | 107868 | 1 |
1990-01-01 | 1102 | 107868 | 1 |
1990-01-01 | 1102 | 108290 | 1 |
1990-01-01 | 1102 | 108290 | 1 |
1990-01-01 | 1102 | 108290 | 1 |
1990-01-01 | 1102 | 108507 | 1 |
1990-01-01 | 1102 | 108507 | 1 |
1990-01-01 | 1102 | 109490 | 1 |
1990-01-01 | 1102 | 109490 | 1 |
1990-01-01 | 1102 | 117522 | 1 |
1990-01-01 | 1102 | 117522 | 1 |
1990-01-01 | 1102 | 124713 | 1 |
1990-01-01 | 1102 | 124713 | 1 |
1990-01-01 | 1102 | 124713 | 1 |
(36 rows)
I haven't found a pattern to their occurrence yet; if I do, I'll post more here.
hope this helps, bazzaar
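To separate the two kinds of duplication seen in this thread (exact duplicates, where ranking and points also match, versus the same date and id with differing values), a minimal Python sketch might look like the following. It assumes rows are available as (date, ranking, id, points) tuples in the column order shown above; the function name `classify_duplicates` is hypothetical and the last sample row is made up for illustration.

```python
from collections import defaultdict

def classify_duplicates(rows):
    """Group rows by (date, id) and split the duplicated keys into two kinds.

    rows: iterable of (date, ranking, id, points) tuples.
    Returns (exact, conflicting), two dicts keyed by (date, id).
    """
    groups = defaultdict(list)
    for date, ranking, player_id, points in rows:
        groups[(date, player_id)].append((ranking, points))

    exact, conflicting = {}, {}
    for key, entries in groups.items():
        if len(entries) < 2:
            continue  # this (date, id) pair is not duplicated
        if len(set(entries)) == 1:
            exact[key] = entries        # every copy is identical
        else:
            conflicting[key] = entries  # at least one copy differs
    return exact, conflicting

sample_rows = [
    ("1990-01-01", 1102, 100830, 1),
    ("1990-01-01", 1102, 100830, 1),
    ("1997-09-15", 1127, 117758, 4),
    ("1997-09-15", 1127, 117758, 0),
    ("1990-01-01", 1, 999999, 2000),  # made-up, non-duplicated row
]
exact, conflicting = classify_duplicates(sample_rows)
print(sorted(exact))
print(sorted(conflicting))
```

Exact duplicates can probably be collapsed to one row safely, whereas conflicting ones need checking against another source before anything is deleted.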
OK, some further info re. the atp rankings data duplication
I see a total of 443 separate id's that have duplication of some sort in the atp ranking data, though for most of these it's just one or two occurrences. However, there are 26 id's which are duplicated in more than three ranking lists :
tennis=> select foo.id, bar.lname, bar.fname, bar.dob, bar.country, bar.hand, count(*) as dups from atp_player bar, (select date, id, count(*) from atp_ranking group by date, id having count(*) > 1) foo where bar.id = foo.id group by foo.id, bar.lname, bar.fname, bar.dob, bar.country, bar.hand having count(*) > 3 order by bar.lname, bar.fname;
id | lname | fname | dob | country | hand | dups | issue |
---|---|---|---|---|---|---|---|
109492 | Beaskoetxea Etxabarr | Jonathan | 1985-05-31 | ESP | R | 29 | #98 |
109494 | Guluzian | Joseph | 1981-11-15 | SWE | R | 23 | #100 |
103149 | Hellstrom | Mathias | 1978-02-21 | SWE | R | 5 | |
200713 | Iannaccone | Federico | 1999-03-26 | ITA | U | 56 | #89 |
109488 | Johnson | Chris | 1987-08-29 | USA | U | 35 | #90 |
109493 | Kurtovic | Boris | 1986-05-17 | CRO | U | 29 | #101 |
109462 | Mcgregor | Dane | 1980-09-01 | RSA | U | 37 | #91 |
105686 | Picco | Francesco | 1991-01-01 | ITA | U | 4 | |
105287 | Regnat | Phillip | 1989-02-01 | GER | R | 5 | |
109453 | Reza | Amir | | IRI | R | 26 | #92 |
121268 | Ribeiro Frattini | Felipe | 1991-06-17 | BRA | R | 11 | |
103557 | Rithiwattanapong | Attapol | 1980-05-04 | THA | U | 18 | |
102424 | Sakamoto | Masahide | 1974-07-02 | JPN | R | 4 | |
105596 | Satral | Jan | 1990-07-24 | CZE | R | 17 | #99 |
108858 | Smith | Jermaine | 1979-01-31 | JAM | U | 8 | |
109586 | Stark | Philipp | 1981-01-12 | GER | U | 46 | #96 |
109473 | Stone | Alex | | USA | U | 29 | #94 |
103521 | Stoppini | Andrea | 1980-02-29 | ITA | R | 20 | #97 |
102130 | Sutter | Jeremy | 1972-11-08 | USA | R | 5 | |
102138 | Tabares | Alexander | 1972-11-20 | CUB | R | 4 | |
101825 | Tejada | Manuel | 1970-11-26 | ESA | R | 17 | #103 |
102721 | Tombolini | Alessandro | 1976-02-03 | ITA | R | 27 | #95 |
101930 | Troost | Huib | 1971-07-01 | NED | R | 7 | |
109491 | Turini | Andrea | 1986-03-25 | ITA | U | 52 | #93 |
136381 | Unknown | Unknown | | USA | U | 20 | delete |
109471 | Zlatnik | Alexander | 1979-08-23 | AUT | R | 33 | #102 |
(26 rows)
These may be cases in the data where an id is being shared by more than one player, or alternatively where the duplication might point towards a systematic error in the data collection.
I'll investigate further and report back, here.
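For reference, the per-id count in the query above (how many ranking dates each id is duplicated on, keeping only ids with more than three) can be sketched outside the database too. This is only an illustration under assumptions: rows are (date, ranking, id, points) tuples, the function name `ids_with_repeated_duplication` is hypothetical, and the sample rows are invented rather than taken from the data.

```python
from collections import Counter

def ids_with_repeated_duplication(rows, min_dates=4):
    """Return {id: number of dates on which that id appears more than once},
    keeping only ids duplicated on at least min_dates dates
    (min_dates=4 mirrors the query's "having count(*) > 3")."""
    rows_per_key = Counter(
        (date, player_id) for date, _ranking, player_id, _points in rows
    )
    dup_dates_per_id = Counter(
        player_id for (_date, player_id), n in rows_per_key.items() if n > 1
    )
    return {pid: n for pid, n in dup_dates_per_id.items() if n >= min_dates}

# Invented rows: id 1 is duplicated on four dates, id 2 on only one.
sample_rows = []
for day in ("2000-01-03", "2000-01-10", "2000-01-17", "2000-01-24"):
    sample_rows += [(day, 900, 1, 5), (day, 950, 1, 3)]
sample_rows += [
    ("2000-01-03", 800, 2, 7),
    ("2000-01-03", 800, 2, 7),
    ("2000-01-10", 800, 2, 7),
]
result = ids_with_repeated_duplication(sample_rows)
print(result)  # {1: 4}
```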
UPDATE : ok, so the ranking data records under the first id that I looked at :
id | lname | fname | dob | country | hand | count |
---|---|---|---|---|---|---|
200713 | Iannaccone | Federico | 1999-03-26 | ITA | U | 56 |
.. actually turns out to be the ranking data for two players all assigned to the one id, the other player being :
id | lname | fname | dob | country | hand |
---|---|---|---|---|---|
? | Forti | Francesco | 1999-07-26 | ITA | R |
I suspect that a number of the 25 or so id's listed above will prove to have composite data like this, so I will create an individual issue for each one, and suggest the necessary changes to be made in order to fix the data.
Hope this helps, bazzaar
Yes, you are right. I noticed that the wrong entries follow a pattern, so I agree that the rankings are for another player.
I started to remove the wrong entries, and I checked the ATP Tour's site in order to find the correct rankings.
Best Regards, Lucas
While it's tempting to see most of these as simply duplicated rows, the picture is complicated by :
Upon investigating the ranking data under this id :
id | lname | fname | dob | country | hand | count_dupls |
---|---|---|---|---|---|---|
136381 | Unknown | Unknown | | USA | U | 20 |
It seems that the records collectively represent fragments of the ranking histories of 3 different 'unknown players'. It appears that these 'unknown players' data have been deleted in an incomplete pattern. I've tried to match the records to a specific player(s), but to no avail. These data may have been dummy records, input for training purposes for example. I think the best course of action is to delete the ranking data under this id.
OK, as of commit fe99318, the 'duplicated' ranking records outlined above have been deleted. So I reckon we can close this issue. Refer to issue #109 for the now missing players' ranking records.
I have found that some players have duplicate entries in the atp_rankings files for specific dates. The player has two entries with different points for a specific date.
The list of the player ids with duplicate entries: 101888 102336 103073 103149 103346 103382 103442 103504 103623 104252 104280 104377 104903 106274 109453 109462 109471 109488 109491 117758 124889 200713
Example of entries:
atp_rankings_90s.csv:19970915,1127,117758,4
atp_rankings_90s.csv:19970915,1127,117758,0
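A check along these lines can be run directly against the CSV files. The sketch below is one possible way to do it, not necessarily how the list above was produced. It assumes each line is `date,ranking,id,points` as in the example entries, and it skips any line whose first field is not numeric in case a file has a header row. The function name `find_conflicting_entries` is hypothetical, and the third sample line is made up.

```python
import csv
import io
from collections import defaultdict

def find_conflicting_entries(lines):
    """Return {(date, id): [points, ...]} for every (date, id) pair that
    appears with more than one distinct points value."""
    points_by_key = defaultdict(list)
    for row in csv.reader(lines):
        if len(row) < 4 or not row[0].isdigit():
            continue  # skip blank lines and a possible header row
        date, _ranking, player_id, points = row[:4]
        points_by_key[(date, player_id)].append(points)
    return {k: v for k, v in points_by_key.items() if len(set(v)) > 1}

# The first two lines are the example entries; the third is made up.
sample = io.StringIO(
    "19970915,1127,117758,4\n"
    "19970915,1127,117758,0\n"
    "19970915,1,999999,5000\n"
)
conflicts = find_conflicting_entries(sample)
print(conflicts)  # {('19970915', '117758'): ['4', '0']}
```

For a real file, the same function should accept an open file handle, for example `with open("atp_rankings_90s.csv", newline="") as f: find_conflicting_entries(f)`.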