Closed chiarazappala closed 2 years ago
The combination of tourney_id and match_num should be unique for every match. This is how I checked the men's singles ATP data:
atp_tennis %>% count(tourney_id, match_num) %>% filter(n > 1)
Unfortunately, I found exceptions. For example, at the 2019 Australian Open, two matches share the same tourney_id (2019-580) and the same match_num (164).
To tell them apart I had to take the round into account, too.
I hope that resolves the issue.
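For anyone following along, here is a minimal sketch of that workaround. It uses a two-row toy table holding only the key columns of the two colliding matches quoted later in this thread. The column name round is an assumption on my part, so adjust it to whatever your data frame uses.

```r
library(dplyr)

# Toy data: only the key columns of the two colliding 2019 Australian Open rows.
# The column name `round` is assumed; rename to match your data.
matches <- tibble(
  tourney_id = c("2019-580", "2019-580"),
  match_num  = c(164L, 164L),
  round      = c("R64", "Q2")
)

# Original check: tourney_id + match_num alone reports a duplicate
dupes_two_keys <- matches %>%
  count(tourney_id, match_num) %>%
  filter(n > 1)

# Adding round to the key separates the two matches
dupes_three_keys <- matches %>%
  count(tourney_id, match_num, round) %>%
  filter(n > 1)

nrow(dupes_two_keys)
nrow(dupes_three_keys)
```

On this toy table the first check flags one key and the second flags none. Whether round is enough for every tournament in the real files is something you would still need to verify.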
Can you clarify which data you are looking at? In atp_matches_2019.csv I can only find one row (line 188) for that particular combination of tourney_id and match_num:
2019-580,Australian Open,Hard,128,G,20190114,164,104925,1,,Novak Djokovic,R,188,SRB,31.6495550992,104542,,WC,Jo-Wilfried Tsonga,R,188,FRA,33.7440109514,6-3 7-5 6-4,5,R64,124,12,1,87,61,45,18,16,3,5,10,1,92,50,36,19,15,4,9,1,9135,177,290
What is the other match you found?
I also added the qualifying matches (atp_matches_qual_chall_2019.csv), and that is where I found the duplicates:
2019-580 Australian Open Hard 128 G 2019-01-14 164 104925 1.0 Novak Djokovic R 188.0 SRB 31.6495550992 104542 WC Jo-Wilfried Tsonga R 188.0 FRA 33.7440109514 6-3 7-5 6-4 5 R64 124.0 12.0 1.0 87.0 61.0 45.0 18.0 16.0 3.0 5.0 10.0 1.0 92.0 50.0 36.0 19.0 15.0 4.0 9.0 1.0 9135.0 177.0 290.0
2019-580 Australian Open Hard 128 G 2019-01-14 164 132283 1.0 Lorenzo Sonego R 191.0 ITA 23.6796714579 105216 Yuichi Sugita R 173.0 JPN 30.3216974675 6-1 6-2 3 Q2 62.0 5.0 1.0 46.0 25.0 22.0 12.0 8.0 0.0 0.0 1.0 1.0 45.0 30.0 17.0 6.0 7.0 5.0 9.0 104.0 549.0 145.0 378.0
Nice catch! I didn't consider duplicates across different data sources, though in practice it makes sense to combine the qualifiers and the main draw. I'm not sure whether match_num is intended to be unique across data files. Perhaps you could add a new variable like draw for your analysis to differentiate between qualifying and main?
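A rough sketch of that idea, assuming each file is tagged before the two are stacked. The toy tibbles below stand in for the two CSV files, and the column names winner_name and round are assumptions. Note that the qual_chall file name suggests it may hold more than qualifying rounds, so the label really means "came from that file".

```r
library(dplyr)

# Toy stand-ins for atp_matches_2019.csv and atp_matches_qual_chall_2019.csv.
# Column names other than tourney_id and match_num are assumed.
main_draw <- tibble(
  tourney_id  = "2019-580",
  match_num   = 164L,
  winner_name = "Novak Djokovic",
  round       = "R64"
)
qualifying <- tibble(
  tourney_id  = "2019-580",
  match_num   = 164L,
  winner_name = "Lorenzo Sonego",
  round       = "Q2"
)

# Tag each source with a draw label, then stack the two tables
combined <- bind_rows(
  main_draw  %>% mutate(draw = "main"),
  qualifying %>% mutate(draw = "qualifying")
)

# The uniqueness check now includes draw in the key
dupes <- combined %>%
  count(tourney_id, draw, match_num) %>%
  filter(n > 1)

nrow(dupes)
```

With real data you would replace the toy tibbles with the result of reading each CSV, keeping the same mutate step.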
“Blind” indexes are a double-edged sword: in some cases they can compromise your data. Take, for instance, the Hong Kong 1973 tournament. Tennis Abstract imported the data from somewhere (probably ATP, maybe ITF), and so did I. Later, ATP posted updated data, and it turned out that two players who won their R32 matches (Fred Stolle and Anand Amritraj) had their opponents swapped compared with the previous version of the draw. How would a numerical match index help anyone fix that? It wouldn't. You still have to revise the draw, either by manual lookup or by synthesizing a key from round, winner, loser, score, tournament, year, etc.
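As an illustration of the synthesized-key approach, here is a minimal sketch that builds a key from tournament, round, winner, and loser. The column names round, winner_name, and loser_name and the helper column match_key are assumptions for the example, and the two rows are the ones quoted earlier in this thread.

```r
library(dplyr)

# The two matches quoted above; column names are assumed for illustration
matches <- tibble(
  tourney_id  = c("2019-580", "2019-580"),
  round       = c("R64", "Q2"),
  winner_name = c("Novak Djokovic", "Lorenzo Sonego"),
  loser_name  = c("Jo-Wilfried Tsonga", "Yuichi Sugita")
)

# Build a content-based key; match_key is a hypothetical column name
keyed <- matches %>%
  mutate(match_key = paste(tourney_id, round, winner_name, loser_name, sep = "|"))

# 0 means no key occurs more than once
anyDuplicated(keyed$match_key)
```

A key like this changes whenever the underlying draw is corrected, which is the trade-off being described: it follows the content instead of an arbitrary number. If your data has player ID columns, those may be a sturdier choice than names.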
Ideally I would fix this, but it's not a priority. In the meantime, you can use @jwhastings's suggestion:
Perhaps you could add a new variable like draw for your analysis to differentiate between qualifying and main?
I noticed that there is no way to uniquely identify matches (e.g., no match ID). The match number is not unique for certain tourneys. Is there a way to fix this issue?