JeffSackmann / tennis_atp

ATP Tennis Rankings, Results, and Stats

How to identify matches? #166

Closed chiarazappala closed 2 years ago

chiarazappala commented 2 years ago

I noticed that there is no way to uniquely identify a match (e.g., there is no match ID column). match_num is not unique within certain tourneys. Is there a way to fix this?

jwhastings commented 2 years ago

The combination of tourney_id and match_num should be unique for every match. This is how I checked the men's singles ATP data:

library(dplyr)
atp_tennis %>% count(tourney_id, match_num) %>% filter(n > 1)  # zero rows back means every pair is unique

chiarazappala commented 2 years ago

Unfortunately, I found exceptions. For example, at the 2019 Australian Open, two matches share both tourney_id 2019-580 and match_num 164. I had to include round in the key as well. I hope that solves the issue.

jwhastings commented 2 years ago

Can you clarify which data you are looking at? In atp_matches_2019.csv I can find only one row (line 188) with that combination of tourney_id and match_num:

2019-580,Australian Open,Hard,128,G,20190114,164,104925,1,,Novak Djokovic,R,188,SRB,31.6495550992,104542,,WC,Jo-Wilfried Tsonga,R,188,FRA,33.7440109514,6-3 7-5 6-4,5,R64,124,12,1,87,61,45,18,16,3,5,10,1,92,50,36,19,15,4,9,1,9135,177,290

What is the other match you found?

chiarazappala commented 2 years ago

I also loaded the qualifying matches (atp_matches_qual_chall_2019.csv) and found duplicates:

2019-580 Australian Open Hard 128 G 2019-01-14 164 104925 1.0 Novak Djokovic R 188.0 SRB 31.6495550992 104542 WC Jo-Wilfried Tsonga R 188.0 FRA 33.7440109514 6-3 7-5 6-4 5 R64 124.0 12.0 1.0 87.0 61.0 45.0 18.0 16.0 3.0 5.0 10.0 1.0 92.0 50.0 36.0 19.0 15.0 4.0 9.0 1.0 9135.0 177.0 290.0

2019-580 Australian Open Hard 128 G 2019-01-14 164 132283 1.0 Lorenzo Sonego R 191.0 ITA 23.6796714579 105216 Yuichi Sugita R 173.0 JPN 30.3216974675 6-1 6-2 3 Q2 62.0 5.0 1.0 46.0 25.0 22.0 12.0 8.0 0.0 0.0 1.0 1.0 45.0 30.0 17.0 6.0 7.0 5.0 9.0 104.0 549.0 145.0 378.0

jwhastings commented 2 years ago

Nice catch! I didn't consider duplicates across different data sources, though it would make sense practically to combine the qualifiers and main draw together. I'm not sure if the intent is for match_num to be unique across data files. Perhaps you could add a new variable like draw for your analysis to differentiate between qualifying and main?
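A minimal Python sketch of the draw idea, using only the standard library. It assumes qualifying rounds are coded as Q followed by a digit (like the Q2 in the row above), so that QF (quarterfinal) is not mistaken for qualifying. The helper names `add_draw` and `duplicate_keys` are made up for illustration, and the two sample rows are trimmed to the columns the key needs, with `winner_id` and `loser_id` assumed as column names.

```python
import re
from collections import Counter

# Trimmed stand-ins for the two rows quoted in this thread.
main_rows = [
    {"tourney_id": "2019-580", "match_num": "164", "round": "R64",
     "winner_id": "104925", "loser_id": "104542"},
]
qual_rows = [
    {"tourney_id": "2019-580", "match_num": "164", "round": "Q2",
     "winner_id": "132283", "loser_id": "105216"},
]

# Assumption: qualifying rounds look like Q1, Q2, Q3. "QF" must not match.
QUAL_ROUND = re.compile(r"^Q\d+$")


def add_draw(row):
    """Return a copy of the row with a 'draw' field: 'qual' or 'main'."""
    tagged = dict(row)
    tagged["draw"] = "qual" if QUAL_ROUND.match(row["round"]) else "main"
    return tagged


def duplicate_keys(rows, columns):
    """List every combination of the given columns that occurs more than once."""
    counts = Counter(tuple(row[c] for c in columns) for row in rows)
    return [key for key, n in counts.items() if n > 1]


combined = [add_draw(row) for row in main_rows + qual_rows]

dup_without_draw = duplicate_keys(combined, ["tourney_id", "match_num"])
dup_with_draw = duplicate_keys(combined, ["tourney_id", "draw", "match_num"])

print(dup_without_draw)  # [('2019-580', '164')]
print(dup_with_draw)     # []
```

With the real files, the row lists could come from `csv.DictReader` instead of being typed in. Deriving draw from round rather than from the file name is a deliberate choice here, since the file name suggests the qual_chall file also holds challenger matches.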

Mick1303 commented 2 years ago

“Blind” indexes are a double-edged sword: in some cases they can compromise your data. Take, for instance, the Hong Kong 1973 tournament. Tennis Abstract imported data from somewhere (probably ATP, maybe ITF), and so did I. Later ATP posted updated data, and it turned out that two players who won their R32 matches (Fred Stolle and Anand Amritraj) had swapped opponents compared with the previous version of the draw. How would a numerical match index help anyone fix that? It wouldn't. You still have to revise the draw, either by manual lookup or by synthesizing a key from round, winner, loser, score, tournament, year, etc.
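To illustrate the synthesized-key idea, here is a rough Python sketch that compares two versions of a draw by content rather than by row number. Everything in it is hypothetical: the tournament id "T1", the placeholder player names, the helper names `content_key` and `changed_matches`, and the assumed column names `winner_name` and `loser_name`. It is not the actual Hong Kong 1973 data.

```python
def content_key(row):
    """Build a key from what happened in the match, not from a row counter."""
    return (row["tourney_id"], row["round"], row["winner_name"], row["loser_name"])


def changed_matches(old_rows, new_rows):
    """Return (keys only in the old version, keys only in the new version)."""
    old_keys = {content_key(r) for r in old_rows}
    new_keys = {content_key(r) for r in new_rows}
    return sorted(old_keys - new_keys), sorted(new_keys - old_keys)


# Hypothetical draw in which two R32 winners swap opponents between versions.
old_version = [
    {"tourney_id": "T1", "match_num": "1", "round": "R32",
     "winner_name": "Player A", "loser_name": "Player C"},
    {"tourney_id": "T1", "match_num": "2", "round": "R32",
     "winner_name": "Player B", "loser_name": "Player D"},
]
new_version = [
    {"tourney_id": "T1", "match_num": "1", "round": "R32",
     "winner_name": "Player A", "loser_name": "Player D"},
    {"tourney_id": "T1", "match_num": "2", "round": "R32",
     "winner_name": "Player B", "loser_name": "Player C"},
]

removed, added = changed_matches(old_version, new_version)

# match_num is identical in both versions, so it cannot reveal the swap.
same_numbers = ([r["match_num"] for r in old_version]
                == [r["match_num"] for r in new_version])

print(same_numbers)  # True
print(removed)       # the two pairings that exist only in the old version
print(added)         # the two pairings that exist only in the new version
```

The score could be added to the key as well. It is left out here on the assumption that score strings are more likely to differ in formatting between sources than names and rounds are.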

JeffSackmann commented 2 years ago

Ideally I would fix this, but it's not a priority. In the meantime, you can use @jwhastings's suggestion:

Perhaps you could add a new variable like draw for your analysis to differentiate between qualifying and main?