JeffSackmann / tennis_wta

WTA Tennis Rankings, Results, and Stats

Mojibake in wta_players.csv #18

Closed boffi closed 4 years ago

boffi commented 5 years ago

I have problems parsing wta_players.csv with Python 3's csv.reader, which ends up decoding the file's bytes as UTF-8 and fails on a few records.

The offending records, found by searching for non-ASCII characters in an editor, are the following:

212305,Joselyn Margarita,Treyes Albarrac纃N,,19970629,ECU
215238,Selin G羮Lseren,Simsek,U,19990509,TUR
221676,Ludmila Magal罸,Alvez,R,20011129,ARG
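
For anyone who prefers a script to an editor search, here is a minimal sketch of the same check. It flags every line that contains a byte outside the ASCII range. The helper name `find_non_ascii_lines` and the sample bytes are made up for illustration; in particular, the `\xed` byte is only an assumption about what a broken record might contain, not the actual content of the file.

```python
def find_non_ascii_lines(data: bytes):
    """Return (line_number, raw_line) pairs for lines holding any byte >= 0x80."""
    hits = []
    for number, line in enumerate(data.splitlines(), start=1):
        # Iterating over bytes yields integers, so this is a plain range check.
        if any(byte >= 0x80 for byte in line):
            hits.append((number, line))
    return hits


# Hypothetical sample: the second record carries one non-ASCII byte.
sample = (
    b"200001,Anna,Smith,R,19900101,USA\n"
    b"221676,Ludmila Magal\xed,Alvez,R,20011129,ARG\n"
)

for number, line in find_non_ascii_lines(sample):
    print(number, line)
```

On the real file the same function could be fed the result of `open("wta_players.csv", "rb").read()`; working on raw bytes avoids any decoding error while searching.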

It seems to me that they should be Albarracín, Magalí and possibly Gülseren, although the WTA site lists only Selin Simsek, without a middle name:

212305,Joselyn Margarita,Treyes Albarracín,,19970629,ECU
215238,Selin Gülseren,Simsek,U,19990509,TUR
221676,Ludmila Magalí,Alvez,R,20011129,ARG

(when I correct the file as above I can parse the data with Python3's csv.reader).
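
Until the file is corrected, a possible workaround is to decode the bytes leniently before handing the text to csv.reader. The sketch below is an assumption on my side, not something the repository provides: `read_rows` is a hypothetical helper, and the sample record uses an illustrative `\xed` byte in place of whatever the file really contains. With `errors="replace"`, each undecodable byte becomes U+FFFD instead of raising `UnicodeDecodeError`.

```python
import csv
import io


def read_rows(data: bytes, encoding: str = "utf-8"):
    """Parse CSV bytes, replacing undecodable bytes with U+FFFD."""
    text = data.decode(encoding, errors="replace")
    # newline="" leaves line endings untouched, as the csv module recommends.
    return list(csv.reader(io.StringIO(text, newline="")))


# Hypothetical record with one byte that is not valid UTF-8.
broken = b"212305,Joselyn Margarita,Treyes Albarrac\xedn,,19970629,ECU\n"

try:
    broken.decode("utf-8")  # strict decoding, which is what fails for me
    strict_ok = True
except UnicodeDecodeError:
    strict_ok = False

rows = read_rows(broken)
print(strict_ok, rows[0])
```

The damaged names stay damaged, but every other field of the record remains usable.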

On the other hand, it looks like the rest of the data is strictly ASCII, so maybe it should be:

212305,Joselyn Margarita,Treyes Albarracin,,19970629,ECU
215238,Selin Gulseren,Simsek,U,19990509,TUR
221676,Ludmila Magali,Alvez,R,20011129,ARG
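
If the file is meant to stay strictly ASCII, accented names can be folded mechanically. This is only a sketch of one common approach (Unicode NFKD decomposition followed by dropping the combining marks), not necessarily how the file is maintained; `to_ascii` is a hypothetical helper name.

```python
import unicodedata


def to_ascii(name: str) -> str:
    """Strip diacritics by decomposing characters and discarding non-ASCII parts."""
    decomposed = unicodedata.normalize("NFKD", name)
    # After NFKD, "í" is "i" plus a combining accent; the accent is dropped here.
    return decomposed.encode("ascii", "ignore").decode("ascii")


for name in ["Treyes Albarracín", "Selin Gülseren", "Ludmila Magalí"]:
    print(to_ascii(name))
```

Note that characters without an ASCII base letter (for example "ø" or "ß") are removed entirely by this method, so it would need a manual mapping for such names.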

Regards ፨ gb

JeffSackmann commented 4 years ago

Thanks, now fixed. I must have changed the non-ASCII chars of the names mentioned in the issue at some point in the past, but I seem to have changed them to the wrong ASCII chars. Better now.