JeffSackmann / tennis_wta

WTA Tennis Rankings, Results, and Stats

Encoding problem of wta_players.csv #30

Closed benjdv closed 3 years ago

benjdv commented 3 years ago

Some characters in wta_players.csv are not valid UTF-8.

222342,Manuela,Zegarra Ball�N,U,,PER

The real player seems to be https://www.wtatennis.com/players/329659/manuela-zegarra-ball-n => Manuela Zegarra-Ballón

The invalid UTF-8 sequence is here: e3 93 4e. But I think the damaged part is really just e3 93, because 4e is the correct "N" at the end of the name (I don't know why it's uppercase). The name contains an o with an accent, which in uppercase (Ó) is encoded in UTF-8 as c3 93.

I suppose the problem comes from multiple encoding operations/conversions done with the wrong encoding choice. Maybe e3 comes from c3 after a transform to lower case in Latin-1: instead of lower(c3 93) => ó, the original lower('Ó') did lower(c3) lower(93) => e3 93.
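The hypothesis above can be checked with a few lines of Python. This is only a sketch of one way the bytes e3 93 could arise, not a claim about how the file was actually produced. The helper name `lower_as_latin1` is made up for illustration, and the title-casing step at the end is an extra guess that goes beyond what was observed.

```python
def lower_as_latin1(utf8_bytes: bytes) -> bytes:
    """Lowercase UTF-8 bytes as if they were Latin-1 text (the suspected bug)."""
    return utf8_bytes.decode("latin-1").lower().encode("latin-1")


original = "Ó".encode("utf-8")         # c3 93
corrupted = lower_as_latin1(original)  # c3 is "Ã" in Latin-1, which lowercases to e3
print(original.hex(" "), "->", corrupted.hex(" "))

# e3 opens a three-byte UTF-8 sequence, so e3 93 4e cannot decode:
# 4e ("N") is not a continuation byte.
try:
    (corrupted + b"N").decode("utf-8")
    decodes = True
except UnicodeDecodeError as exc:
    decodes = False
    print("invalid UTF-8:", exc.reason)

# Assumption beyond the observation above: if the source name was all caps and
# was title-cased as Latin-1, byte 93 would act as a word break, which might
# also explain the trailing uppercase "N".
titled = "BALLÓN".encode("utf-8").decode("latin-1").title().encode("latin-1")
print(titled)
```

If this guess is right, the corruption is lossy only in one byte per accented capital, so the original character can often be recovered by mapping the lead byte back (e3 to c3 here), but that would need checking name by name.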

There are many other invalid UTF-8 sequences in this file.
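To list the other affected rows, something like the following could be used. It is a minimal sketch: the function name `find_invalid_utf8` is hypothetical, and the second row of the sample data is a made-up placeholder, not a real record from the file.

```python
from typing import Iterator, Tuple


def find_invalid_utf8(data: bytes) -> Iterator[Tuple[int, int, bytes]]:
    """Yield (line_number, byte_offset_in_line, raw_line) for lines that are not valid UTF-8."""
    for line_number, raw_line in enumerate(data.splitlines(), start=1):
        try:
            raw_line.decode("utf-8")
        except UnicodeDecodeError as exc:
            # exc.start is the offset of the first byte that failed to decode
            yield line_number, exc.start, raw_line


# First row uses the bytes reported in this issue; second row is a placeholder.
sample = b"222342,Manuela,Zegarra Ball\xe3\x93N,U,,PER\n1,Valid,Name,U,,XXX\n"
bad_lines = list(find_invalid_utf8(sample))
for line_number, offset, raw_line in bad_lines:
    print(line_number, offset, raw_line)

# Against the real file, assuming it sits in the current directory:
#   from pathlib import Path
#   for hit in find_invalid_utf8(Path("wta_players.csv").read_bytes()):
#       print(hit)
```

Reading the file as bytes and decoding line by line avoids the whole read failing at the first bad sequence.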

JeffSackmann commented 3 years ago

fixed