BuzzFeedNews / 2016-01-tennis-betting-analysis

Methodology and code supporting the BuzzFeed News/BBC article, "The Tennis Racket," published Jan. 17, 2016.
http://www.buzzfeed.com/heidiblake/the-tennis-racket

Data can be trivially de-anonymised #1

Open timbennett opened 8 years ago

timbennett commented 8 years ago

The data and analysis description contain enough information to de-anonymise players, matches and bookmakers. I will not disclose the method, but the repo maintainer can contact me via email to check it. The horse has probably bolted as far as mitigating this issue goes.

ppaulojr commented 8 years ago

I thought the same. Any journalist with scripting knowledge and a little patience could de-anonymise players. I don't see any mitigation strategy at this point.

jaypinho commented 8 years ago

@timbennett @ppaulojr This would be true, except that the dataset BuzzFeed used to produce this study is described only vaguely. Nowhere in the article, this repo, or the supplementary piece are the criteria for match selection fully detailed.

The closest we get is a reference to a list of 25,993 matches (as mentioned in the README). But other than specifying that this includes ATP and Grand Slam matches in the 2009-2015 period, we know little else about how this data was collected.

After taking the file and aggregating individual player wins and losses by year, the only conclusion I arrived at with reasonable certainty is the true identity of anonymized ID 2ed14b47b1c58532b757d76404dcf1a114b712e50193f0b0a5a05f52e3067134. The others' W-L records were (at times) similar to publicly available W-L data, but (at least in the few hours I spent on this) not immediately verifiable.
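As a rough illustration of the aggregation described above, here is a minimal Python sketch. It is not the script I actually ran: the field names `year`, `winner_id` and `loser_id`, the shortened IDs, and the "public" records are all invented for the example, and the repo's real file may be laid out differently.

```python
from collections import defaultdict


def yearly_records(matches):
    """Tally wins and losses per (player_id, year).

    `matches` is an iterable of dicts with the hypothetical keys
    'year', 'winner_id' and 'loser_id'.
    """
    records = defaultdict(lambda: {"wins": 0, "losses": 0})
    for match in matches:
        records[(match["winner_id"], match["year"])]["wins"] += 1
        records[(match["loser_id"], match["year"])]["losses"] += 1
    return dict(records)


def matching_players(anon_record, public_records, tolerance=0):
    """Return the names whose public (wins, losses) for the same year
    fall within `tolerance` of the anonymised record."""
    return sorted(
        name
        for name, (wins, losses) in public_records.items()
        if abs(wins - anon_record["wins"]) <= tolerance
        and abs(losses - anon_record["losses"]) <= tolerance
    )


# Toy data, invented for illustration only.
sample_matches = [
    {"year": 2009, "winner_id": "aaa", "loser_id": "bbb"},
    {"year": 2009, "winner_id": "aaa", "loser_id": "ccc"},
    {"year": 2009, "winner_id": "bbb", "loser_id": "aaa"},
    {"year": 2010, "winner_id": "ccc", "loser_id": "aaa"},
]
public_2009 = {"Player A": (2, 1), "Player B": (1, 1), "Player C": (5, 3)}

records = yearly_records(sample_matches)
exact = matching_players(records[("aaa", 2009)], public_2009)
loose = matching_players(records[("aaa", 2009)], public_2009, tolerance=1)
```

The `tolerance` parameter reflects the problem I ran into: if the file holds only a subset of each player's matches, the tallies will not line up exactly with public records, and loosening the comparison quickly produces several candidates per ID.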

The lack of clarity around the dataset raises the question of which matches were included and which were left out. I was unable to discern any consistent criteria.