alan-turing-institute / TuringDataStories

TuringDataStories: An open community creating “Data Stories”: A mix of open data, code, narrative 💬, visuals 📊📈 and knowledge 🧠 to help understand the world around us.

[Turing Data Story] Baseball Statistics #122

Closed edaub closed 2 years ago

edaub commented 3 years ago

Story description

Given my love of baseball and statistics (my dog-eared copy of Moneyball is still a favorite to re-read from time to time) I am interested in writing a story using baseball statistics. Still thinking about the exact details, but it will use the Lahman Database, so for the moment this is a placeholder to record my interest.

Ethical guideline

Ideally a Turing Data Story has these properties and follows the 5 safes framework.

Current status

Updates

crangelsmith commented 3 years ago

This sounds amazing @edaub 🤩. Please go ahead when you have the time and let us know if we can help in any way.

edaub commented 3 years ago

Update following today's discussion with @jack89roberts at the REG Tech Talk:

I spent the afternoon digging further into the SQL database holding the stats. I extracted stats for several years and converted them into a more useful form (baseball has a strange definition of an at-bat that leaves out some trips to the plate depending on the outcome, which is decidedly not the correct way to handle it), then fit some Bayesian models based on a multinomial distribution to estimate the league average. The idea is to do some kind of Bayesian analysis to measure marginal utility (known as "replacement level" in baseball parlance) with uncertainty bounds, and then to compare players to replacement level and to compare replacement levels across different eras. Fortunately, at least in simple cases, this can be done analytically, so hopefully I'll be able to avoid any heavy computation in this analysis.
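
As a rough illustration of why the simple case is analytic: a multinomial likelihood with a Dirichlet prior gives a Dirichlet posterior, so the update is just adding counts. The sketch below is a minimal example under assumptions and not the code from the story itself. The outcome categories, the flat prior, the function names, and the counts are all made up for illustration, and the plate appearance formula is the usual approximation from Lahman-style batting columns (AB, BB, HBP, SH, SF), which ignores rarer events such as catcher's interference.

```python
from math import sqrt


def plate_appearances(ab, bb, hbp, sh, sf):
    """Approximate plate appearances from Lahman-style batting columns.

    At-bats exclude walks, hit-by-pitches and sacrifices, so they are
    added back in here. Rare events (e.g. catcher's interference) are
    ignored in this sketch.
    """
    return ab + bb + hbp + sh + sf


def dirichlet_posterior(counts, prior=None):
    """Conjugate update: Dirichlet prior + multinomial counts.

    Returns the posterior Dirichlet parameters (prior + counts).
    A flat prior of 1.0 per category is an assumption of this sketch.
    """
    if prior is None:
        prior = {outcome: 1.0 for outcome in counts}
    return {outcome: prior[outcome] + counts[outcome] for outcome in counts}


def posterior_summary(alpha):
    """Posterior mean and standard deviation of each outcome probability.

    Each marginal of a Dirichlet is a Beta distribution, so both
    quantities have closed forms and need no sampling.
    """
    total = sum(alpha.values())
    summary = {}
    for outcome, a in alpha.items():
        mean = a / total
        sd = sqrt(a * (total - a) / (total ** 2 * (total + 1)))
        summary[outcome] = (mean, sd)
    return summary


# Hypothetical league-wide outcome counts per plate appearance (made up).
counts = {
    "out": 680,
    "walk": 90,
    "single": 150,
    "double": 45,
    "triple": 5,
    "home_run": 30,
}

posterior = dirichlet_posterior(counts)
summary = posterior_summary(posterior)

for outcome, (mean, sd) in summary.items():
    print(f"{outcome:>8}: {mean:.4f} +/- {sd:.4f}")
```

The posterior standard deviations are one way to get the uncertainty bounds; a different prior (for example, one built from earlier seasons) would slot into the `prior` argument without changing anything else.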

This is fairly standard stuff for baseball stat heads, but there doesn't seem to be much use of Bayesian inference in the popular (or even academic) literature, so I think this is a fairly novel approach.

edaub commented 3 years ago

PR #126 is now ready for review!

edaub commented 3 years ago

Regarding personal data, I believe that baseball statistics, though identifiable to a particular person, aren't personal data as the games are broadcast to the public. Probably worth double checking that the ethics folks agree with this!