alan-turing-institute / TuringDataStories

TuringDataStories: An open community creating “Data Stories”: A mix of open data, code, narrative 💬, visuals 📊📈 and knowledge 🧠 to help understand the world around us.

Baseball Story #126

Closed edaub closed 3 years ago

edaub commented 3 years ago

Summary

Analysis using the Lahman Baseball Database to quantify what "replacement level" is for major league hitters using Bayesian inference. Replacement level refers to a hypothetical player who is cheap and easy to find; the aim is then to determine how much better a given player is than this hypothetical one. Replacement level for hitters is usually estimated as 80% of the league average (see for instance Wikipedia), but it's unclear what this actually means in precise statistical terms. This story uses a Bayesian model to quantify this in detail and estimates the level over the course of baseball history.

Fixes #122

Potentially good reviewers might be @jack89roberts or @nbarlowATI

List of changes proposed in this PR (pull-request)

What should a reviewer concentrate their feedback on?

Acknowledging contributors

review-notebook-app[bot] commented 3 years ago

Check out this pull request on ReviewNB

See visual diffs & provide feedback on Jupyter Notebooks.


Powered by ReviewNB

jack89roberts commented 3 years ago

Nice, I'm happy to take a look through 🙂

nbarlowATI commented 3 years ago

I'm also very happy to review!

crangelsmith commented 3 years ago

Review issue created (#127) and assigned to @jack89roberts and @nbarlowATI.

crangelsmith commented 3 years ago

@all-contributors please add @jack89roberts for review

allcontributors[bot] commented 3 years ago

@crangelsmith

I've put up a pull request to add @jack89roberts! :tada:

edaub commented 3 years ago

Thanks for the comments, @jack89roberts. I think I've addressed most of them in an edit of the story; see below for specifics.

  • Baseball in general
    • A basic diagram of a baseball field/diamond may be helpful (annotated with terms used in the analysis like S, 2B, 3B, HR where possible)?
    • The definitions are in the text and the data dictionary, but at times I would have found a glossary of terms helpful, mostly of things like AB, BB, HBP, SH, SF, PA etc. that appear either in queries or data frames (maybe it could go in the data section at the beginning).

I added a "Primer on Baseball" section to try and describe this a bit better. I'll see about adding an illustration of a baseball diamond. I also included a table of the stats plus my derived stats to try and help clarify things as suggested.

  • SQL queries
    • I think you do a good job of explaining what the queries are doing. The one thing I find a bit tricky is the calculation of fields in rc_base_query, e.g. it may not be obvious why O = AB + SH - SF - H.

I added a bit in the table above to describe why this is the number of outs a player made (outs are at bats minus hits, but you need to add in the sacrifices, as those are outs that are not counted as at bats).

  • Runs created:
    • I'm not sure I fully understand why this is a good metric/the way it is defined. Naively, I'd expect it to be TB / 4, or TB / (4 * PA) if it was a rate stat rather than a counting stat. What's the purpose of the success rate term?

I've added a bit more discussion of this to try and improve intuition, but ultimately I think the answer is that this is mostly empirical and it seems to work. I've walked the user through an example, but provided the caveat that this is not actual production, but should be thought of as a typical result when averaging over all possible scenarios.
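For anyone following the thread without the notebook open, here is a minimal sketch of the basic form of Bill James's Runs Created, RC = (H + BB) × TB / (AB + BB). This is an assumption about the general shape of the stat, not necessarily the exact variant used in the story. The function name and the season line are hypothetical. The first factor is the "success rate" (how often the batter reaches base), and total bases measures how far the hits advance.

```python
def runs_created_basic(h, bb, tb, ab):
    """Basic Runs Created: success rate (times on base per opportunity) times total bases."""
    opportunities = ab + bb
    if opportunities == 0:
        return 0.0
    success_rate = (h + bb) / opportunities
    return success_rate * tb


# Hypothetical season line, for illustration only.
rc = runs_created_basic(h=150, bb=50, tb=250, ab=500)
naive = 250 / 4  # the "TB / 4" guess from the review comment
print(f"basic RC = {rc:.1f}, naive TB / 4 = {naive:.1f}")
```

The two numbers differ because the success-rate term rewards reaching base by any means, including walks, which TB / 4 ignores.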

  • My instinct was that TB counts only the number of bases the batter advanced from their initial hits (not bases advanced by teammates at the same time, or bases subsequently gained during teammates' hits). Is that correct? What made me unsure were notes like the ones below, as I'm not sure how they influence RC unless TB is incorporating more than I thought:
    • hits tend to advance any teammates already on the bases a different number of bases

    • Extending this to be more accurate requires factoring in base running

    • being more precise in how the runs each team scores can be attributed to each player (i.e. the "a player cannot be on base for themselves" effect).

      • Edit: On reflection, is one nuance that a hitter might have slower teammates on bases in front of them, which limits how many bases they can advance?

You are mostly right in all of this. Total bases is only from a player's hits, not any subsequent advancement or any advancement made by teammates (I've tried to clarify this).

  • Did I cover the Bayesian methods adequately?
    • At the start of "Multinomial Bayesian Inference" is there maybe a more layman's terms way the posterior could be introduced/defined (currently: "and seek to compute the posterior probability conditioned on the data")?

I tried to use simpler terms to describe what we are doing here. Hopefully this is clearer.

  • Is it worth discussing why the prior you've selected is reasonable a bit more (or revisiting that later when your results show it was reasonable)?

I added some clarification that these are subjective priors based on my personal understanding of the game of baseball. I also added a sentence about how this prior is probably not influencing the final results in the end.
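To illustrate why the prior is unlikely to matter much once there is a full season of data, here is a small sketch of the standard Dirichlet-multinomial update, where the posterior is a Dirichlet with parameters equal to the prior parameters plus the observed counts. The outcome categories, counts and both priors below are hypothetical, not the values used in the story.

```python
def dirichlet_posterior_mean(prior_alpha, counts):
    """Posterior mean of the outcome probabilities: (alpha_i + n_i) / sum(alpha + n)."""
    posterior_alpha = [a + n for a, n in zip(prior_alpha, counts)]
    total = sum(posterior_alpha)
    return [a / total for a in posterior_alpha]


# Hypothetical outcomes per plate appearance: out, walk, single, double, triple, home run.
counts = [400, 50, 100, 30, 5, 15]                 # 600 plate appearances
flat_prior = [1.0] * 6                             # weak, uninformative prior
subjective_prior = [34.0, 4.0, 8.0, 2.0, 0.5, 1.5]  # worth about 50 plate appearances

flat_mean = dirichlet_posterior_mean(flat_prior, counts)
subjective_mean = dirichlet_posterior_mean(subjective_prior, counts)
largest_gap = max(abs(a - b) for a, b in zip(flat_mean, subjective_mean))
print(f"largest difference between posterior means: {largest_gap:.4f}")
```

With 600 observations, two quite different priors give posterior means that agree to within about one percentage point in every category.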

  • Simulating League Seasons
    • Could you add a legend to the plot in "Simulating League Seasons" comparing the simulated and empirical distributions?
    • There's a sentence that's not finished:
    • Others within that peak near zero have not provided evidence that they are good, so these are not useful in evaluating if

Done, and fixed that sentence.
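For context on what "simulating a season" can look like, here is a rough sketch that draws one outcome per plate appearance from a fixed set of outcome probabilities. It is an illustration under stated assumptions, not the notebook's code: the outcome categories, probabilities, seed and function name are all hypothetical.

```python
import random
from collections import Counter

OUTCOMES = ["out", "walk", "single", "double", "triple", "home_run"]


def simulate_season(probs, plate_appearances, rng):
    """Draw one outcome per plate appearance and tally the results."""
    draws = rng.choices(OUTCOMES, weights=probs, k=plate_appearances)
    tally = Counter(draws)
    return {outcome: tally.get(outcome, 0) for outcome in OUTCOMES}


rng = random.Random(0)  # fixed seed so the sketch is reproducible
season = simulate_season([0.66, 0.08, 0.17, 0.05, 0.01, 0.03], 600, rng)
print(season)
```

Repeating this for many players and seasons gives a simulated distribution that can be plotted against the empirical one, which is where the legend becomes useful.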

  • Gaussian mixture model:
    • At first I didn't fully comprehend why the GMM was needed; my instinct was you could have just calculated the 2.5% quantile directly from the sample. What I missed is that you're calculating the 2.5% quantile of the second Gaussian only, not the full distribution. Maybe this could be emphasised a bit more in the text?

I tried to be more explicit as to why we need to separate out the two distributions.
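The distinction Jack raises can be sketched as follows. This is a toy example on synthetic data, with a hand-rolled EM fit standing in for whatever GMM implementation the notebook uses. All numbers and names are hypothetical. It fits two Gaussians, then compares the 2.5% quantile of the upper component alone with the 2.5% quantile of the whole sample.

```python
import math
import random
from statistics import NormalDist


def fit_two_gaussians(data, iterations=200):
    """Fit a two-component 1-D Gaussian mixture with plain EM.

    Returns [weight, mean, sd] for each component.
    """
    ordered = sorted(data)
    half = len(ordered) // 2
    params = []
    # Initialise one component from each half of the sorted sample.
    for chunk in (ordered[:half], ordered[half:]):
        mu = sum(chunk) / len(chunk)
        var = sum((x - mu) ** 2 for x in chunk) / len(chunk)
        params.append([0.5, mu, math.sqrt(var) + 1e-6])

    for _ in range(iterations):
        dists = [NormalDist(mu, sd) for _, mu, sd in params]
        # E-step: probability that each point belongs to the second component.
        resp = []
        for x in ordered:
            p0 = params[0][0] * dists[0].pdf(x)
            p1 = params[1][0] * dists[1].pdf(x)
            total = p0 + p1
            resp.append(p1 / total if total > 0 else 0.5)
        # M-step: re-estimate weight, mean and sd of each component.
        for k in (0, 1):
            r = resp if k == 1 else [1.0 - v for v in resp]
            n_k = sum(r)
            mu = sum(ri * x for ri, x in zip(r, ordered)) / n_k
            var = sum(ri * (x - mu) ** 2 for ri, x in zip(r, ordered)) / n_k
            params[k] = [n_k / len(ordered), mu, math.sqrt(var) + 1e-6]
    return params


# Synthetic stand-in data: a peak near zero plus a broader peak of regular players.
rng = random.Random(42)
sample = [rng.gauss(5, 3) for _ in range(400)] + [rng.gauss(70, 15) for _ in range(600)]

params = fit_two_gaussians(sample)
upper = max(params, key=lambda p: p[1])
q_upper = NormalDist(upper[1], upper[2]).inv_cdf(0.025)  # 2.5% quantile of the upper Gaussian only
q_full = sorted(sample)[int(0.025 * len(sample))]        # 2.5% quantile of the full sample
print(f"upper component: {q_upper:.1f}, full sample: {q_full:.1f}")
```

The full-sample quantile lands inside the peak near zero, so it says nothing about the regular players; the component quantile is the one that plays the role of a replacement-level estimate.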

  • Replacement runs created:
    • In the summary at the end, perhaps you could comment on how replacement RC could be used by a team, e.g. to identify over-/undervalued players, I think? Maybe out of scope for the story, but could you pick out a player that's much better/worse than their reputation suggests, for example?

I think I could definitely do this, but I felt this was already a bit long, so I think I'm going to leave it as it is. That seems to be a good idea for a follow-on, though!

  • Is my coding style reasonable and understandable?
    • In runs_created should line 12 df.ndim == 1 be season_stats.ndim == 1 (or if not should df be an argument to that function rather than relying on using the variable defined earlier in the notebook)?

Yes, thanks for catching that.

  • In project_player_season you call an argument prior instead of priors which is used in other places/functions. I think the prior in project_player_seasons is identical to the priors used elsewhere, so maybe it could be changed to priors as well to be consistent?

Yep, agree that is clearer.

  • References/links:
    • It might be nice to add a few more links to wider context around baseball and baseball stats, including to things mentioned in the text like "Moneyball by Michael Lewis", "legendary baseball statistician Bill James", "'dead ball' era prior to 1920", "the late 1990's shows a spike during the 'steroid' era", and also to terms related to the analysis like "Beta distribution", "Dirichlet distribution" etc.

I've added some links to Wikipedia pages for context.

edaub commented 3 years ago

Thanks @nbarlowATI for the comments. Some overlap with Jack's, but I note those changes here too. Let me know if I missed anything or need to go into more detail anywhere.

  • Baseball is not as well known in the UK as in the US, Latin America, and Japan. Did I clarify enough about the game and what the statistics that I used mean?

This is tricky... I can definitely see the risk of giving more detail, and burying the reader in too much information. However, leaving out some detail does risk things becoming confusing. For example, I think the difference between "At Bat" and "Plate Appearance" is not really obvious here without looking at the SQL query.

I would be a bit tempted to go all-in on the basic explanation. Something like:

  • Every time a batter comes to the plate to face the pitcher, this is a "Plate Appearance"
  • The pitcher will throw the ball, and one of the following will happen:
    • Batter hit by pitch (HBP) - advances to first base.
    • Batter doesn't swing at the ball, ball is outside the "strike zone" - this is a "ball". Pitcher will throw again. If there are four balls in a Plate Appearance, this is a "Walk" (or "base-on-balls") (BB), batter advances to first base.
    • Batter doesn't swing at the ball, ball is inside the "strike zone", OR the batter swings at the ball and misses - this is a "Strike". Pitcher will throw again. If there are three Strikes in a Plate Appearance, the batter is out.
    • etc. etc.

Could also then mention "batting average", as this is, as far as I can see, the main motivation for having "At Bat" as a separate stat from "Plate Appearance"?

Maybe this is too much information though - it's definitely a matter of opinion, so I wouldn't argue the case too strongly.

I tried to give a primer in a separate section and defined the symbols in a table as well as in the text. Hopefully this gives those that need context something more elaborate to read, while being easy to skip over if the reader doesn't need it. I tried to do what you did above in prose rather than bullet points, so if it still isn't right let me know.
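As a compact companion to the prose, here is a sketch of how the two counts relate under the usual definitions: a plate appearance is any completed trip to the plate, while an at bat excludes walks, hit-by-pitches and sacrifices. The function names and the stat line are hypothetical, and rarer events such as catcher's interference are ignored.

```python
def plate_appearances(ab, bb, hbp, sh, sf):
    """Plate appearances: at bats plus walks, hit-by-pitches and sacrifices."""
    return ab + bb + hbp + sh + sf


def batting_average(h, ab):
    """Hits per at bat; walks and sacrifices do not count against the batter."""
    return h / ab if ab else 0.0


# Hypothetical season line, for illustration only.
pa = plate_appearances(ab=500, bb=50, hbp=5, sh=2, sf=3)
avg = batting_average(h=150, ab=500)
print(f"PA = {pa}, AVG = {avg:.3f}")
```

This is why batting average uses at bats as its denominator: drawing a walk neither raises nor lowers it.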

I also didn't quite understand the point about the best players being overvalued because they cannot get on base for themselves... Is this because although they might get to e.g. 2B, the batters behind them will be worse, so this is less likely to result in a Run?

I would have thought that in this case it's not the level of the player relative to the rest of his team that's important, but relative to the whole league? (And couldn't this be countered by the fact that the best teams have lots of good players, so good players are more likely to have good team-mates who can bat them in?)

Runs Created is calibrated based on the assumption that a player has an average number of players on base for their at bats, and the created runs are due to the player's hitting only. So while you are correct that this effect is likely to be there, Runs Created is not trying to capture this.

What I was trying to convey is that the best players get on base more than average, but you can never be on base for yourself, so the best players will have fewer players on base than average when they hit. Thus, their same hits will produce fewer runs than the average person because there are fewer baserunners than average.

Runs Created should really be thought of as "theoretical runs created by a player's hits/walks only, assuming an average number of baserunners." More sophisticated approaches try to assign a fraction of every actual run to different players (i.e. a player hits a double, then a player hits a single to score the run, so both players contributed a fraction of that actual run), but we are not doing that here. Maybe I should spell this out in more detail?

In general though I think the explanations are good, but it might be nice to tell the reader in advance that they are coming - e.g. just before Cell 4, a quick sentence saying that all the acronyms will be explained in the following section.

I tried to do this with a table to make look-up easier.

  • [x] Are the SQL queries to get the data not too hard to understand for a novice? The only tricky one is Cell 5, with the "coalesce" commands - the explanation is good, but I'm torn as to whether this should go before or after the cell itself. It might also be good to have a sentence here explaining why this command filters out the pitchers - I'm not sure I quite understand that.

Yes, I will clarify these and move the explanations up to before the command is introduced.
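For readers new to SQL, the general effect of `COALESCE` can be shown with a small in-memory SQLite table standing in for the Lahman `Batting` table. This is a sketch of the function itself, not the story's actual query: the rows and player IDs are made up, and it does not attempt to reproduce the pitcher filter. Any arithmetic involving `NULL` yields `NULL`, so missing stats need a default before they can be summed.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE Batting (playerID TEXT, yearID INTEGER, "
    "AB INTEGER, BB INTEGER, HBP INTEGER, SH INTEGER, SF INTEGER)"
)
# Hypothetical rows: the first has stats that were not recorded (NULL).
conn.executemany(
    "INSERT INTO Batting VALUES (?, ?, ?, ?, ?, ?, ?)",
    [
        ("early01", 1905, 400, 30, None, 10, None),
        ("modern01", 2005, 500, 60, 5, 1, 4),
    ],
)

rows = conn.execute(
    """
    SELECT playerID,
           AB + BB + HBP + SH + SF AS pa_raw,
           AB + BB + COALESCE(HBP, 0) + COALESCE(SH, 0) + COALESCE(SF, 0) AS pa
    FROM Batting
    ORDER BY yearID
    """
).fetchall()
conn.close()

for player, pa_raw, pa in rows:
    print(player, pa_raw, pa)
```

Without `COALESCE`, the first row's total comes back as `NULL` (`None` in Python); with it, the missing stats are treated as zero.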

  • [x] Is my approach reasonable? There are a few numbers I tweaked to get the stats to line up well with the standard estimates (percentiles used to quantify a "full season" and how many standard deviations below the average in the final estimate), so if these seem like egregiously unreasonable numbers to use then please let me know.

  • [x] Did I cover the Bayesian methods adequately?

  • Yes, really good explanations I think (and useful code snippets!)

  • [x] Is my coding style reasonable and understandable? I try to write things mostly in functions out of habit, which isn't always straightforward to understand in a literate tool like a notebook.

  • Yes, I think this is excellent, and having everything in functions is useful for any reader who might want to transplant bits and pieces into their own code!

  • I think the plot that is the output of Cell 13, comparing the actual and simulated runs created over a whole season, would benefit from a legend.

  • There's a missing half-sentence just after this: "so these are not useful in evaluating if".

Fixed both of these.

edaub commented 3 years ago

I just pushed some more revisions based on some changes to the analysis. I think I am happy with this version, though if someone would be able to proofread before publishing I would appreciate it! Thanks.

billfinnegan commented 3 years ago

I'll give it a proofread before the end of the week...

billfinnegan commented 3 years ago

Really great data story - the writing style is clear and accessible, and I like how the intro sets up the concept outside of the details of baseball. Here are a few notes based on a proofread:

Let me know if you need me to clarify any of this, and feel free to ignore anything!

billfinnegan commented 3 years ago

In terms of a one-sentence summary, does this work: Using Bayesian methods to explore the concept of a "replacement level" baseball player in terms of runs created compared to the league average.

And if we ever want a slightly longer description: To explore marginal utility - how much additional value you get for paying more for something - you need a baseline, whether you are comparing pastries or professional athletes. In baseball, the baseline is referred to as the replacement level, which is generally thought to be about 80% of the league average. This data story investigates over a century of professional baseball data to calculate the "runs created" by every player, and uses Bayesian methods to determine the replacement level for each year.

edaub commented 3 years ago

Thanks @billfinnegan for the helpful style comments -- I implemented most of them or something in the spirit of what you suggested. I also re-worked a couple of code bits that I thought were inelegant. I think this is ready to publish now!

crangelsmith commented 3 years ago

Fantastic @edaub, thank you so much! We will be publishing it on the page in the next week.

Also, we are working with Comms to advertise the stories on the Turing site and they are asking for a one-sentence summary and a longer description. @billfinnegan has kindly agreed to write it for us:

In terms of a one-sentence summary, does this work: Using Bayesian methods to explore the concept of a "replacement level" baseball player in terms of runs created compared to the league average.

And if we ever want a slightly longer description: To explore marginal utility - how much additional value you get for paying more for something - you need a baseline, whether you are comparing pastries or professional athletes. In baseball, the baseline is referred to as the replacement level, which is generally thought to be about 80% of the league average. This data story investigates over a century of professional baseball data to calculate the "runs created" by every player, and uses Bayesian methods to determine the replacement level for each year.

Are you happy with this summary, @edaub? If so I'll send it to Comms.