alan-turing-institute / TuringDataStories

TuringDataStories: An open community creating “Data Stories”: A mix of open data, code, narrative 💬, visuals 📊📈 and knowledge 🧠 to help understand the world around us.

[Review] [Turing Data Story] Baseball story #127

Closed crangelsmith closed 2 years ago

crangelsmith commented 3 years ago

Story Review:

Story Name: Baseball story

Submitting Author: @edaub (Eric Daub)

Pull Request: #126

Reviewers: @jack89roberts @nbarlowATI

Reviewer instructions & questions

@jack89roberts, @nbarlowATI, please carry out your review in this issue by updating the checklist below (you can copy it in a comment and fill it individually), and writing new comments in case you have any questions. If you cannot edit the checklist please:

If you have any questions, concerns or suggestions regarding the review process, please let @crangelsmith, @DavidBeavan or @samvanstroud know.

✨ Please start on your review when you are able, and be sure to complete your review in the next six weeks, at the very latest ✨

Review Checklist

Code of conduct

General checks

Reproducibility

Pedagogy

Context

Ethical

AOB

jack89roberts commented 3 years ago

Code of conduct

General checks

Reproducibility

Pedagogy

Context

Ethical

AOB

My answer for all of these is yes 😄 , but I'll post some more detailed comments and thoughts below that could be integrated.

jack89roberts commented 3 years ago

Great job @edaub, I really enjoyed reading the story! In response to your specific questions I believe the approach looks reasonable and you definitely do a good job explaining baseball, the queries and the Bayesian methods. I have written some comments/thoughts below on a few bits and pieces I either struggled to get my head around or thought could be expanded upon, should you wish to do so (sorry it got quite long!) Nice work!

edaub commented 3 years ago

Thanks for the helpful comments, @jack89roberts. I won't get back to this for a while, but I'll be sure to address these in May when I return to make revisions!

nbarlowATI commented 3 years ago

Excellent work @edaub - I really enjoyed reading this (even if I disagree with the initial croissant premise - see below! :) ) and found it very well explained, interesting and informative!

Review Checklist

Code of conduct

General checks

Reproducibility

Encountered problems while solving. Problem: nothing provides requested sklearn


- [x] Are all data sources openly accessible and properly cited with a link?
- [x] Are the data [open](https://opendatahandbook.org/guide/en/what-is-open-data/), and do they have an explicit licence, provenance and attribution?

#### Pedagogy
- [x] Does the story demonstrate some specific data analysis or visualisation techniques?
- [x] Are these techniques well motivated?
- [x] Are these techniques well implemented?
- [x] Is the notebook well documented, using both markdown cells and comments in code cells?
- [x] Does the notebook have an introduction section motivating the story?
- [x] Does the notebook have a conclusion section discussing the main insight from the story?
- [x] Is the paper well written (it does not require editing for structure, language, or writing quality)?

#### Context
- [x] Does the story give an insight into some societal issue?
  - I think it is reasonable to describe a popular professional sport as a societal issue.
- [x] Is the context around this issue well referenced (newspaper articles, scientific papers, etc.)?
  - Not so many references, but I think it is well enough explained and motivated.

#### Ethical 

- [x] Is any linkage of datasets in the story unlikely to lead to an increased risk of the personal identification of individuals?
- [x] Is the Story truthful and clear about any limitations of the analysis (and potential biases in data)?
- [x] Is the Story unlikely to lead to negative social outcomes, such as (but not limited to) increasing discrimination or injustice?

## AOB
- Baseball is not as well known in the UK as in the US, Latin America, and Japan. Did I clarify enough about the game and what the statistics that I used mean?

This is tricky...  I can definitely see the risk of giving more detail, and burying the reader in too much information.  However, leaving out some detail does risk things becoming confusing.  For example, I think the difference between "At Bat" and "Plate Appearance" is not really obvious here without looking at the SQL query.  

I would be a bit tempted to go all-in on the basic explanation.  Something like:
 - Every time a batter comes to the plate to face the pitcher, this is a "Plate Appearance"
 - The pitcher will throw the ball, and one of the following will happen:
    - Batter hit by pitch (HBP) - advances to first base.
    - Batter doesn't swing at the ball, ball is outside the "strike zone" - this is a "ball".  Pitcher will throw again. If there are four balls in a Plate Appearance, this is a "Walk" (or "base-on-balls") (BB), batter advances to first base.
    - Batter doesn't swing at the ball, ball is inside the "strike zone", OR the batter swings at the ball and misses - this is a "Strike".  Pitcher will throw again.   If there are three Strikes in a Plate Appearance, the batter is out.
etc. etc. The explanation could then also mention "batting average", as this is, as far as I can see, the main motivation for having "At Bat" as a separate stat from "Plate Appearance"?

Maybe this is too much information though - it's definitely a matter of opinion, so I wouldn't argue the case too strongly.
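
To make the "At Bat" versus "Plate Appearance" distinction concrete, here is a minimal sketch using the standard definitions of batting average and on-base percentage. The class, field names and season totals are invented for illustration and are not taken from the story's code or data; catcher's interference is ignored for simplicity.

```python
from dataclasses import dataclass


@dataclass
class BattingLine:
    """Season totals for one batter (hypothetical field names)."""
    plate_appearances: int
    hits: int
    walks: int            # BB
    hit_by_pitch: int     # HBP
    sacrifice_flies: int  # SF
    sacrifice_bunts: int  # SH

    @property
    def at_bats(self) -> int:
        # At Bats exclude plate appearances ending in a walk, a hit by
        # pitch or a sacrifice (catcher's interference is ignored here).
        return (self.plate_appearances - self.walks - self.hit_by_pitch
                - self.sacrifice_flies - self.sacrifice_bunts)

    @property
    def batting_average(self) -> float:
        # Hits per At Bat: walks neither help nor hurt this number.
        return self.hits / self.at_bats

    @property
    def on_base_percentage(self) -> float:
        # Counts walks and HBP as successes, unlike batting average.
        reached = self.hits + self.walks + self.hit_by_pitch
        chances = (self.at_bats + self.walks + self.hit_by_pitch
                   + self.sacrifice_flies)
        return reached / chances


# Made-up season totals, purely for illustration.
line = BattingLine(plate_appearances=600, hits=150, walks=60,
                   hit_by_pitch=5, sacrifice_flies=5, sacrifice_bunts=0)
print(line.at_bats)                       # 530
print(round(line.batting_average, 3))     # 0.283
print(round(line.on_base_percentage, 3))  # 0.358
```

The same 600 plate appearances give quite different pictures depending on the denominator, which is why a short note on the two stats might help a reader new to the game.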

I also didn't quite understand the point about the best players being overvalued because they cannot get on base for themselves...   Is this because although they might get to e.g. 2B, the batters behind them will be worse, so this is less likely to result in a Run?
I would have thought that in this case it's not the level of the player relative to the rest of his team that's important, but relative to the whole league?  (And couldn't this be countered by the fact that the best teams have lots of good players, so good players are more likely to have good team-mates who can bat them in?)

In general though I think the explanations are good, but it might be nice to tell the reader in advance that they are coming - e.g. just before Cell 4, a quick sentence saying that all the acronyms will be explained in the following section.

- [x] Are the SQL queries to get the data not too hard to understand for a novice?
The only tricky one is Cell 5, with the "coalesce" commands - the explanation is good, but I'm torn as to whether this should go before or after the cell itself.    It might also be good to have a sentence here explaining why this command filters out the pitchers - I'm not sure I quite understand that.
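
For readers unfamiliar with `COALESCE`, a minimal sketch of what the function does in general may help. This is not the story's Cell 5 query: the table, column names and values below are made up, and the example only shows how a `NULL` propagates through arithmetic unless it is replaced by a default.

```python
import sqlite3

# In-memory database with a hypothetical table; not the story's schema.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE batting (player TEXT, hits INTEGER, sac_flies INTEGER)"
)
conn.executemany(
    "INSERT INTO batting VALUES (?, ?, ?)",
    [("a", 150, 5), ("b", 120, None)],  # player "b" has a missing value
)

rows = conn.execute(
    """
    SELECT player,
           hits + sac_flies              AS raw_total,   -- NULL if any input is NULL
           hits + COALESCE(sac_flies, 0) AS safe_total   -- NULL replaced by 0
    FROM batting
    ORDER BY player
    """
).fetchall()
conn.close()

print(rows)  # [('a', 155, 155), ('b', None, 120)]
```

`COALESCE` returns its first non-NULL argument, so missing values are treated as zero rather than turning the whole sum into `NULL`.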

- [x] Is my approach reasonable? There are a few numbers I tweaked to get the stats to line up well with the standard estimates (percentiles used to quantify a "full season" and how many standard deviations below the average in the final estimate), so if these seem like egregiously unreasonable numbers to use then please let me know.
- [x] Did I cover the Bayesian methods adequately?
  - Yes, really good explanations I think (and useful code snippets!)
- [x] Is my coding style reasonable and understandable? I try to write things mostly in functions out of habit, which isn't always straightforward to understand in a literate tool like a notebook.
  - Yes, I think this is excellent, and having everything in functions is useful for any reader who might want to transplant bits and pieces into their own code!
- [x] Everything looks ok?

- I think the plot that is the output of Cell 13, comparing the actual and simulated runs created over a whole season, would benefit from a legend.
- There's a missing half-sentence just after this: "so these are not useful in evaluating if".

## AOAOB
I actually think that Pret A Manger croissants (the basic "all butter" ones) are better than pretty much any other croissants one can find outside of France!

edaub commented 3 years ago

@nbarlowATI Thanks for the helpful comments -- and agreed that if Pret croissants are replacement level then we're doing pretty well here 😄

crangelsmith commented 3 years ago

I think this story is almost ready to be published. @edaub, @billfinnegan has left you some final comments after proofreading it. If you implement any more changes based on those comments, let us know when it is ready and we'll do the rest.

billfinnegan commented 3 years ago

Sorry - put my notes in #126 - pasted again below...

Really great data story - the writing style is clear and accessible, and I like how the intro sets up the concept outside of the details of baseball. Here are a few notes based on a proofread:

Let me know if you need me to clarify any of this, and feel free to ignore anything!