fivethirtyeight / data

Data and code behind the articles and graphics at FiveThirtyEight
https://data.fivethirtyeight.com/
Creative Commons Attribution 4.0 International

mlb_elo heading legend #194

Closed pschloss closed 6 years ago

pschloss commented 6 years ago

I'm trying to figure out what the column headings mean in the mlb_elo.csv file and am running into a problem deciphering what is in the columns that start with elo* and the rating1_post and rating2_post columns.

date: date of game
season: year of season
neutral: whether game was on a neutral site
playoff: whether game was in playoffs
team1: abbreviation for home team
team2: abbreviation for visiting team
elo1_pre:
elo2_pre:
elo_prob1:
elo_prob2:
elo1_post:
elo2_post:
rating1_pre: ELO rating for home team before deductions
rating2_pre: ELO rating for visiting team before deductions
pitcher1: name of home team pitcher
pitcher2: name of visiting team pitcher
pitcher1_rgs: home team pitcher run score
pitcher2_rgs: visiting team pitcher run score
pitcher1_adj: home team pitcher ELO score adjustment
pitcher2_adj: visiting team pitcher ELO score adjustment
rating_prob1: probability of home team winning based on adjusted ELO scores
rating_prob2: probability of visiting team winning based on adjusted ELO scores
rating1_post: ELO rating for home team after game decision
rating2_post: ELO rating for visiting team after game decision
score1: home team score
score2: visiting team score

Looking at the data from the 2018-07-01 CHC vs MIN game, the CSV file has:

date: 2018-07-01
season: 2018
neutral: 0
playoff: 0
team1: CHC
team2: MIN
elo1_pre:   1548
elo2_pre:   1492
elo_prob1:  0.614
elo_prob2:  0.386
elo1_post:  1549.39538447082
elo2_post:  1490.68996714635
rating1_pre:    1562
rating2_pre:    1491
pitcher1:   Jon Lester 
pitcher2:   Lance Lynn
pitcher1_rgs:   55.1
pitcher2_rgs:   50
pitcher1_adj:   17.1
pitcher2_adj:   -2.43
rating_prob1:   0.659
rating_prob2:   0.341
rating1_post:   1562.80857309143
rating2_post:   1490.58601631338
score1: 11
score2: 10

The elo1_pre and elo2_pre values don't seem to correspond to what is posted on the website. The pregame ELO scores there were 1562 and 1491 (i.e. rating1_pre and rating2_pre). I'm not sure where elo1_pre and elo2_pre came from, nor the elo_prob values: plugging the pregame ratings into the standard formula gives 1/(1+10^((1491-1562)/400)) = 0.601, which matches neither elo_prob1 (0.614) nor rating_prob1 (0.659). Any help?
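To make the mismatch concrete, here is a minimal sketch of the plain Elo win-probability formula quoted above, applied to both pairs of pregame numbers from the CHC vs MIN row. The function name `elo_win_prob` is made up for illustration, and this is not FiveThirtyEight's code; it only evaluates the textbook formula with no adjustments.

```python
def elo_win_prob(rating_a, rating_b):
    """Plain Elo expected score for side A against side B (no adjustments)."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))


# Values copied from the 2018-07-01 CHC vs MIN row above.
from_ratings = elo_win_prob(1562, 1491)  # rating1_pre, rating2_pre
from_elo = elo_win_prob(1548, 1492)      # elo1_pre, elo2_pre

print(f"rating*_pre -> {from_ratings:.3f}")
print(f"elo*_pre    -> {from_elo:.3f}")
```

With these inputs the plain formula gives roughly 0.601 and 0.580, so neither pair reproduces the elo_prob1 (0.614) or rating_prob1 (0.659) values in the file on its own.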

pschloss commented 6 years ago

Thanks for adding the documentation to the README. I'm afraid this doesn't really answer my questions. What is the difference between elo1_pre and rating1_pre? Also, aside from the numbers input to the formulae, is the formula for elo_prob1 different from that of rating_prob1, i.e. 1/(1+10^((Rb-Ra)/400))?

RZachLamberty commented 6 years ago

@pschloss they explain (at least in words, if not full equations) several of the differences in this blog post. there's some overlap between that post and the README, but I think the post is pretty good for a lot of what you're asking.

for example, the "ratings" are described there as a sort of "decorated" Elo score -- they aren't reverted to the mean like Elo scores, and they are tweaked with overall factors to model some pre-game knowledge (home field advantage, travel distance, and starting pitcher quality).

as for the discrepancies in the calculation of probabilities based on those scores, I suspect they are applying the 24 point home field advantage prior to calculating the probability. that is, the formula would instead be

1 / (1 + 10^((Rb - Ra +/- home_field_advantage_pts) / 400))

this is what they do for nfl elo scores, for example. with that adjustment the numbers are closer (61.3% computed instead of the 61.4% in the csv), leading me to suspect the home field advantage parameter is not exactly 24.
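a rough sketch of that check, using the CHC vs MIN row from the original post. everything here is an assumption for illustration: the 24 point bonus is the guess above, the helper names `win_prob` and `implied_rating_gap` are made up, and adding the pitcher adjustments directly to the ratings is my reading of the blog post, not a confirmed formula.

```python
import math

HOME_FIELD_PTS = 24.0  # assumed home field advantage, per the guess above


def win_prob(home, away, home_bonus=0.0):
    """Elo win probability for the home team, with an optional home bonus."""
    return 1.0 / (1.0 + 10 ** ((away - home - home_bonus) / 400.0))


def implied_rating_gap(prob):
    """Invert the Elo formula: the effective rating gap that yields `prob`."""
    return 400.0 * math.log10(prob / (1.0 - prob))


# elo1_pre / elo2_pre with the assumed home bonus.
elo_prob = win_prob(1548, 1492, HOME_FIELD_PTS)

# Home bonus that would reproduce elo_prob1 = 0.614 exactly. The csv values
# are rounded, so this is only indicative.
implied_bonus = implied_rating_gap(0.614) - (1548 - 1492)

# rating*_pre with pitcher adjustments added (assumption) plus the home bonus.
rating_prob = win_prob(1562 + 17.1, 1491 - 2.43, HOME_FIELD_PTS)

print(f"elo_prob1    ~ {elo_prob:.3f}")
print(f"implied HFA  ~ {implied_bonus:.1f}")
print(f"rating_prob1 ~ {rating_prob:.3f}")
```

under these assumptions I get about 0.613 for elo_prob1, an implied home bonus of about 24.6 points, and about 0.659 for rating_prob1, which lines up with the csv. since the csv shows rounded ratings and probabilities, the 24.6 could just be a rounding effect rather than a different parameter.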