elishayer / mRchmadness

NCAA men's basketball data scraping and bracketology R package
implement homer bias adjustment #13

saberpowers closed this issue 6 years ago

saberpowers commented 6 years ago

This is more of a hand-wavy, rule-of-thumb adjustment based on the research presented here: https://www.cbssports.com/college-basketball/news/homer-bias-is-real-and-it-will-derail-your-march-madness-bracket/

Users should be able to select "hometown" teams, which would then change the population pick distribution.

beamandrew commented 6 years ago

I would love to see this too.

saberpowers commented 6 years ago

Thanks! Will prioritize.

saberpowers commented 6 years ago

Here's the plan. From the round advancement probabilities, get the probability of winning in each round, conditional on reaching that round. Define the home team bias as an increase of +.75 in that conditional log-odds (chosen to match the sparse-detail results published by Brad Null), then recombine the conditional probabilities across rounds to get the probability of reaching each round. This procedure was chosen in an attempt to very roughly mimic Null's results:

[image]

Applying this bias to the population picks from the 2017 men's tournament, we have roughly matched the results Null presented:

[image]

Note that the long-shots have a somewhat smaller bias ratio with our math, but this is partly explained by the fact that 0.1% was the lowest frequency ESPN reported for a team being picked to win the national championship, whereas Null defined long-shots as having less than 0.1% pick frequency. Hence our long-shots are not as long as Null's long-shots.

Formulating the homer bias this way has an appealing interpretation. In the biased fan's mental model, their team is better by +.75 in terms of the estimated team strength coefficient in the Bradley-Terry model.
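For reference, the procedure can be written out as follows. The notation is an assumption introduced here, not taken from the thread: $P_r$ is the population probability that a team wins its round-$r$ game, $q_r$ is the same probability conditional on reaching round $r$, $k$ is the log-odds shift, and $\beta_i$ is team $i$'s Bradley-Terry strength.

```latex
\begin{align*}
q_r &= \frac{P_r}{P_{r-1}}, \qquad P_0 = 1, \quad r = 1, \dots, 6 \\
\operatorname{logit}(\tilde q_r) &= \operatorname{logit}(q_r) + k
  \;\Longleftrightarrow\;
  \tilde q_r = \frac{q_r e^{k}}{1 - q_r + q_r e^{k}} \\
\tilde P_r &= \prod_{s=1}^{r} \tilde q_s \\
\operatorname{logit} \Pr(i \text{ beats } j) &= \beta_i - \beta_j
\end{align*}
```

As a quick check, a coin-flip game ($q_r = 0.5$) with $k = 0.75$ becomes $\tilde q_r = e^{0.75} / (1 + e^{0.75}) \approx 0.679$. Replacing $\beta_i$ with $\beta_i + k$ shifts the log-odds of every head-to-head game by exactly $k$. Because $q_r$ averages over possible opponents, the match with the per-round shift is approximate rather than exact.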

Here's the hastily written code used to produce the second figure above:

`%>%` = dplyr::`%>%`

# Population pick frequencies: probability of winning each round, by team
cumulative = mRchmadness::pred.pop.men.2017 %>% tibble::as_tibble()

cumulative$type = ifelse(cumulative$round6 > .05, 'favorite',
  ifelse(cumulative$round6 <= .001, 'longshot', 'middle'))

# Probability of winning each round, conditional on reaching that round
conditional = cumulative
conditional$round2 = cumulative$round2 / cumulative$round1
conditional$round3 = cumulative$round3 / cumulative$round2
conditional$round4 = cumulative$round4 / cumulative$round3
conditional$round5 = cumulative$round5 / cumulative$round4
conditional$round6 = cumulative$round6 / cumulative$round5
conditional$round6[conditional$round6 == 1] = .999

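# Shift the log-odds of p by k and cap the result at .999, so that p = 1
# yields .999 instead of NaN. The default here is .7; the text above quotes +.75.
bias = function(p, k = .7) {
  pmin(.999, plogis(qlogis(p) + k))
}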

conditional_bias = conditional %>% dplyr::mutate(
  round1 = bias(round1),
  round2 = bias(round2),
  round3 = bias(round3),
  round4 = bias(round4),
  round5 = bias(round5),
  round6 = bias(round6))

# Recombine the biased conditional probabilities into cumulative probabilities
cumulative_bias = conditional_bias
cumulative_bias$round2 = cumulative_bias$round1 * conditional_bias$round2
cumulative_bias$round3 = cumulative_bias$round2 * conditional_bias$round3
cumulative_bias$round4 = cumulative_bias$round3 * conditional_bias$round4
cumulative_bias$round5 = cumulative_bias$round4 * conditional_bias$round5
cumulative_bias$round6 = cumulative_bias$round5 * conditional_bias$round6

prob = cumulative %>%
  dplyr::select(-name) %>%
  dplyr::group_by(type) %>%
  dplyr::summarize_all(mean) %>%
  dplyr::select(-type) %>%
  as.matrix

prob_bias = cumulative_bias %>%
  dplyr::select(-name) %>%
  dplyr::group_by(type) %>%
  dplyr::summarize_all(mean) %>%
  dplyr::select(-type) %>%
  as.matrix

matplot(t(prob_bias / prob), type = 'l', ylab = 'Avg. Bias Ratio',
  ylim = c(0, 15), axes = FALSE, lwd = 2, lty = 1,
  col = c('forestgreen', 'dodgerblue', 'darkorange'))
axis(2, at = c(0, 5, 10, 15), labels = c('0%', '500%', '1000%', '1500%'))
axis(1, at = 1:6, labels = c('R32', 'S16', 'E8', 'F4', 'Final', 'Champ'))
legend('topleft', c('longshots', 'middle', 'favorites'),
  col = c('dodgerblue', 'darkorange', 'forestgreen'), lwd = 2, bty = 'n')

beamandrew commented 6 years ago

So how would you incorporate this into the mRchmadness workflow? I don't see how to pass this information to pool.source (which takes a string) in the find.bracket() function. I want to use the biased probabilities to simulate my pool, right?

saberpowers commented 6 years ago

I'm almost done with an add.home.bias function that will take a character vector of home teams (which will frequently be a length-1 vector) and return a modified pred.pop.[league].[year] with the probabilities increased for the home teams according to the formulation above. Once that's done, I'll add a home.teams argument to find.bracket and test.bracket, and each will call add.home.bias when the home.teams argument is not NULL. You can expect this functionality to be in the 1.0.3 release at the end of today.
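A minimal sketch of what such a function might look like, for readers following along. This is not the package's add.home.bias: the name add_home_bias_sketch, its arguments, and the toy data are all hypothetical. It assumes a plain data frame with a name column and cumulative columns round1 through round6, and it uses the +.75 shift quoted in the plan above.

```r
# Hypothetical sketch, NOT the package's add.home.bias implementation.
# Assumes pred_pop is a data frame with a `name` column and cumulative
# round-advancement columns round1..round6; k is the log-odds shift.
add_home_bias_sketch <- function(pred_pop, home_teams, k = 0.75) {
  rounds <- paste0("round", 1:6)
  is_home <- pred_pop$name %in% home_teams
  if (!any(is_home)) return(pred_pop)

  cumulative <- as.matrix(pred_pop[is_home, rounds, drop = FALSE])

  # Probability of winning each round, conditional on reaching it
  previous <- cbind(1, cumulative[, -length(rounds), drop = FALSE])
  conditional <- cumulative / previous
  conditional[is.nan(conditional)] <- 0  # 0/0: team never reaches the round

  # Add k to the log-odds (closed form of the logit shift), capped at .999
  biased <- conditional * exp(k) / (1 - conditional + conditional * exp(k))
  biased[biased > 0.999] <- 0.999

  # Recombine conditional probabilities into cumulative probabilities
  pred_pop[is_home, rounds] <- t(apply(biased, 1, cumprod))
  pred_pop
}

# Toy data (made up for illustration, not real pick frequencies)
pred_pop <- data.frame(
  name = c("Team A", "Team B"),
  round1 = c(0.8, 0.5), round2 = c(0.6, 0.25), round3 = c(0.4, 0.1),
  round4 = c(0.2, 0.05), round5 = c(0.1, 0.02), round6 = c(0.05, 0.01)
)
biased_pop <- add_home_bias_sketch(pred_pop, home_teams = "Team B")
```

Only the rows for the selected home teams change; every other team's pick frequencies are returned as-is, so the result can stand in wherever the unmodified population table was used.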

beamandrew commented 6 years ago

You're the real MVP this year!

saberpowers commented 6 years ago

:heart: