alex / nyt-2020-election-scraper

https://alex.github.io/nyt-2020-election-scraper/battleground-state-changes.html
MIT License

Changes to hurdle calculation #367

Closed saleemrashid closed 3 years ago

saleemrashid commented 3 years ago
Motivation

The hurdle calculation tries to take third-party votes into account. This was first raised in #194, and was implemented in #200.

However, it calculates the number of votes required for the hurdle using only "relevant" remaining votes (i.e. excluding third-party votes). But it then confusingly divides this by total remaining votes (i.e. including third-party votes).

This effectively makes the metric "what percentage of all remaining votes does the trailing candidate need, assuming the other votes are divided up between the leading candidate and the third-party candidates", which is unexpected and misleading: it makes the trailing candidate's hurdle look lower than it actually is.

A less confusing method is to simply divide by "relevant" remaining votes (i.e. excluding the third-party votes) instead of by total remaining votes. This still takes into account the third-party votes (i.e. "since some remaining votes will be third-party, the trailing candidate needs a higher percentage of future votes to close the gap than if there were no third-party") but gives a more expected hurdle percentage.
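
To make the difference concrete, here is a minimal sketch of the two calculations as described above. The function names and the sample numbers are illustrative assumptions, not the scraper's actual code or real election data; only the formulas follow the description in this pull request.

```python
def votes_needed(vote_diff, votes_remaining_relevant):
    """Two-party votes the trailing candidate must win to draw level."""
    return (vote_diff + votes_remaining_relevant) / 2


def hurdle_old(vote_diff, votes_remaining, votes_remaining_relevant):
    """Behaviour described as confusing: divide by ALL remaining votes."""
    if votes_remaining <= 0:
        return 0.0
    return votes_needed(vote_diff, votes_remaining_relevant) / votes_remaining


def hurdle_new(vote_diff, votes_remaining_relevant):
    """Proposed behaviour: divide by the "relevant" remaining votes only."""
    if votes_remaining_relevant <= 0:
        return 0.0
    return votes_needed(vote_diff, votes_remaining_relevant) / votes_remaining_relevant


# Made-up numbers: a 20-vote gap, 100 votes outstanding,
# of which 98 are expected to go to the two main candidates.
old = hurdle_old(vote_diff=20, votes_remaining=100, votes_remaining_relevant=98)
new = hurdle_new(vote_diff=20, votes_remaining_relevant=98)
print(f"old: {old:.4f}, new: {new:.4f}")  # old: 0.5900, new: 0.6020
```

With these numbers the old calculation understates the hurdle by about one percentage point.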

Note that this still doesn't fully fix the confusion when comparing to the batch breakdown (since a given batch may include more or fewer third-party votes than average), but that's not really possible for us to fix with the data we have.

Changes

The first commit implements the change to the hurdle calculation described above, and also amends the hurdle tooltip to more explicitly explain how the calculation works.

The second commit adjusts the way the proportion of third-party votes is calculated. As the commit message states, we should use the latest data for calculating that proportion (by the law of large numbers), though for our current data:

In practice, this change makes a minute difference. The largest change for the current data was a 0.276 percentage point difference in the proportion of third-party votes, and a ~0.1 percentage point difference in calculated hurdle.

eebasso commented 3 years ago

I think the correct formula should be

hurdle = (vote_diff * votes / ((candidate1_votes + candidate2_votes) * votes_remaining) + 1) / 2 if votes_remaining > 0 else 0

based on the following work: Two Party Hurdle Formula.pdf

saleemrashid commented 3 years ago

@eebasso It looks like your equation is identical to the one implemented in this pull request, except you've taken out the common factor. So, instead of the (vote_diff + votes_remaining_relevant) / (2 * votes_remaining_relevant) in this pull request, your equation is (vote_diff / votes_remaining_relevant + 1) / 2.

What are your thoughts on this?
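
For reference, the factoring step can be written out as a worked equation. The shorthand is an assumption introduced here for brevity: d stands for vote_diff and r for votes_remaining_relevant, with r > 0.

```latex
\[
  \frac{d + r}{2r}
  \;=\; \frac{1}{2}\cdot\frac{d + r}{r}
  \;=\; \frac{1}{2}\left(\frac{d}{r} + 1\right),
  \qquad r > 0 .
\]
```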

eebasso commented 3 years ago

As long as votes_remaining_relevant is implemented roughly as

votes_remaining_relevant = votes_remaining * (Candidate1 + Candidate2) / (total votes)

Then it should be equivalent I think. I'll check that now

fractionalhare commented 3 years ago

That's correct. This PR uses:

votes_remaining_relevant = votes_remaining * (candidate1_votes + candidate2_votes) / votes
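
As a rough sketch, that scaling could be wrapped in a small helper. The function name and the guard for the case where no votes have been counted are assumptions made for illustration; they are not taken from the PR.

```python
def estimate_votes_remaining_relevant(votes_remaining, candidate1_votes,
                                      candidate2_votes, votes):
    """Scale the remaining votes by the two-party share of votes counted so far."""
    if votes <= 0:
        # Assumption: with nothing counted yet there is no third-party
        # share to estimate, so treat every remaining vote as relevant.
        return float(votes_remaining)
    return votes_remaining * (candidate1_votes + candidate2_votes) / votes


# Made-up numbers: 1000 votes counted, 20 of them third-party, 100 outstanding.
relevant = estimate_votes_remaining_relevant(
    votes_remaining=100, candidate1_votes=500, candidate2_votes=480, votes=1000
)
print(relevant)  # 98.0
```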

eebasso commented 3 years ago

Great! I think this will lead to a more accurate reflection of the two-party hurdle needed for each candidate. For example, Trump's hurdle in Arizona will be higher and show why Biden is on track to win the state. The hurdle needed should uniformly increase from where it was before because we are effectively multiplying the old formula by the ratio

votes_remaining / votes_remaining_relevant,

which is always at least 1 (and greater than 1 whenever any third-party votes have been counted).
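
The ratio can be derived in two lines. The symbols are shorthand assumed here: d = vote_diff, R = votes_remaining, r = votes_remaining_relevant, V = votes counted so far, and c_1, c_2 the two main candidates' vote counts, using the definition of votes_remaining_relevant quoted earlier in the thread.

```latex
\[
  h_{\text{old}} = \frac{d + r}{2R},
  \qquad
  h_{\text{new}} = \frac{d + r}{2r},
  \qquad
  r = R\,\frac{c_1 + c_2}{V}
\]
\[
  \frac{h_{\text{new}}}{h_{\text{old}}}
  = \frac{R}{r}
  = \frac{V}{c_1 + c_2}
  \;\ge\; 1,
  \quad\text{since } c_1 + c_2 \le V .
\]
```

Equality holds only when no third-party votes have been counted, in which case the two calculations agree.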