cjph8914 / 2020_benfords

369 stars 83 forks

Failure to account for external factors #5

Open sfxworks opened 3 years ago

sfxworks commented 3 years ago

Don't get me wrong; this looks well written. However, the README could use a notice about such external factors, and I'd be happy to contribute a PR adding one.

PendragonDay commented 3 years ago

The data for the Libertarian candidate paints the picture.

sfxworks commented 3 years ago

@PendragonDay I would argue that that isn't a large enough sample, or an accurate enough correlation, to serve as a control group, for a number of reasons. For example, a vote reflects an individual's wish to see their ideals come to life. Unfortunately, many voters do not vote for minor parties even when those parties align with their ideals, because such parties rarely win; they side with a major party instead.

ghost commented 3 years ago

The most important external factor is precinct size, which is unlikely to be a random number in large cities.

I just ran a simple Benford analysis in Excel using data from the 2019 Norwegian election. The results of my analysis looked very suspicious for the county of Oslo. However, looking more closely at the data, it turned out that the precincts quite consistently served 2000-8000 eligible voters (typically, people vote at the premises of a local school). Precinct sizes in Oslo County thus do not follow Benford's Law.

A party that gained about 10% support in most precincts thus quite consistently received 200-800 votes in each precinct. Their results therefore turned out to be very much in breach of Benford's Law, but there was an easy explanation as to why.
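The effect described above can be reproduced with a small simulation. This is only a sketch under stated assumptions: the precinct sizes are drawn uniformly from 2000-8000, the party's share is taken as 9%-11% of eligible voters in every precinct, and none of the numbers come from the actual Norwegian data.

```python
import math
import random
from collections import Counter

def first_digit(n: int) -> int:
    """Return the leading decimal digit of a positive integer."""
    return int(str(n)[0])

def benford_expected(d: int) -> float:
    """Benford's Law probability of leading digit d (1-9)."""
    return math.log10(1 + 1 / d)

random.seed(0)  # fixed seed so the sketch is repeatable

# Assumption: precinct sizes are uniform between 2000 and 8000 eligible voters.
sizes = [random.randint(2000, 8000) for _ in range(1000)]
# Assumption: the party receives roughly 10% (9%-11%) in every precinct.
votes = [round(s * random.uniform(0.09, 0.11)) for s in sizes]

counts = Counter(first_digit(v) for v in votes)
for d in range(1, 10):
    observed = counts[d] / len(votes)
    print(f"{d}: observed {observed:.3f}  Benford {benford_expected(d):.3f}")
```

With these inputs the vote counts fall between 180 and 880, so the leading digit 9 never occurs and the digit 1 is far rarer than the roughly 30% Benford's Law predicts, without any manipulation of the votes.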

For Norway as a whole, however, precinct sizes follow Benford's Law to a much greater degree. One would assume that there is a city-countryside divide due to the planned size of precincts in cities, and the "organic" size of precincts in the countryside. The planning has caused precinct sizes to violate Benford's Law in Oslo, and may well do so in the cities you are investigating in the US as well.

So in order to improve the transparency of the analysis, it's important to add another diagram for each county being analyzed that shows the distribution of first digits of the precinct sizes in that county, compared with the distribution predicted by Benford's Law. Otherwise the analysis would at least be misleading in the case that I described above.
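A minimal sketch of such a comparison, printed as a text histogram rather than a real plot. The function names and the `example_sizes` list are hypothetical and are not taken from this repository; real precinct sizes would have to be loaded from the county data.

```python
import math
from collections import Counter

def first_digit_shares(values):
    """Share of each leading digit 1-9 among the positive integers in values."""
    digits = [int(str(v)[0]) for v in values if v > 0]
    counts = Counter(digits)
    return {d: counts[d] / len(digits) for d in range(1, 10)}

def benford_shares():
    """Leading-digit shares predicted by Benford's Law."""
    return {d: math.log10(1 + 1 / d) for d in range(1, 10)}

def print_comparison(values, width=40):
    """Print one bar per digit with observed and expected shares."""
    observed = first_digit_shares(values)
    expected = benford_shares()
    for d in range(1, 10):
        bar = "#" * round(observed[d] * width)
        print(f"{d} {bar:<{width}} obs={observed[d]:.2f} exp={expected[d]:.2f}")

# Hypothetical precinct sizes, for illustration only.
example_sizes = [2100, 3400, 3900, 4200, 4800, 5100, 5600, 6300, 7200, 7900]
print_comparison(example_sizes)
```

If the precinct sizes themselves deviate from Benford's Law this clearly, a deviation in the vote counts is not surprising on its own.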

Overall, I am not convinced that Benford's Law can be relevant to precinct results without further statistical adjustment.

ghost commented 3 years ago

To be more specific, if you look at the Biden diagram for each of Chicago, Milwaukee and Allegheny, the first digits seem to follow a roughly Gaussian distribution centered on about 3-5.

Let me try to explain why this happens: If (1) politicians for practical reasons tend to divide each precinct so that they have an average of 5,000-8,000 voters (e.g. due to the use of public schools as voting venues); if (2) there was a turnout of 80%; and (3) 70% actually voted for Biden, then you would expect Biden results for each precinct at 3,000-4,500 votes to be statistically overrepresented - with a pseudo-Gaussian distribution.
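The arithmetic behind that estimate can be checked directly. The three inputs are the assumptions stated in the comment above, not measured values.

```python
# Assumptions taken from the comment: precinct size, turnout, and vote share.
low_size, high_size = 5000, 8000   # eligible voters per precinct
turnout = 0.80                     # fraction of eligible voters who vote
biden_share = 0.70                 # fraction of votes cast for Biden

low_votes = round(low_size * turnout * biden_share)
high_votes = round(high_size * turnout * biden_share)

print(f"Expected Biden votes per precinct: {low_votes}-{high_votes}")
```

This gives 2,800-4,480 votes, close to the rounded 3,000-4,500 quoted above, so under these assumptions the leading digits would cluster on 2, 3 and 4.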

That would explain at least the general trend that we are seeing in the data.

If assumption (1) holds true, it would, as far as I can tell, make the use of Benford's Law impractical in the general case. The analysis does show that someone cooked the numbers, but I find it plausible that it was the precinct sizes that were cooked, and it would be difficult to observe anything beyond that. So I am looking forward to seeing a diagram that details the distribution of precinct sizes in each county in question.

czr137 commented 3 years ago

https://imgur.com/a/LPjXX92

The histograms above show that the distributions of voters per area will necessarily conflict with Benford's Law, because the areas are so similar in size.

The votes cast per area do not span the requisite orders of magnitude for Benford's Law to apply. You are outside the use case and are misapplying Benford's Law. @testes-t's explanation is a good analogy.
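One rough way to check this precondition is to measure how many orders of magnitude a data set spans. The helper below is a hypothetical sketch, and the sample values are illustrative rather than taken from the histograms.

```python
import math

def orders_of_magnitude(values):
    """Return log10(max / min) over the positive entries of values."""
    positive = [v for v in values if v > 0]
    return math.log10(max(positive) / min(positive))

# Illustrative values only: vote counts confined to a narrow band,
# versus a data set that ranges from 1 to 10,000.
narrow = [200, 350, 500, 650, 800]
wide = [1, 12, 150, 2300, 10000]

print(f"narrow span: {orders_of_magnitude(narrow):.2f} orders of magnitude")
print(f"wide span:   {orders_of_magnitude(wide):.2f} orders of magnitude")
```

Counts confined to 200-800 span only about 0.6 orders of magnitude, so their leading digits cannot spread out the way Benford's Law expects.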

ghost commented 3 years ago

As pointed out here: https://github.com/cjph8914/2020_benfords/issues/12

the problem I outlined can be fixed by switching to a second-digit test. That would, however, invalidate the results that are currently published in the README file.

Can anyone update the code accordingly? A first-digit test evidently cannot provide any proof of foul play.
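For reference, a sketch of the expected second-digit frequencies under Benford's Law, obtained by summing the two-digit probabilities over all possible first digits. The function names are hypothetical and this is not the repository's code.

```python
import math

def second_digit(n: int) -> int:
    """Return the second decimal digit of an integer n >= 10."""
    return int(str(n)[1])

def benford_second_digit_shares():
    """Expected share of each second digit 0-9 under Benford's Law.

    P(d2) = sum over d1 = 1..9 of log10(1 + 1 / (10 * d1 + d2)).
    """
    return {
        d2: sum(math.log10(1 + 1 / (10 * d1 + d2)) for d1 in range(1, 10))
        for d2 in range(10)
    }

for digit, share in benford_second_digit_shares().items():
    print(f"{digit}: {share:.4f}")
```

The expected shares fall only gently, from about 12.0% for 0 to about 8.5% for 9, so a second-digit test depends much less on the overall size of the precincts than a first-digit test does.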

pkit commented 3 years ago

@testes-t I'm not sure what you are implying. One (just one) candidate has a different distribution from the others. It's called an "anomaly", and that's what Benford's Law is for: detecting anomalies. I see that an anomaly was successfully detected. No amount of wordplay will change that.

ghost commented 3 years ago

OK, I agree with you. Discussion and interpretation are also relevant, but seem to be more difficult than I had thought.

pkit commented 3 years ago

@testes-t yup, it also may be a "natural" anomaly, but unless we have more data it is what it is.