Open dlb8685 opened 4 years ago
Yes. If you look around in the threads, everyone agrees with you. We have moved beyond first digits and tried to look at second digits, but this is also questionable for several reasons.
There are also a few threads that search for other statistical anomalies in the election data.
The README.md file should be updated.
Some more graphs are generated at https://github.com/paras20xx/benford-law-2020-us-election#milwaukee-wisconsin-sample-size-478 with different bases.
For other bases (3 to 10) there are odd patterns for Milwaukee (Wisconsin) and Chicago (Illinois).
For Washington, the pattern when checked through multiple bases seems fine.
For various other locations, graphs are there, but the sample size is not big enough.
> For Washington, the pattern when checked through multiple bases seems fine.
What do you mean by "seems fine"?
@markr-github
If you go to https://github.com/paras20xx/benford-law-2020-us-election/tree/main/dump/washington/vote-count and see all the graphs, the pattern for Washington seems fine (as in no out-of-ordinary patterns).
For other locations, like Chicago, Milwaukee and San Francisco, the data shows discrepancies across different bases (expand the graphs at https://github.com/paras20xx/benford-law-2020-us-election).
@paras20xx Please see the first post in this thread for an explanation of the "anomalies": https://github.com/cjph8914/2020_benfords/issues/9
In addition: https://github.com/cjph8914/2020_benfords/issues/36 https://github.com/cjph8914/2020_benfords/issues/35 https://github.com/cjph8914/2020_benfords/issues/30 https://github.com/cjph8914/2020_benfords/issues/17 https://github.com/cjph8914/2020_benfords/issues/16 https://github.com/cjph8914/2020_benfords/issues/11
@paras20xx
I think you're referring to how Biden's numbers don't follow Benford's when you look at similar-sized precincts in areas that he won. You'd need to show this is "out-of-ordinary", when it's what I'd expect from competitive races when most precincts are around 500. See the response by @testes-t
Further UK results, showing how in the 2015 general election the Conservative votes don't follow Benford's rule, and in the 2016 Brexit referendum neither side did. This is just what close election results with similar-sized voting areas look like and is not evidence of fraud by Conservatives or pro-Brexit people.
It's the same story with second-digit analysis, except perhaps in a less pronounced way. Benford's Law should not be expected to be valid whenever the numbers follow a distribution with an expected value above 0 (or, for non-zero values, above 1). For instance, if the expected value is 15 +/- 1, then the second digit will always be 4, 5 or 6. And if the variance is larger, there will still be an overrepresentation of fives.
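The 15 +/- 1 example can be checked directly. This is a minimal sketch with made-up counts (the list `counts` is hypothetical, not election data); it only shows that values confined to 14..16 can never produce second digits other than 4, 5 and 6.

```python
from collections import Counter

def second_digit(n: int) -> int:
    """Return the second most significant decimal digit of n (requires n >= 10)."""
    if n < 10:
        raise ValueError("n must have at least two digits")
    return int(str(n)[1])

# Hypothetical counts: expected value 15, deviating by at most 1.
counts = [14, 15, 16, 15, 14, 16, 15, 15]

second_digit_histogram = Counter(second_digit(c) for c in counts)
print(sorted(second_digit_histogram))  # only the digits 4, 5 and 6 occur
```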
Next, one can ask what sometimes causes the expected value to be above zero, and at other times at zero. That's a funny philosophical question. For the case of elections with two candidates, if N is the number of votes cast, then candidate 1 will get x votes and candidate 2 will get (N-x) votes. There is thus a contingency between the two, which means that the distribution of votes cast for each candidate cannot just take on any form independently of the other candidate.
@dlb8685 These are good insights, thanks. I don't think a lot of people here have thought much about election data until about a week ago. These things take time to dig into, and people have full-time jobs and families, so it really takes a group effort.
Please see #31, "A deeper dive shows some funny results that could use some insights".
Specifically, why is the Biden Election Day vote distribution in Allegheny County, PA so nearly Gaussian, and why, specifically, in the districts that Trump won?
This case seems of particular interest since there are claims of voter fraud in Allegheny County, PA
@markr-github Your insights are also very helpful. Let me ask: do you see near-Gaussian behavior as typical of other elections you have looked at when Benford's Law is violated (and where there is enough data so that the tail can be seen)?
> the distribution of votes cast for each candidate cannot just take on any form independently of the other candidate.
@testes-t That's a very good point
Gaussian is expected if most vote counts are between 10^x and 10^(x+1), as that makes the first-digit histogram degrade to a histogram of the data itself.
E.g. if we have most vote counts between 100 and 1000, “first digit = 1” basically means “100 ≤ vote count < 200” and so on.
The same applies for other bases; the key is that the variance isn’t high enough.
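To illustrate the point about the histogram degrading, here is a small sketch. The function name `first_digit` and the vote counts are made up for illustration; the `base` parameter is only there to show that the same bucketing happens in other bases.

```python
def first_digit(n: int, base: int = 10) -> int:
    """Leading digit of a positive integer n written in the given base."""
    if n <= 0:
        raise ValueError("n must be positive")
    while n >= base:
        n //= base
    return n

# Hypothetical vote counts, all between 100 and 1000.
votes = [180, 420, 455, 510, 530, 560, 610, 790]

for v in votes:
    # For three-digit decimal counts, the first digit is just the 100-wide bin
    # the count falls into, so the digit histogram is a coarse data histogram.
    assert first_digit(v) == v // 100

print([first_digit(v) for v in votes])
```

In base 3, for example, `first_digit(455, 3)` looks at which power-of-3 bracket the count falls into, so narrow data lands in only one or two leading digits there as well.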
@flying-sheep Yes, but I am getting at something else.
To me, it's not the vote counts that matter, it's the size of the districts where the vote counts can occur. When the districts are small, as in voting, we can't always apply Benford's Law. But I suspect we can still say something.
In the case I show, the Trump data is heavy-tailed, both in 2020 and in 2016, indicating that there are voting districts large enough to support a large number of votes for either candidate. Moreover, the 2016 Hillary counts show that there are districts with enough Democrats to have large vote counts. (Yes, in 2016, the Dem mean and variance are higher.) But what does this mean? For some reason, in 2020, Biden shows a historically small number of high-turnout results in these larger districts.
Why are the Biden Election Day vote counts what they are?
A naive ballot-stuffer might select a random number of ballots to stuff for each district, to make it look natural. But I am saying that natural distributions usually look like a random Pareto distribution, not a random Gaussian distribution. That's why Benford works in other cases. And that's the red flag: not enough large voter turnout for Biden anywhere in the data.
Now maybe the option for mail-in balloting changed things so much it looks odd like this. But it did not change the data in, say, Chicago. Biden's Chicago data is non-Benford, but it is heavy-tailed / Pareto.
@flying-sheep see #31. The other issue
Why does the Biden Election Day data show a high mean in districts where Trump won?
That is, it seems that the Biden data in the Trump-winning districts should look Benford-like. Instead, the data looks Gaussian.
@charlesmartin14 I don't get it, don't you need to show that the numbers should follow Benford before there's any reason to be suspicious about wards in Pittsburgh or Milwaukee returning a reasonable number of Democratic votes?
This is basically saying: there are not many urban districts where Trump blew out Biden with 60-80% of the vote. That sounds like reality.
Here are the Brexit results split by "Leave" and "Remain" voting areas. Both options were pretty competitive in most areas.
@markr-github I would say "could follow Benford ...". And when it could, it should.
So we need to test our test to make sure we can apply it.
What I'm saying is that natural data should be heavy-tailed. Benford's Law is one test for that. It's not the only test. And not being Benford is not enough to stop testing.
These are the tests I would take:
Test 1: could it be Benford? In some cases, the districts may be so small that it is impossible for the data to be Benford. A check should be added to the code: is Benford's Law even a reasonable test here?
And this needs to be tested on the size of the voting districts and/or the number of registered voters, but not on the observed counts. The whole point is that we suspect the vote counts are fraudulent, so we can't run the test on suspect data.
(Many of the criticisms of this analysis/repo, which is quite well known now, are valid because the test is being applied in cases where it should not be.)
Test 2: could it be heavy-tailed? I would then test to see whether the data could be heavy-tailed, or whether the bins (voting districts) are so small that the data will always look, say, Poisson. That's what I have done here, by comparing Biden's data to both Trump's and Hillary's 2016 data. I could probably do a better job here, but I'd like to flesh out the logic first.
Applying the tests: when it could, it should. Finally, if the data can be Benford, or at least heavy-tailed, I then argue that when it could, it should. Here, that means the data should be heavy-tailed, and/or not perfectly (or nearly) Gaussian.
(Perfect anything is odd in real-world data. In studies of the Russian elections, the data is so perfect that researchers have suggested the Russians gamed it to fool them)
testing the vote distributions
Notice I say may. This needs to be done in a lot of cases, exactly like what you are doing.
Final smell test: am I crazy? Then we need to look at the details and ask, does this result pass a smell test? That is, are our suspicions reasonable, or do they stink?
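A rough sketch of how tests 1 and 2 might look in code. This is not what the repo does, and both checks are stand-ins I am assuming for illustration: test 1 uses the common rule of thumb that Benford-like data should span a couple of orders of magnitude (the `min_orders=2.0` threshold is arbitrary), and test 2 uses population excess kurtosis as a crude tail-heaviness indicator (about 0 for Gaussian data, large and positive for heavy tails). All input lists are hypothetical.

```python
import math
from statistics import fmean

def orders_of_magnitude(sizes):
    """log10 of the ratio between the largest and smallest positive size."""
    positive = [s for s in sizes if s > 0]
    if not positive:
        raise ValueError("need at least one positive size")
    return math.log10(max(positive) / min(positive))

def benford_plausible(sizes, min_orders=2.0):
    """Test 1 (crude): do district sizes span enough orders of magnitude?

    Run this on district sizes or registered-voter counts, not on the
    observed vote counts. min_orders is an assumed threshold.
    """
    return orders_of_magnitude(sizes) >= min_orders

def excess_kurtosis(values):
    """Test 2 (crude): population excess kurtosis of the values."""
    mean = fmean(values)
    m2 = fmean((v - mean) ** 2 for v in values)
    m4 = fmean((v - mean) ** 4 for v in values)
    if m2 == 0:
        raise ValueError("values have zero variance")
    return m4 / m2 ** 2 - 3.0

# Hypothetical registered-voter counts per district.
similar_sized = [400, 520, 610, 480, 700, 550]
wide_ranging = [12, 150, 2300, 48000]

print(benford_plausible(similar_sized))  # sizes span well under one order
print(benford_plausible(wide_ranging))   # sizes span more than three orders
print(excess_kurtosis([1] * 9 + [100]))  # one huge value gives a heavy tail
```

Kurtosis is noisy on small samples, so a real check would want something more careful; the point is only that both tests can be run before looking at Benford at all.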
BTW, here are some motivating references:
This repo's use of Benford's Law is so misleading that it discredits other claims of fraud more generally.
2020 Milwaukee Data 2018 House Election Data
If the size of the district you're looking at is the same across many different races, the results will skew towards something completely different from Benford's Law, absent any fraud whatsoever. Thus, these results from Milwaukee or any other place provide no evidence of election fraud.
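A small simulation makes this concrete. It is a sketch under assumed numbers, not Milwaukee data: precincts of 400 to 600 voters and a candidate share between 45% and 75%, with no fraud modelled at all. The function names and parameters are made up for illustration.

```python
import math
import random
from collections import Counter

def first_digit(n: int) -> int:
    """Leading decimal digit of a positive integer."""
    return int(str(n)[0])

def simulate_first_digit_shares(n_precincts: int = 2000, seed: int = 0) -> dict:
    """Simulate one candidate's vote counts in similar-sized precincts.

    Assumed, purely illustrative numbers: 400-600 voters per precinct and
    a candidate share between 45% and 75%. No fraud is modelled.
    """
    rng = random.Random(seed)
    digits = Counter()
    for _ in range(n_precincts):
        size = rng.randint(400, 600)
        share = rng.uniform(0.45, 0.75)
        digits[first_digit(round(size * share))] += 1
    return {d: digits[d] / n_precincts for d in range(1, 10)}

observed = simulate_first_digit_shares()
for d in range(1, 10):
    benford = math.log10(1 + 1 / d)  # Benford's expected share for digit d
    print(f"{d}: simulated {observed[d]:.3f}  Benford {benford:.3f}")
```

With these assumptions every count lands between 180 and 450, so the digits 5 to 9 never appear as a first digit and the digit 1 is rare, which is nothing like the Benford curve even though the simulated election is perfectly clean.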