cjph8914 / 2020_benfords

Why is the Biden Election Day vote data nearly Gaussian? #31

Open charlesmartin14 opened 3 years ago

charlesmartin14 commented 3 years ago

The weirdest thing to me about all of this is that the Election Day vote distribution for Biden is almost perfectly Normal, with a slight right skew.

Screen Shot 2020-11-08 at 9 44 15 PM

Whereas the Trump data, being heavy-tailed, just looks more like real-world data to me.

Screen Shot 2020-11-08 at 9 44 26 PM

Any thoughts?

ghost commented 3 years ago

I don't understand why that's potentially suspicious. Which distribution did you expect to see?

wongjiahau commented 3 years ago

@testes-t Benford

frycast commented 3 years ago

@testes-t Benford

Nope, not Benford. Benford is to do with the leading digits in the vote counts, whereas these are the vote counts themselves.
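
To make the distinction concrete, here is a minimal sketch of what a first-digit Benford check looks at: only the leading digit of each count, compared against the expected share log10(1 + 1/d). The function names and the sample counts are hypothetical, made up purely for illustration; this is not real election data and not the repo's code.

```python
import math
from collections import Counter

def benford_expected(digit: int) -> float:
    """Benford's Law probability that the leading digit equals `digit` (1-9)."""
    return math.log10(1 + 1 / digit)

def leading_digit(count: int) -> int:
    """First significant digit of a positive integer vote count."""
    return int(str(count)[0])

def leading_digit_frequencies(counts):
    """Observed share of each leading digit among the positive counts."""
    digits = [leading_digit(c) for c in counts if c > 0]
    tally = Counter(digits)
    return {d: tally.get(d, 0) / len(digits) for d in range(1, 10)}

# Hypothetical precinct counts, for illustration only (not real data).
example_counts = [112, 98, 134, 105, 87, 141, 120, 93, 101, 156]
observed = leading_digit_frequencies(example_counts)

for d in range(1, 10):
    print(d, round(observed[d], 2), round(benford_expected(d), 3))
```

Counts clustered around 100, like the made-up ones here, pile up on the digits 1, 8 and 9 regardless of how they were generated, which is one reason the shape of the count distribution and the digit test are separate questions.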

charlesmartin14 commented 3 years ago

@testes-t @frycast

If the voting data is, by its natural inclination, normally distributed, why should only the Biden data be this way, and only for the Election Day data?

Here is the Absentee data:

Screen Shot 2020-11-09 at 6 08 48 AM

Notice that the Biden data is more heavy-tailed (i.e., it has a long tail to the right), which is barely visible in the Election Day data. However, the Biden data is almost bimodal, which is another indicator of potential fraud.

Most natural-world data like this is heavy-tailed. Why? Think about issue #17. Here, it is argued that Biden is a highly preferred candidate. That's fine. But why would every single voter have the exact same individual preference? In pure math terms, why would the prior on the preference (p) be highly peaked?

Sure, people can behave in similar ways, but in the aggregate, I argue, voters have both their own individual preferences and cluster into diverse peer groups with similar trends. Mathematically, one might say that this broadens the prior on the individual preferences, giving the fatter-tailed distributions that then behave Benford-like, even in these small, bounded data sets.

If you ask an election forensics researcher (like Mebane) why the Biden voting data isn't always Benford, and why there are so many deviations from Benford, he would argue that this is evidence of some kind of effective election strategy by Biden. That is, this is a kind of herding behavior, where everyone in the sample is behaving (nearly) exactly the same way. He believes this kind of induced herding behavior is widespread, and he has put together some simulations to argue for this. He has a talk on this (https://www.youtube.com/watch?v=zkx_eO0PvXU).

So if this is the case in the Milwaukee data that @frycast highlighted, that could be the explanation.

But to really convince ourselves of this, we should find a way to test this hypothesis on other metrics in the data, such as the total voting distributions (and the voter turnout data, etc.).

Here, it seems odd to me that, even given this underlying induced herding behavior, we would still see a nearly perfect Gaussian distribution, and only for the Election Day data.

This is especially odd to me since we know that, at least in PA, the Trump turnout was near or over 100% (with many unaffiliated voters), whereas the Biden turnout was down ~75%. So why would the Biden election data show signatures of an incredibly successful election strategy whereas the Trump data is more 'natural'?

ghost commented 3 years ago

I have no idea, but it's possible that Republicans are less collectivist.

charlesmartin14 commented 3 years ago

@testes-t. Something like that could be the case. That is, say election workers for the Biden campaign went all around the city (or called during Covid), ensuring people would vote that day, and enough so that they met some kind of internal quota on "how many people called, how many people said yes, etc".

That is, they went out and physically brought them in on Election Day.

This kind of collective action might generate such perfect data.

But it could also be evidence of wide-scale collusion, where every polling place ensured they hit a minimum number of about 100 Biden voters no matter what, with a little random variation thrown in to hide the fraud.

andrewzigerelli commented 3 years ago

@charlesmartin14 Maybe for completeness, do a QQ plot as well. Also various goodness of fit tests, but keep in mind the power if you fail to reject the null.

I don't use Python for these tests, but I imagine it's just as easy.
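
For reference, a minimal standard-library sketch of the idea behind a normal QQ plot: sort the sample, pair each value with the matching quantile of a normal fitted to the same data, and see how close the pairs fall to a straight line. The helper names and the synthetic samples are assumptions made for illustration only. No plotting library is used, so the straightness of the QQ line is summarized by a correlation coefficient instead of a picture.

```python
import math
import random
import statistics
from statistics import NormalDist

def normal_qq_pairs(sample):
    """Return (theoretical, observed) quantile pairs against a fitted normal."""
    xs = sorted(sample)
    n = len(xs)
    fitted = NormalDist(statistics.mean(xs), statistics.stdev(xs))
    # Plotting positions (i + 0.5) / n avoid the infinite quantiles at 0 and 1.
    return [(fitted.inv_cdf((i + 0.5) / n), x) for i, x in enumerate(xs)]

def qq_correlation(pairs):
    """Pearson correlation of the QQ pairs; values near 1 mean a straight line."""
    theo = [t for t, _ in pairs]
    obs = [o for _, o in pairs]
    mt, mo = statistics.mean(theo), statistics.mean(obs)
    cov = sum((t - mt) * (o - mo) for t, o in pairs)
    var_t = sum((t - mt) ** 2 for t in theo)
    var_o = sum((o - mo) ** 2 for o in obs)
    return cov / math.sqrt(var_t * var_o)

# Synthetic samples for illustration only (not election data).
rng = random.Random(0)
bell_shaped = [rng.gauss(100, 20) for _ in range(500)]
right_skewed = [rng.expovariate(1 / 100) for _ in range(500)]

r_bell = qq_correlation(normal_qq_pairs(bell_shaped))
r_skew = qq_correlation(normal_qq_pairs(right_skewed))
print(f"QQ correlation, bell-shaped sample:  {r_bell:.4f}")
print(f"QQ correlation, right-skewed sample: {r_skew:.4f}")
```

As the comment above cautions, a QQ line that looks straight on a small sample is weak evidence: failing to reject normality says little when the test has low power.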

charlesmartin14 commented 3 years ago

@andrewzigerelli

The issue I have is not that the data would satisfy a p-value or other test like this, but that, structurally, the data is far more Gaussian than heavy-tailed.

That is, if we compare the Biden Election Day vote data to different distributions (e.g., using a non-parametric K-S test), we would find that the Gaussian is the best fit.

Now it could be that I am imagining things, and that the tiny little tail we see (from 200 to 400) is just small because the data set sample is so small.
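
A rough sketch of that kind of comparison, assuming just two candidate families (normal and lognormal) and using only the standard library. The function names and the synthetic samples are hypothetical and this is not the analysis behind the plots in this thread. It computes the K-S distance between the empirical CDF and each fitted CDF, and the smaller distance indicates the closer fit.

```python
import math
import random
import statistics
from statistics import NormalDist

def ks_statistic(sample, cdf):
    """Largest gap between the empirical CDF of `sample` and the model `cdf`."""
    xs = sorted(sample)
    n = len(xs)
    gap = 0.0
    for i, x in enumerate(xs):
        f = cdf(x)
        # The empirical CDF jumps from i/n to (i+1)/n at x; check both sides.
        gap = max(gap, f - i / n, (i + 1) / n - f)
    return gap

def compare_fits(sample):
    """K-S distance to a fitted normal and a fitted lognormal (positive data only)."""
    if min(sample) <= 0:
        raise ValueError("the lognormal fit needs strictly positive values")
    normal = NormalDist(statistics.mean(sample), statistics.stdev(sample))
    logs = [math.log(x) for x in sample]
    log_normal = NormalDist(statistics.mean(logs), statistics.stdev(logs))
    return {
        "normal": ks_statistic(sample, normal.cdf),
        "lognormal": ks_statistic(sample, lambda x: log_normal.cdf(math.log(x))),
    }

# Synthetic samples for illustration only (not election data).
rng = random.Random(1)
heavy = [rng.lognormvariate(4.5, 1.0) for _ in range(4000)]
bell = [rng.gauss(100, 20) for _ in range(4000)]

heavy_fit = compare_fits(heavy)
bell_fit = compare_fits(bell)
print("heavy-tailed sample:", heavy_fit)
print("bell-shaped sample: ", bell_fit)
```

One caveat: because the parameters are estimated from the same data, the standard K-S p-values do not apply, so the distances are best read as a relative ranking of fits rather than as a formal test.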

This plot shows ~half the data (votes > 100), on a log10 scale

Screen Shot 2020-11-09 at 8 16 14 AM

You can see that little 'tail' at the end, above 200, which is about 3.5% of the data set. (In our work we call that a 'heavy-tailed finger'.)

The Trump data is structurally completely different--here is a comparative plot on the same scale (density, not total votes)

Screen Shot 2020-11-09 at 8 19 38 AM

I have looked at hundreds of these kinds of plots in my own research on heavy-tailed phenomena, and this is just a case that really caught my attention.

I'm happy to do the test you asked for, and some deeper analysis, and I'll try to get to it after work tonight.

MechanicalTim commented 3 years ago

These two distributions look "structurally completely different" in the same way that a Poisson distribution with lambda=1 looks completely different from one which has lambda=4. The left-hand tail emerges as lambda gets larger -- or in our case as vote share grows -- and it begins to be well approximated by a Gaussian. The most recent plot here masks the larger mean value, which is probably an important feature of the data.

Note that I am not claiming these are or "should" be Poisson, or approach Gaussian.
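
The lambda comparison can be checked numerically with a small standard-library sketch. The function names are illustrative. It prints the probability mass at zero and the skewness, which for a Poisson distribution is 1/sqrt(lambda), so the shape becomes more symmetric as lambda grows.

```python
import math

def poisson_pmf(k: int, lam: float) -> float:
    """P(X = k) for a Poisson(lam) variable, computed in log space for stability."""
    return math.exp(-lam + k * math.log(lam) - math.lgamma(k + 1))

def poisson_skewness(lam: float) -> float:
    """Skewness of a Poisson(lam) distribution; it shrinks as lam grows."""
    return 1.0 / math.sqrt(lam)

for lam in (1, 4, 25, 100):
    print(
        f"lambda={lam:>3}  P(X=0)={poisson_pmf(0, lam):.3g}  "
        f"skewness={poisson_skewness(lam):.2f}"
    )
```

At lambda=1 more than a third of the mass sits at zero and there is no left-hand tail at all; by lambda=4 the mass at zero is under 2% and a left-hand tail has appeared, which is the change in appearance described above.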

I just don't see this as something that is "suspicious", or to be characterized as an anomaly. I especially do not like the use of the word anomaly in the issues here, because (to me) that word implies that one understands the expected behavior of the distributions, and then sees a deviation from that expected behavior.

The repo started from an implied assumption that we should expect Benford's Law, which I think got everything off on the wrong track. (Of course, it also didn't help that probably almost everyone here came in with a "Bayesian prior" of whether they believe there was fraud.)

Since I am back working now, I'm not sure I've kept up with every reply to every issue. But it did look like there were some promising efforts to figure out what the expected behavior is.

I'll also drop this one thought here, since I might not be back for a while. Another cloud hanging over everything is the use of just 3 counties (or more when I wasn't looking?), out of the more than 3,000 counties in the U.S. If the original repo author seriously cherry-picked these datasets, then there is an awful lot of effort spent fitting models to datasets that could themselves be weird due to sampling variation alone.

charlesmartin14 commented 3 years ago

@MechanicalTim These are great questions. And, like you, I'll come back after work to re-visit

snex commented 3 years ago

@MechanicalTim I have been analyzing counties outside of the "suspect" ones (Miami-Dade FL and Cuyahoga OH for example), and at least when it comes to Benford 2nd digit, both major parties conform quite well. It is difficult to collect data from many counties at the precinct level, as they simply don't supply such information much of the time. If anyone is able to source this data, I would love to run it through my tool.

charlesmartin14 commented 3 years ago

@snex Can we get this additional data checked into the repo here as we collect it? Thanks.

snex commented 3 years ago

It is currently in my own repo over at https://github.com/snex/election_results_benford. Feel free to pull it in here. I am using XML files where available so you may have to convert those.

frycast commented 3 years ago

We know a few things about the data generating process. We know the data are discrete. We also know that the number of votes can't drop below 0.

There are (mostly) two sets of counts happening in each ward: the number of Trump supporters arriving and the number of Biden supporters arriving. That suggests we approximately have two Poisson processes in each ward, with different rates between wards. So the whole county could be modelled by a set of independent but non-identically distributed tuples of Poisson random variables. There are other possibilities, of course. This may not be the best choice.

The above suggests it's unsurprising that we are seeing Poisson-like behaviour overall in the final distribution of counts.
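
A small simulation sketch of this model, under assumptions chosen only for illustration: one count per ward, with ward rates drawn from a lognormal spread (an arbitrary choice) and a hand-rolled Poisson sampler, since the standard library does not provide one. It compares the variance-to-mean ratio of counts from a single shared rate with counts from rates that differ between wards.

```python
import math
import random
import statistics

def sample_poisson(lam, rng):
    """Knuth's multiplication method; assumes lam is well below ~700."""
    threshold = math.exp(-lam)
    k, product = 0, rng.random()
    while product > threshold:
        k += 1
        product *= rng.random()
    return k

def dispersion_index(counts):
    """Variance-to-mean ratio; about 1 for a single Poisson rate."""
    return statistics.variance(counts) / statistics.mean(counts)

rng = random.Random(2)
n_wards = 2000

# Every ward shares the same rate: plain Poisson counts.
same_rate = [sample_poisson(60, rng) for _ in range(n_wards)]

# Rates differ between wards (hypothetical lognormal spread around 60).
ward_rates = [rng.lognormvariate(math.log(60), 0.5) for _ in range(n_wards)]
mixed = [sample_poisson(rate, rng) for rate in ward_rates]

d_same = dispersion_index(same_rate)
d_mixed = dispersion_index(mixed)
print(f"dispersion, one shared rate:  {d_same:.2f}")
print(f"dispersion, rates vary:       {d_mixed:.2f}")
```

When the rates vary, the pooled counts are much more spread out than a single Poisson would allow (variance = mean rate + variance of the rates), so the overall shape depends heavily on how the ward rates themselves are distributed.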

frycast commented 3 years ago

He would argue that this is evidence of some kind of effective election strategy by Biden. That is, this is a kind of herding behavior, where everyone in the sample is behaving (nearly) exactly the same way.

@charlesmartin14 if it is true that this herding behaviour effect is playing a role, then a simple explanation could be the massive impact of conditions related to the pandemic.

charlesmartin14 commented 3 years ago

Sorry guys, I'm busy for the next few hours... I'm on CA time.

But, just briefly, here's a 5 min workup of The Donald's Election Day data:

Screen Shot 2020-11-09 at 2 27 51 PM

A Poisson distribution would have an exponentially decaying tail... this data does not.

The data D appears to be best described by a Truncated Power Law, with power law exponent ~3.6.

(I'm doing this quickly over a break; if I am in error, please repeat the analysis and do the actual Poisson.)
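
For anyone repeating the analysis, here is a minimal sketch of estimating a tail exponent by maximum likelihood. It is a simplification: it fits a pure continuous power law above an assumed cutoff `x_min`, not the truncated power law reported above, and it is checked against synthetic data rather than the vote counts. The function names and the choice of `x_min` are hypothetical.

```python
import math
import random

def power_law_alpha_mle(sample, x_min):
    """MLE of the exponent alpha for p(x) ~ x**(-alpha), using values >= x_min."""
    tail = [x for x in sample if x >= x_min]
    log_sum = sum(math.log(x / x_min) for x in tail)
    return 1.0 + len(tail) / log_sum

def sample_power_law(alpha, x_min, rng):
    """Inverse-transform draw from a continuous power law with lower bound x_min."""
    # 1 - rng.random() lies in (0, 1], so the base is never zero.
    return x_min * (1.0 - rng.random()) ** (-1.0 / (alpha - 1.0))

# Synthetic check: recover a known exponent (illustrative values only).
rng = random.Random(3)
true_alpha, x_min = 3.6, 100.0
synthetic = [sample_power_law(true_alpha, x_min, rng) for _ in range(5000)]

alpha_hat = power_law_alpha_mle(synthetic, x_min)
print(f"true alpha = {true_alpha}, estimated alpha = {alpha_hat:.2f}")
```

The estimate is sensitive to `x_min` and to the number of points in the tail, and real vote counts are discrete and bounded, so on a few hundred precincts the uncertainty on the exponent can be large.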

MechanicalTim commented 3 years ago

@frycast, I was using the Poisson distribution mostly as an example of a distribution that can have a very different look, depending on parameters.

But I'm hesitant to think of the election as a Poisson (or even Poisson-like) process, mainly because I cannot wrap my head around what the analogous thing to "arrival over a time or spatial interval" is -- voting doesn't seem to fit that "interval" concept to me -- and also other Poisson assumptions like independence of individual events, and so on.

EDIT: clarify what I meant by "arrival"

charlesmartin14 commented 3 years ago

This data, with the power-law exponent between 2 and 4, is consistent with a scale-invariant generating process (subject to finite-size effects), or classical herding behavior. I can go into detail later, but this is exactly what I mean by structurally different: the tail behavior is different.

And it is exactly this kind of heavy-tailed, scale-invariant behavior that Benford's Law is testing for.

That's why Trump's data kinda looks Benford, and Biden's does not (on the first digit test).

charlesmartin14 commented 3 years ago

Ok, so back to this

@frycast @MechanicalTim

But I'm hesitant to think of the election as a Poisson (or even Poisson-like) process, mainly because I cannot wrap my head around what the analogous thing to "arrival over a time or spatial interval" is -- voting doesn't seem to fit that "interval" concept to me -- and also other Poisson assumptions like independence of individual events, and so on.

This is my point about Biden's Election Day voting data in this thread. Trump is clearly not Poisson or Poisson-like (not sure what that means, but I take it to mean having a large peak at low values and an exponentially decaying tail).

But Biden's data may in fact be Poisson or Poisson-like. (analysis to come, although I encourage others to do this too).

@frycast if it is true that this herding behaviour effect is playing a role, then a simple explanation could be the massive impact of conditions related to the pandemic.

Maybe, but I don't think the Biden Election Day data is displaying this kind of herding behavior, unless, as pointed out elsewhere (I forget which issue), the Democrats' behavior is just oddly collectivist.

charlesmartin14 commented 3 years ago

@andrewzigerelli Here are the QQ plots. Great suggestion, thanks.

Screen Shot 2020-11-09 at 7 39 51 PM

charlesmartin14 commented 3 years ago

And here is the same data for Hillary, 2016.

Seems to have the same pattern as Biden 2020, but 1/2 the voter turnout

Screen Shot 2020-11-09 at 8 46 13 PM

panicfarm commented 3 years ago

And here is the same data for Hillary, 2016.

Seems to have the same pattern as Biden 2020, but 1/2 the voter turnout

Screen Shot 2020-11-09 at 8 46 13 PM

Is there a count distribution for other candidates in the same district in 2016?

charlesmartin14 commented 3 years ago

This?

Screen Shot 2020-11-09 at 9 05 26 PM

panicfarm commented 3 years ago

This?

Screen Shot 2020-11-09 at 9 05 26 PM

Yes, it looks more Gaussian than Trump's, with a distinct peak at 200. In 2020 there was a similar peak at 100. What is special about these counts (100 in 2020 and 200 in 2016)?

charlesmartin14 commented 3 years ago

@panicfarm That's a good observation... and why are there these large outlier peaks? Are these statistical fluctuations or structural anomalies? Not sure.

I didn't check Trump's data for 2016, but I assume it is a Truncated Power Law like 2020, not Gaussian.

With regard to using Benford's Law, I think we need to generalize the approach to the constraints on the data we have in voting.

I suspect that Benford's Law data is (almost always*) heavy-tailed, but heavy-tailed data is not always Benford, especially in cases like this, where there may be a soft lower bound on the data and the sample sizes are small.

So, from Benford, we conjecture that naturally occurring data is heavy-tailed, but with finite-size effects.

If we want to detect fraud or other unusual patterns, we want to look for the signatures of finite-size heavy-tailed behavior. I think we want to look at both the statistics of these large fluctuations near the mean and the shape of the tail.

That is, are the deviations from normality within the bounds of the central limit theorem for data of this size, or are they way outside?

And, of course, does this tell us anything, or raise any suspicions, about the true voting patterns?
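
One simple way to ask whether deviations from normality are within sampling bounds for data of a given size is to compare the sample skewness and excess kurtosis with their approximate standard errors under normality, sqrt(6/n) and sqrt(24/n). The sketch below does this with hypothetical helper names and synthetic data only. It is one possible check among many, and nothing here has been run on the vote counts.

```python
import math
import random
import statistics

def normality_z_scores(sample):
    """Return (z_skew, z_kurt): skewness and excess kurtosis in standard-error units."""
    n = len(sample)
    mean = statistics.fmean(sample)
    m2 = sum((x - mean) ** 2 for x in sample) / n
    m3 = sum((x - mean) ** 3 for x in sample) / n
    m4 = sum((x - mean) ** 4 for x in sample) / n
    skewness = m3 / m2 ** 1.5
    excess_kurtosis = m4 / m2 ** 2 - 3.0
    # Large-sample standard errors when the data really are normal.
    return skewness / math.sqrt(6.0 / n), excess_kurtosis / math.sqrt(24.0 / n)

# Synthetic samples for illustration only (not election data).
rng = random.Random(4)
bell = [rng.gauss(100, 20) for _ in range(2000)]
skewed = [rng.expovariate(1 / 100) for _ in range(2000)]

z_skew_bell, z_kurt_bell = normality_z_scores(bell)
z_skew_skewed, z_kurt_skewed = normality_z_scores(skewed)
print(f"bell-shaped:  z_skew={z_skew_bell:.1f}  z_kurt={z_kurt_bell:.1f}")
print(f"right-skewed: z_skew={z_skew_skewed:.1f}  z_kurt={z_kurt_skewed:.1f}")
```

Z-scores within a few units are what sampling noise alone would produce at that sample size; values far outside that range suggest a real departure from a Gaussian shape, though they say nothing about its cause.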

robscovell-ts commented 3 years ago

We know a few things about the data generating process. We know the data are discrete. We also know that the number of votes can't drop below 0.

There are (mostly) two sets of counts happening in each ward: the number of Trump supporters arriving and the number of Biden supporters arriving. That suggests we approximately have two Poisson processes in each ward, with different rates between wards. So the whole county could be modelled by a set of independent but non-identically distributed tuples of Poisson random variables. There are other possibilities, of course. This may not be the best choice.

The above suggests it's unsurprising that we are seeing Poisson-like behaviour overall in the final distribution of counts.

That could kinda work if you are considering a normal voting scenario where people vote in person, although factors like voting before or after work would result in non-Poisson clusters. However, the disputed votes are postal votes, which invalidates use of a Poisson model for those votes.

charlesmartin14 commented 3 years ago

@robscovell-ts The issue I am raising in this thread is that I don't see why real voting data like this would ever be Poisson (or Gaussian). It should be heavy-tailed, even if it's not Benford.

Compare the Biden 2020 Election Day (not postal) data with the Hillary 2016 data.

Screen Shot 2020-11-11 at 7 05 25 AM

Here, the tails of the distribution (say > 400) are very similar; Trump and Hillary's data strongly overlap.

Of course, many of the voting rules in PA were changed for 2020, so who knows how that affects the comparison.

This is what I think we need to check, and for different cases.

charlesmartin14 commented 3 years ago

Here's the data for Trump vs. Biden in the districts each of them won:

Trump data in the Trump winning Districts; Biden data in the Biden districts

I think this is consistent with the data in #17 presented by @markr-github, where the Biden data is Benford-like in the Trump winning districts.

Notice that the Trump data, however, does have a long tail in the Trump-winning districts, despite it clearly being non-Benford. It does not look Gaussian or Poisson to me.

Screen Shot 2020-11-11 at 8 28 15 AM

Election Day Data for both candidates in Biden-winning districts

As expected, the Trump data is Benford-like (or maybe Poisson, I have not checked carefully, but I doubt it.)

Screen Shot 2020-11-11 at 8 29 57 AM

Election Day Data for both candidates in Trump-winning districts

Now we see the Biden non-Benford data (which seems weird in the Trump-winning districts?) and we do see a little bit of tail in the Biden data...

This is the data that looks suspicious...

Screen Shot 2020-11-11 at 8 17 03 AM

(I encourage others to double check these results to confirm.)

charlesmartin14 commented 3 years ago

I have a working hypothesis now for why this data might be Gaussian in case of fraud. It has been suggested that the Biden Election Day votes were changed into mail-in votes in order to make the distribution of fake mail-in votes seem more realistic. If this was done naively, then it would have been done in such a way as to leave the Election Day data looking random Gaussian, instead of the more realistic random heavy-tailed.