cjph8914 / 2020_benfords

Milwaukee ward sizes are small and there is a highly preferred candidate #17

Open frycast opened 3 years ago

frycast commented 3 years ago

The disappearance of Benford's law in Milwaukee is a function of voter preference alone. If one candidate has between 60% and 80% average chance of receiving a vote, then the sizes of the wards in Milwaukee are too small to accommodate Benford's law. See further details with my simulations here https://rpubs.com/frycast/687633

Edit: Not just too small, but too concentrated. They do not span many orders of magnitude.
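A minimal sketch of the kind of simulation described above, not the code behind the linked RPubs page. The ward sizes (300 to 900 voters), the 70% preference, and the seed are hypothetical values chosen only to illustrate the argument.

```python
import math
import random
from collections import Counter

def first_digit(n):
    """Leading decimal digit of a positive integer."""
    return int(str(n)[0])

def digit_shares(counts):
    """Share of each leading digit 1..9 among the positive counts."""
    digits = Counter(first_digit(c) for c in counts if c > 0)
    total = sum(digits.values())
    return [digits[d] / total for d in range(1, 10)]

def simulate_votes(ward_sizes, p, rng):
    """One binomial draw per ward: each voter picks the candidate with probability p."""
    return [sum(rng.random() < p for _ in range(n)) for n in ward_sizes]

rng = random.Random(0)
# Hypothetical ward sizes, all within a single order of magnitude.
ward_sizes = [rng.randint(300, 900) for _ in range(400)]
favoured = simulate_votes(ward_sizes, 0.70, rng)  # assumed 70% preference
benford = [math.log10(1 + 1 / d) for d in range(1, 10)]

for d, (sim, ben) in enumerate(zip(digit_shares(favoured), benford), start=1):
    print(f"digit {d}: simulated {sim:.3f}  Benford {ben:.3f}")
```

With these assumed sizes, the favoured candidate's counts land roughly between 200 and 650, so a leading digit of 1 almost never appears, even though nothing in the simulated data has been manipulated.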

Edit 2: The thread below becomes distracted by an effort to look into election data anomalies that are not directly related to this issue. My intention here is not to develop a fraud detection tool, but to highlight the major flaws with the one being used, and currently being touted by various news sources as evidence of fraud. So far, this issue is still open, and should be resolved by at least adding some comments to the README clarifying that the pattern observed in Milwaukee is a pattern that can arise in election data in the absence of fraud. Hopefully the owner of this popular repository, and the people involved here in this thread, are all interested in acting in good faith, and will focus on resolving the issue.

chavenor commented 3 years ago

@1e100 see attached for all in single-line formatting: fl-single-line.zip

chavenor commented 3 years ago

Let me know if you have any more you want me to run -- very interested to see the results on this one.

frycast commented 3 years ago

Your simulation is not the kind of baseline I mean. It is a finite size estimator using a different prior. And it's not really even that because you did not do any statistical estimation to see if the underlying vote data fits the distributional form you are simulating.

Just to clarify again. The simulation is not intended to estimate anything, such as the distributional form of the vote data. That would be counterproductive since the distributional form of the vote data is the object under suspicion. To me it's important that the criticism on here remains relevant to the purpose of the thread. That's why I originally downvoted your response. Downvote removed.

When researchers look for fraud or other anomalies, they are looking to see if the data is random, or if there are correlations that cause the data to exhibit fluctuations or other patterns that are statistically unlikely (assuming, say, i.i.d. data). You haven't done that yet.

Of course. That's not what I intend to do, since developing a fraud detection tool is not the purpose of this particular open issue. Perhaps start a new issue, to avoid confusing people coming here to contribute to this one. I totally commend your efforts to dig deeper, but I think it needs to be done in an organised manner, or else it just appears to be politically driven, and will confuse people about whether this issue is resolved.

I want to create a baseline to examine the prior on the probabilities being estimated from the empirical frequencies in the 2020 data. This would then be used in downstream analysis of the distributions...

This should probably be the focus of the new issue that should be opened.

frycast commented 3 years ago

Come to think of it, one thing that could explain my vote/turnout histograms above is the preference of Trump voters for in-person voting. That is, if both sides vote at some fraction by mail (say 70/30 for Biden/Trump in this case) but then still more Trump voters also show up in person, that will, at the same time, boost turnout for Trump, and reduce Biden's percentage of the vote. But, of course, this needs to be verified with data, because what I have here is pure conjecture.

This is another interesting question that perhaps belongs in a whole separate repository, that could be owned by a more active user, for the purpose of investigating all of these interesting properties of the election data.

SageGaspar commented 3 years ago

I've pushed the code to https://github.com/1e100/2020_benfords. Disclaimer, once again: I do not claim there is any fraud here. I'd like to see an explanation to Biden's "the lower the turnout, the higher the vote"

The entire story of elections in America is that democrats outnumber republicans, but republicans show up to vote. There are multiple theories around this ranging from just being less enthusiastic about their candidates to voter suppression, but that's out of the scope of this.

For some evidence, as of this election there are 4.2 mil dems vs 3.5 mil republicans in PA: https://docs.google.com/spreadsheets/d/1LEkTZN_1Ee5AVkxqgVdh1OWadz85qxqxU8HHF8BvcCY/edit?usp=sharing

However with most of the vote in, there are only 3.35 million votes for Biden, and 3.31 million votes for Trump. So the expectation is that turnout among dems would be proportionally lower. These graphs are consistent with that -- districts where Biden has a higher percentage of the vote tend to have a higher percentage of democrats who tend to drag down the turnout percentage. There are just more people in those districts, so that lower percentage equates to a higher raw total of votes.
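The arithmetic behind that expectation can be sketched from the four figures quoted above. The ratio below is only a crude proxy, assuming votes map onto same-party registrations; independents and crossover voters are ignored, so it is not a true turnout rate.

```python
# Figures quoted in the comment above, in millions (PA, with most of the vote in).
registered = {"Dem": 4.2, "Rep": 3.5}
votes = {"Biden": 3.35, "Trump": 3.31}

# Crude proxy: candidate votes per same-party registration.
# Assumption: independents and crossover voters are ignored.
biden_ratio = votes["Biden"] / registered["Dem"]
trump_ratio = votes["Trump"] / registered["Rep"]

print(f"Biden votes per registered Dem: {biden_ratio:.2f}")
print(f"Trump votes per registered Rep: {trump_ratio:.2f}")
```

On these numbers the ratio is about 0.80 for Biden against about 0.95 for Trump, which is the proportionally lower Democratic turnout the comment refers to.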

Here are the results from Allegheny in 2016 showing a similar correlation: https://imgur.com/a/nmdtoh4

Here's a google sheet for Allegheny from 2016 if you want to play with the data yourself: https://docs.google.com/spreadsheets/d/1r9fVxYwIKQkUz8SYHCQmxWSG5bS8GE6fOnN_YtOvAwk/edit?usp=sharing

charlesmartin14 commented 3 years ago

@frycast

Fair enough.

Generally speaking, however, you cannot use an arbitrary random number generator to create data and then expect it to be Benford. Random data is not Benford.

So you may be using a methodology that generates non-Benford data in all cases, and claiming it is evidence that the distributions are non-Benford on certain subcases. That may not be a good baseline.
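Whether generated data looks Benford depends heavily on the generating distribution, in particular on how many orders of magnitude it spans. The sketch below compares three hypothetical generators (the distributions, parameters, and seed are illustrative assumptions, not anything used in this thread).

```python
import math
import random
from collections import Counter

def leading_digit_shares(values):
    """Share of leading digits 1..9 for values >= 1 (integer part is used)."""
    digits = Counter(int(str(int(v))[0]) for v in values if v >= 1)
    total = sum(digits.values())
    return [digits[d] / total for d in range(1, 10)]

def distance_from_benford(shares):
    """Largest absolute gap between observed shares and Benford's law."""
    benford = [math.log10(1 + 1 / d) for d in range(1, 10)]
    return max(abs(s - b) for s, b in zip(shares, benford))

rng = random.Random(1)
n = 20_000
samples = {
    # Narrow Gaussian, similar in spirit to counts clustered around 350.
    "normal(350, 50)": [rng.gauss(350, 50) for _ in range(n)],
    # Uniform within a single order of magnitude.
    "uniform(100, 1000)": [rng.uniform(100, 1000) for _ in range(n)],
    # Log-uniform over four orders of magnitude.
    "log-uniform(1, 10^4)": [10 ** rng.uniform(0, 4) for _ in range(n)],
}

gaps = {name: distance_from_benford(leading_digit_shares(v)) for name, v in samples.items()}
for name, gap in gaps.items():
    print(f"{name:22s} max gap from Benford: {gap:.3f}")
```

Only the log-uniform sample should sit close to Benford's law. The other two deviate strongly by construction, which is the risk being raised here: a baseline generator can be non-Benford in every case.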

That said, I suspect the qualitative shape of the digit distributions you generated is probably correct.

But I think doing more simulations to create a distributional baseline in this way would open up more questions and not really help the story here.

1e100 commented 3 years ago

I've pushed the code to https://github.com/1e100/2020_benfords. Disclaimer, once again: I do not claim there is any fraud here. I'd like to see an explanation to Biden's "the lower the turnout, the higher the vote"

The entire story of elections in America is that democrats outnumber republicans, but republicans show up to vote. There are multiple theories around this ranging from just being less enthusiastic about their candidates to voter suppression, but that's out of the scope of this.

For some evidence, as of this election there are 4.2 mil dems vs 3.5 mil republicans in PA: https://docs.google.com/spreadsheets/d/1LEkTZN_1Ee5AVkxqgVdh1OWadz85qxqxU8HHF8BvcCY/edit?usp=sharing

However with most of the vote in, there are only 3.35 million votes for Biden, and 3.31 million votes for Trump. So the expectation is that turnout among dems would be proportionally lower. These graphs are consistent with that -- districts where Biden has a higher percentage of the vote tend to have a higher percentage of democrats who tend to drag down the turnout percentage. There are just more people in those districts, so that lower percentage equates to a higher raw total of votes.

Here are the results from Allegheny in 2016 showing a similar correlation: https://imgur.com/a/nmdtoh4

Here's a google sheet for Allegheny from 2016 if you want to play with the data yourself: https://docs.google.com/spreadsheets/d/1r9fVxYwIKQkUz8SYHCQmxWSG5bS8GE6fOnN_YtOvAwk/edit?usp=sharing

Could be. According to this data GOP voter turnout in Allegheny slightly exceeds 100% (so there are probably some unaffiliated in it), whereas Joe's voter turnout is about 74%. I'm not sure what "count of all other voters" means in the spreadsheet, though.

charlesmartin14 commented 3 years ago

On the voter participation data...

I think the question that is relevant to this thread is,

If the Biden vote counts are distributed normally in the range 300-400 (and are therefore non-Benford), whereas the Trump counts are Benford-like, are the overall voter participation trends consistent with the vote distributions seen?

frycast commented 3 years ago

So you may be using a methodology that generates non-Benford data in all cases, and claiming it is evidence that the distributions are non-Benford on certain subcases.

That is a good criticism.

I think this isn't a problem here though. My argument for this is that the simulated vote count distributions do look visually very similar to the observed ones, for both Biden and Trump (and not just the Benford distributions).

A visual comparison would not be sufficient in many cases, but in this case, especially since no inference is being made about the true distribution, we can see that, even if the underlying data generating process is being misrepresented, there is enough agreement to justify a visual comparison of the Benfords.

So a clearer conclusion is: if the data are generated binomially, with no difference in DGP between Biden and Trump other than the probability of receiving a vote, then the observed data visually agree in count and Benford distribution for both Biden and Trump.

charlesmartin14 commented 3 years ago

@frycast That's fine

I think the hump at 3 for the Biden first digit data can also be inferred just by looking at the distribution of vote counts. And that's the object to look at: See #31

markr-github commented 3 years ago

frycast, you seem right.

With most precincts >500 voters, winners' counts will generally start with 2 or more, so anyone who wins many counties won't look like Benford's Law. If you want to apply Benford, then I'd apply Hitchens and say you need to demonstrate that the numbers should follow it: "That which can be asserted without evidence, can be dismissed without evidence."

Here are the Allegheny PA vote counts split by precinct winner.

[Image: Allegheny_candidate_counts_by_winner]

The Trump counts in Trump precincts don't follow Benford but that's not evidence that Trump counties were committing fraud to give him the election.

Similarly, in the 2019 UK elections the Conservatives won most seats. Below are counts by constituency for the four largest parties. The Conservative non-Benford-ness is not fraud; it's just the result when you win lots of ~50k-vote constituencies and typically run the vote close when you lose.

[Image: UK_election_benfords]

charlesmartin14 commented 3 years ago

@markr-github That's very interesting. Let me suggest plotting the vote count distributions themselves.

Benford's Law data is heavy-tailed, but heavy-tailed data may not be Benford.

We can see if the data is heavy-tailed or not by looking at plots of the vote count distributions

(and by checking the tail statistics; you can use the powerlaw packages in R or python to do this)

Also, can you share the data sets and notebooks, if they are checked in?

markr-github commented 3 years ago

@charlesmartin14 Didn't see anything in the vote distributions, total votes per precinct, etc. that says Benford's should apply to these numbers.

Not in a notebook, but the code and data are here: https://github.com/markr-github/benford-election

charlesmartin14 commented 3 years ago

@markr-github. Thanks I'll take a look after work today.

I think what we have learned so far is that when we see deviations from Benford's Law, the data is clustered around a high value (say 100-200 votes). I'm just a bit surprised that in the cases I have looked at so far (Biden's Election Day data for Allegheny) the data appears nearly perfectly Gaussian and not seemingly heavy-tailed (i.e. Biden's Absentee Data, Trump's data, etc.). That is, it appears that there are (unusually?) very few districts with really high turnout for Biden. See #31

But maybe there is just not enough data to see the tail? That could certainly be, and it may be necessary to study total Biden districts across, say, an entire state? I'm still checking that and need to do more careful tests.

This also, however, appears to be how the Biden vote distributions behave. That's in the charts that @andrewzigerelli is showing above, if I understand this correctly. The higher the turnout in a district, the lower the Biden percentage. And the exact opposite for Trump.

markr-github commented 3 years ago

@charlesmartin14 Seems like a different topic to the Benford issues?

That relationship between turnout and margin is exactly what I would have guessed beforehand, so it doesn't surprise me.

In every US presidential election with data, the highest-turnout ethnic group has been "white non-hispanic": https://www.statista.com/statistics/1096113/voter-turnout-presidential-elections-by-ethnicity-historical/ And turnout is higher for older versus younger voters: https://www.politifact.com/article/2020/mar/04/closer-look-turnout-young-voters-and-key-bernie-sa/ I would expect that groups that are older and more non-hispanic white will (i) have higher turnout and (ii) have a more pro-Trump margin.

If you were convinced that fraud was happening then the naive approach would be to look at high-turnout areas, since more ballots increases the probability of "fake" ballots being included. I don't think there's any evidence that Trump precincts were fabricating votes though.

charlesmartin14 commented 3 years ago

@markr-github

Seems like a different topic to the Benford issues?

I asked the question because I see Benford's Law as a statistical test for heavy-tailed behavior, characteristic of natural (i.e. not fake) data. I agree, I don't think it can be interpreted using a naive approach. However, there are other tests for heavy-tailed behavior, more suitable to finite-size systems, that might prove more useful here.

The simplest of these is to fit the tail of the data to a truncated power law distribution, and then compare this to an exponential distribution using a non-parametric Kolmogorov–Smirnov test; see #31
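A simplified, standard-library sketch of that idea follows. It makes several assumptions that differ from a full analysis: it fits a plain (untruncated) continuous power law, uses a fixed hypothetical `xmin`, runs on synthetic Pareto data rather than vote counts, and compares Kolmogorov–Smirnov distances descriptively instead of computing a p-value. The `powerlaw` package mentioned earlier in the thread is the more complete tool.

```python
import math
import random

def ks_distance(sorted_x, cdf):
    """Kolmogorov-Smirnov distance between the empirical CDF and a model CDF."""
    n = len(sorted_x)
    return max(
        max(abs((i + 1) / n - cdf(x)), abs(i / n - cdf(x)))
        for i, x in enumerate(sorted_x)
    )

def compare_tail_fits(data, xmin):
    """Fit power-law and exponential tails above xmin by maximum likelihood
    and return the KS distance of each fit (smaller means a closer fit)."""
    tail = sorted(x for x in data if x >= xmin)
    n = len(tail)
    # Continuous power law: density proportional to x^(-alpha) for x >= xmin.
    alpha = 1 + n / sum(math.log(x / xmin) for x in tail)
    # Shifted exponential: density proportional to exp(-lam * (x - xmin)).
    lam = n / sum(x - xmin for x in tail)

    def power_cdf(x):
        return 1 - (x / xmin) ** (1 - alpha)

    def expo_cdf(x):
        return 1 - math.exp(-lam * (x - xmin))

    return {
        "alpha": alpha,
        "ks_power_law": ks_distance(tail, power_cdf),
        "ks_exponential": ks_distance(tail, expo_cdf),
    }

rng = random.Random(2)
# Synthetic stand-in for vote counts: Pareto samples with exponent alpha = 2.5.
synthetic = [100 * (1 - rng.random()) ** (-1 / 1.5) for _ in range(2000)]
result = compare_tail_fits(synthetic, xmin=100)
print(result)
```

On heavy-tailed input like this, the power-law fit should have the smaller KS distance. On data that is close to Gaussian, neither fit may be adequate, which is itself informative.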

more ballots increases the probability of "fake" ballots

But it also increases the probability of "real" ballots, so it says nothing about the signal-to-noise ratio, which will certainly affect any estimator we use.

charlesmartin14 commented 3 years ago

@frycast

And notice...Taleb also used normal random data as an example of Benford

https://twitter.com/nntaleb/status/1326212740273278978

This seems qualitatively correct.

I don't think it's helpful, however, to chime in. I prefer to avoid a flame war on Twitter.

There are lots of smart people here and I think we should just figure this out ourselves. Maybe there is something here, maybe not. I'm hoping to see more once we dig into the vote distributions.

chavenor commented 3 years ago

This just hit the web. Do we have a way to check this or comment on it? Do we need to open another issue? https://www.pscp.tv/w/1BdGYYjgkgQGX

MechanicalTim commented 3 years ago

I would encourage anyone planning on watching that video to read Dr. Shiva Ayyadurai's Wikipedia page as well. Here are the first few sentences, for your convenience:

V. A. Shiva Ayyadurai (born Vellayappa Ayyadurai Shiva,[2] December 2, 1963)[3] is an Indian-American scientist, engineer, politician, entrepreneur, and promoter of conspiracy theories and unfounded medical claims. He is notable for his widely discredited claim to be the "inventor of email".

chavenor commented 3 years ago

@MechanicalTim agreed. Can the data be grabbed and can we run this on our own to either confirm or deny the outcome?

charlesmartin14 commented 3 years ago

@chavenor We should try to get the data ourselves. I would also suggest reaching out to the researcher at MIT.

chavenor commented 3 years ago

@charlesmartin14 I'm way ahead of you. Already asked on Twitter. Who was the guy from MIT? Did they have their info on that presentation?

[Image: ok-fine]

chavenor commented 3 years ago

I found the other guys and reached out on LinkedIn. Hope they can share their data with us so we can double-check it.

alexsullivan114 commented 3 years ago

@chavenor For reference, Dr. Shiva Ayyadurai ran for the senate as a Republican in Massachusetts. He's considered a bit of a joke over here.

charlesmartin14 commented 3 years ago

@alexsullivan114 It doesn't matter. What matters is getting data and doing our own honest analysis.

chavenor commented 3 years ago

@alexsullivan114 there seems to be a trend that anyone that is anti-establishment gets the "crazy stamp" -- I've moved beyond that prism.

They made claims. I've asked for the data. If we get it and can verify the results then that is all the proof we should need.

I didn't see that @charlesmartin14 already responded. Tossed ya a thumbs-up; happy to have your input.

alexsullivan114 commented 3 years ago

Sure - totally fair. I was just trying to add some context about who this person was - of course the data should stand on its own.

charlesmartin14 commented 3 years ago

@alexsullivan114

There are 3 people presenting, one of whom is a state election commissioner. https://www.shelbyvote.com/team/bennie-smith

Remember also that there are claims that some media companies like Twitter, CNN, etc. are actively censoring information claiming to be (potential) evidence of fraud. So he may be forced to go 'underground', so to speak.

The data should speak for itself

MechanicalTim commented 3 years ago

It seems to me that the Shiva stuff is a case of deliberately deceptive plotting.

They display plots using the following data:

  • fraction who voted "straight Republican" (but guessing this means for non-Prez races?)
  • fraction who voted for Trump

If we posit that people who are more likely to vote straight Republican are more likely to vote for Trump, then the mean percentages voting for Republican, and voting for Trump, might look something like this:

repub_prec_fraction = [20; 30; 40; 50; 60; 70; 80]; % and rough approx of "straight Rep"
trump_likelihood = [25; 30; 35; 40; 45; 50; 55];

THE ABOVE ARE NOT REAL DATA! USED FOR ILLUSTRATIVE PURPOSES ONLY!

(Also, excuse the MATLAB syntax.)

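Here are two subplots:

  • Top: Plot that relationship straightforwardly
  • Bottom: Plot it using the contrived variable from the video

[Image: Shiva nonsense]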

(In Shiva's plot, there is of course the random scatter of real data around those lines.)

He then claims that this shape is somehow evidence of Biden stealing votes from Trump.

I have admittedly over-simplified a bit, for the sake of making my fundamental point more directly. But I think this is at the heart of Shiva's plot. I think he is obscuring truth, not revealing it.
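The toy example can be reproduced numerically. The sketch below reuses the illustrative values from the MATLAB snippet above (not real data) and assumes the contrived variable is the Trump share minus the straight-Republican share; that form is an assumption here, not something confirmed from the video.

```python
# Toy values copied from the MATLAB snippet above. NOT REAL DATA.
repub_prec_fraction = [20, 30, 40, 50, 60, 70, 80]  # % voting straight Republican
trump_likelihood = [25, 30, 35, 40, 45, 50, 55]     # % voting for Trump

# Assumed form of the contrived variable: Trump share minus straight-Republican share.
difference = [t - r for t, r in zip(trump_likelihood, repub_prec_fraction)]

def slope(xs, ys):
    """Ordinary least-squares slope of ys against xs."""
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    num = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    den = sum((x - mx) ** 2 for x in xs)
    return num / den

print("direct slope:     ", slope(repub_prec_fraction, trump_likelihood))
print("difference slope: ", slope(repub_prec_fraction, difference))
print("difference values:", difference)
```

Any direct relationship with a slope below 1 (0.5 in this toy data) turns into a downward line once the x variable is subtracted from y (0.5 minus 1 gives -0.5), so a declining trend in the second plot follows from the choice of variable alone.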

Shiva does other deceptive things on the plot, like adding lines to "guide the eye", which, if you ignore them, you realize do not actually follow the data. There are also edge effects on the plot, that he ignores. Finally, he also makes verbal statements that are similarly deceiving.

I rate the video 1 out of 10. Would not watch again. (Disclaimer: I only watched the first 37 minutes before writing this.)

charlesmartin14 commented 3 years ago

This should be moved to another thread

chavenor commented 3 years ago

@MechanicalTim I do not believe that is what they are saying. I took "straight ticket" as assuming that all Republicans vote for Trump, so as a precinct gets more Republican you would expect the number to be at 0%, not down at -25%. Also, this does play into the discussion above about lower Dem turnout and trying to figure out where the votes came from.

I'll wait for the data so we can just see what they did.

Moved here. https://github.com/cjph8914/2020_benfords/issues/38

RexRookie commented 3 years ago

It seems to me that the Shiva stuff is a case of deliberately deceptive plotting.

They display plots using the following data:

  • fraction who voted "straight Republican" (but guessing this means for non-Prez races?)
  • fraction who voted for Trump

If we posit that people who are more likely to vote straight Republican are more likely to vote for Trump, then the mean percentages voting for Republican, and voting for Trump, might look something like this:

repub_prec_fraction = [20; 30; 40; 50; 60; 70; 80]; % and rough approx of "straight Rep"
trump_likelihood = [25; 30; 35; 40; 45; 50; 55];

THE ABOVE ARE NOT REAL DATA! USED FOR ILLUSTRATIVE PURPOSES ONLY!

(Also, excuse the MATLAB syntax.)

Here are two subplots:

  • Top: Plot that relationship straightforwardly
  • Bottom: Plot it using the contrived variable from the video

[Image: Shiva nonsense]

(In Shiva's plot, there is of course the random scatter of real data around those lines.)

He then claims that this shape is somehow evidence of Biden stealing votes from Trump.

I have admittedly over-simplified a bit, for the sake of making my fundamental point more directly. But I think this is at the heart of Shiva's plot. I think he is obscuring truth, not revealing it.

Shiva does other deceptive things on the plot, like adding lines to "guide the eye", which, if you ignore them, you realize do not actually follow the data. There are also edge effects on the plot, that he ignores. Finally, he also makes verbal statements that are similarly deceiving.

I rate the video 1 out of 10. Would not watch again. (Disclaimer: I only watched the first 37 minutes before writing this.)

It's exactly what happens with their plots, that's the whole story :) Well said.

chavenor commented 3 years ago

https://github.com/stunnashades/ga-discrepancies/blob/main/lean-vs-delta.png