cjph8914 / 2020_benfords


Milwaukee ward sizes are small and there is a highly preferred candidate #17

Open frycast opened 3 years ago

frycast commented 3 years ago

The disappearance of Benford's law in Milwaukee is a function of voter preference alone. If one candidate has between 60% and 80% average chance of receiving a vote, then the sizes of the wards in Milwaukee are too small to accommodate Benford's law. See further details with my simulations here https://rpubs.com/frycast/687633

Edit: Not just too small, but too concentrated. They do not span many orders of magnitude.

Edit 2: The thread below becomes distracted by an effort to look into election data anomalies that are not directly related to this issue. My intention here is not to develop a fraud detection tool, but to highlight the major flaws with the one being used, and currently being touted by various news sources as evidence of fraud. So far, this issue is still open, and should be resolved by at least adding some comments to the README clarifying that the pattern observed in Milwaukee is one that can arise in election data in the absence of fraud. Hopefully the owner of this popular repository, and the people involved here in this thread, are all interested in acting in good faith, and will focus on resolving the issue.

chavenor commented 3 years ago

I believe you are correct.

https://results.enr.clarityelections.com/PA/Allegheny/63905/Web02.193333/#/cid/0104

tcauth commented 3 years ago

The disappearance of Benford's law in Milwaukee is a function of voter preference alone. If one candidate has between 60% and 80% average chance of receiving a vote, then the sizes of the wards in Milwaukee are too small to accommodate Benford's law. See further details with my simulations here https://rpubs.com/frycast/687633

but according to your simulation: Trump's 39% chance should also show some observable difference from Benford's law, however ...

dshield55 commented 3 years ago

@frycast thanks for your great work. I'd like to second @tcauth; I'm wondering if there's a reason for that.

chavenor commented 3 years ago

@tcauth would the smaller portion of votes possibly not move the needle due to the fact that they are blending in with the rest of the wards of equal size?

chavenor commented 3 years ago

@tcauth what does the 2016 election look like -- are we able to get the data to run it?

charlesmartin14 commented 3 years ago

If Biden is 61%, then say Trump is 39% (ignoring the other candidates). We can take a screenshot to see the Benford's law plot for p=0.30 and see that it is violated for this case as well.

[Image: Screen Shot 2020-11-07 at 6 15 09 PM]

charlesmartin14 commented 3 years ago

And this is the point of one of the main research papers, which cautions against using Benford's law in these situations.

https://www.cambridge.org/core/journals/political-analysis/article/benfords-law-and-the-detection-of-election-fraud/3B1D64E822371C461AF3C61CE91AAF6D?fbclid=IwAR2x18HnGzK7rQDrDhid25i-gUMo31xo6tyJT7UMK97YJWva7vbPApHWKSg


I think the issue is to have the proper baselines, which may be determined by looking at the 2016 data.

frycast commented 3 years ago

but according to your simulation: Trump's 39% chance should also show some observable difference from Benford's law, however ...

Trump's average vote probability was roughly 26% in Milwaukee for this election, and Biden's was roughly 73%. With those numbers, Trump gets a nice Benford's law in my simulation, and Biden gets the observed spike around the digit 4. I've updated the notebook at the same link so you can see Trump's distributions too: https://rpubs.com/frycast/687633

It makes sense: the populations in Milwaukee Wards are often around 500-1000, so 73% of a random Milwaukee ward population is a number that starts with a digit somewhere between 3 and 7.
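
A minimal sketch of that argument, not the notebook's actual code. The ward sizes below are hypothetical (uniform on 500 to 1000), and each ward's count is a binomial draw with p = 0.73:

```python
import random
from collections import Counter

def first_digit(n: int) -> int:
    """Return the leading decimal digit of a positive integer."""
    return int(str(n)[0])

def simulate_ward_votes(ward_sizes, p, rng):
    """One binomial draw per ward: each voter picks the candidate with probability p."""
    return [sum(rng.random() < p for _ in range(size)) for size in ward_sizes]

rng = random.Random(0)
# Hypothetical ward sizes (an assumption, not the real Milwaukee ward data).
ward_sizes = [rng.randint(500, 1000) for _ in range(300)]

biden = simulate_ward_votes(ward_sizes, 0.73, rng)
digits = Counter(first_digit(v) for v in biden if v > 0)
share_3_to_7 = sum(digits[d] for d in range(3, 8)) / len(biden)

for d in range(1, 10):
    print(d, digits[d])
print(f"share of counts with first digit 3-7: {share_3_to_7:.2f}")
```

With sizes this concentrated, nearly every simulated count starts with a digit from 3 to 7, so the first-digit histogram cannot follow Benford's law. The real ward sizes are more spread out, so the exact shape will differ.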

tcauth commented 3 years ago

Makes sense. Can we assume that if the samples are large enough, at the state -> county level, the distribution should fit better? The NYTimes pages provide a very good HTML source for that kind of data, so pd.read_html() can read them directly. I got a MI chart from this: https://www.nytimes.com/interactive/2020/11/03/us/elections/results-michigan.html

https://github.com/tcauth/benford2020usvote/blob/main/benford.ipynb

tcauth commented 3 years ago

but according to your simulation: Trump's 39% chance should also show some observable difference from Benford's law, however ...

Trump's average vote probability was roughly 26% in Milwaukee for this election, and Biden's was roughly 73%. With those numbers, Trump gets a nice Benford's law in my simulation, and Biden gets the observed spike around the digit 4. I've updated the notebook at the same link so you can see Trump's distributions too: https://rpubs.com/frycast/687633

It makes sense: the populations in Milwaukee Wards are often around 500-1000, so 73% of a random Milwaukee ward population is a number that starts with a digit somewhere between 3 and 7. And I suppose the 1 should be the most important one, right?

1e100 commented 3 years ago

Care to explain this? The lower the turnout, the higher the Biden advantage. [image: milwaukee_wards]

1e100 commented 3 years ago

Here's Allegheny: [image: allegheny]

tcauth commented 3 years ago

Care to explain this? The lower the turnout, the higher the Biden advantage. [image: milwaukee_wards]

What are the x-axis and y-axis? Turnout and votes?

1e100 commented 3 years ago

Y is turnout, X is percentage of the votes for a given candidate in each precinct. Color is the size of the bucket. It's a 2D histogram.

tcauth commented 3 years ago

Y is turnout, X is percentage of the votes for a given candidate in each precinct. Color is the size of the bucket. It's a 2D histogram.

And left is Biden right is Trump?

1e100 commented 3 years ago

Yes. Earlier comment I made had them inadvertently swapped on the Allegheny histogram. I've deleted that comment and uploaded the correct one. I could push my fork if someone wants to play with this. In particular, it'd be interesting to see if this is a normally occurring phenomenon by comparing these with a county where Trump won. I started looking into Miami Dade, but precinct-level voter turnout data is not easily available for it, unfortunately.

1e100 commented 3 years ago

To be clear (for folks who will repost this on social media) - I'm not alleging any fraud or anything like that. Just surfacing a pattern which I thought was counterintuitive, and seeking an explanation.

tcauth commented 3 years ago

Yes. Earlier comment I made had them inadvertently swapped on the Allegheny histogram. I've deleted that comment and uploaded the correct one. I could push my fork if someone wants to play with this. In particular, it'd be interesting to see if this is a normally occurring phenomenon by comparing these with a county where Trump won. I started looking into Miami Dade, but precinct-level voter turnout data is not easily available for it, unfortunately.

It seems to be very easy to get and process state-level data, which should be statistically very solid.

1e100 commented 3 years ago

Well, precinct level turnout data is technically available, but "available" in this case means a gigantic PDF, from which hundreds of values need to be copied out by hand. I don't have that kind of time or motivation.

frycast commented 3 years ago

Care to explain this? The lower the turnout, the higher the Biden advantage.

That's interesting. I haven't played around with that data yet. Do those turnouts count mail-in ballots? Is there also a negative correlation between turnout and something else that is correlated with Biden support, such as population density?

1e100 commented 3 years ago

Yes:

  1. Turnout = all votes / registered voters
  2. Vote share = votes for candidate / all votes
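
A rough sketch of how these two ratios feed the 2D histogram, not the code behind the plots. The precinct rows are made up and the 10 x 10 binning is an assumption:

```python
from collections import Counter

def turnout(all_votes: int, registered: int) -> float:
    """Turnout = all votes / registered voters."""
    return all_votes / registered

def vote_share(candidate_votes: int, all_votes: int) -> float:
    """Vote share = votes for candidate / all votes."""
    return candidate_votes / all_votes

def histogram_2d(points, bins=10):
    """Count (x, y) points in [0, 1] x [0, 1] into a grid keyed by (x_bin, y_bin)."""
    counts = Counter()
    for x, y in points:
        xb = min(int(x * bins), bins - 1)  # clamp x == 1.0 into the last bin
        yb = min(int(y * bins), bins - 1)
        counts[(xb, yb)] += 1
    return counts

# Made-up precinct rows: (registered voters, all votes, votes for the candidate).
precincts = [(1000, 650, 480), (800, 520, 390), (1200, 900, 405)]

# X is the candidate's vote share, Y is turnout, as in the plots.
points = [(vote_share(c, v), turnout(v, r)) for r, v, c in precincts]
grid = histogram_2d(points)
print(grid)
```

The colour in each plot cell then corresponds to the count stored for that `(x_bin, y_bin)` key.
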
1e100 commented 3 years ago

I haven't studied other confounders, perhaps someone else will.

frycast commented 3 years ago

can we assume that if the samples are large enough to a state -> county level, the distribution should be better?

There's a pretty informative thread evolving on this in general here https://skeptics.stackexchange.com/questions/49782/do-vote-counts-for-joe-biden-in-the-2020-election-violate-benfords-law

pkit commented 3 years ago

@frycast it has nothing to do with what is presented here. They discuss other graphs of unknown source.

charlesmartin14 commented 3 years ago

Trump's average vote probability was roughly 26% in Milwaukee for this election, and Biden's was roughly 73%.

That's very helpful, thanks

I assume you are estimating the true probabilities using the observed frequencies. But what if these are wrong, and significantly so?

I think the big question is: is it possible to detect this kind of fraud, with any level of confidence, using advanced techniques, and then correlate it with real-world behavior?

A good starting point, it seems, would be to look at earlier data from 2016 and try to use it as a baseline for the analysis, and to estimate Biden's and Trump's true 'expected probability', with some error bars, from that frequency data.

In Milwaukee, however, we see that turnout was not significantly different from 2016, and that the expected probabilities for Biden would, in fact, be lower than 2016 in majority-Black wards.

https://www.jsonline.com/story/news/politics/elections/2020/11/07/election-results-milwaukee-turnout-flat-despite-wisconsin-surge/6188097002/

Is it possible to get the 2016 data ?

pkit commented 3 years ago

@charlesmartin14

I think the big question is: is it possible to detect this kind of fraud, with any level of confidence, using advanced techniques, and then correlate it with real-world behavior?

You cannot detect fraud using statistics alone. But you can detect statistical anomalies. Benford's law helps in detecting these. Seeing it trigger for one candidate is an anomaly.

charlesmartin14 commented 3 years ago

You cannot detect fraud using statistics alone. But you can detect statistical anomalies. Benford's law helps in detecting these. Seeing it trigger for one candidate is an anomaly.

Thanks for clarifying. By 'detect fraud' I mean to detect anomalies suggestive of fraud.

(I deleted this last part by accident when editing)

Is it possible to get the 2016 data ?

dshield55 commented 3 years ago

@frycast, you want to run that in base 16 for giggles?

frycast commented 3 years ago

I assume you are estimating the true probabilities using the observed frequencies. But what if these are wrong, and significantly so ?

"Empirical probability" means the probability on the observed data alone. There is no need for any estimate of the true probability here. We only need the empirical probability.

This is because my simulation demonstrates that the disagreement with Benford's law arises even if the true probability is equal to the empirical probability. So, in other words, if there is no fraud, then you still get the observed disagreement in Milwaukee.

My simulation cautions against an erroneous use of Benford's law to try to detect election fraud in Milwaukee, since the observed result is exactly what we would expect if there was no fraud.

charlesmartin14 commented 3 years ago

@frycast, you want to run that in base 16 for giggles?

Why would that help? The whole point of Benford's Law is that the distribution of first digits is scale invariant. The problem with applying it here appears to be mostly related to finite-size effects and the assumption of a uniform prior. But I'm not a specialist in this method, so maybe I am missing something?
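
For reference, the first-digit law in base b gives digit d the probability log_b(1 + 1/d). A small sketch (the function name is my own) that tabulates it for base 10 and base 16:

```python
import math

def benford_expected(base: int = 10) -> dict:
    """Expected first-digit probabilities under Benford's law in the given base."""
    return {d: math.log(1 + 1 / d, base) for d in range(1, base)}

base10 = benford_expected(10)
base16 = benford_expected(16)

print({d: round(p, 3) for d, p in base10.items()})
print({d: round(p, 3) for d, p in base16.items()})
```

Changing the base changes which counts share a leading digit, but it does not widen the range the ward counts span, so the finite-size problem stays.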

One of the academic arguments against using first digit Benford's Law for election forensics is that there is no baseline for comparison.

What I'd like to do is get the 2016 data and try to create a baseline.

frycast commented 3 years ago

@frycast, you want to run that in base 16 for giggles?

One of the academic arguments against using first digit Benford's Law for election forensics is that there is no baseline for comparison.

What I'd like to do is get the 2016 data and try to create a baseline.

My simulation is a kind of baseline. Your suggestion to add error bars is great. I thought about adding some by running the simulation many times. I may add that at some point.

If you use 2016 as a baseline, then (ironically) you run into a version of the problem you brought up yourself: the baseline 2016 empirical probability would be taken as the "true" probability, but would likely differ from the 2020 empirical probability. The advantage of a simulation is that you can control the true probability, to create the baseline yourself.
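
Here is a rough sketch of the "run the simulation many times" idea for error bars. It is not the notebook's code: the ward sizes are hypothetical, the true probability is fixed at 0.73, and the band simply drops the most extreme run on each side:

```python
import random
from collections import Counter

def first_digit_freqs(counts):
    """Fraction of positive counts starting with each digit 1..9."""
    digits = Counter(int(str(c)[0]) for c in counts if c > 0)
    total = sum(digits.values())
    return [digits[d] / total for d in range(1, 10)]

def simulate_once(ward_sizes, p, rng):
    """One simulated election: a binomial vote count per ward, reduced to digit frequencies."""
    votes = [sum(rng.random() < p for _ in range(n)) for n in ward_sizes]
    return first_digit_freqs(votes)

rng = random.Random(1)
ward_sizes = [rng.randint(500, 1000) for _ in range(100)]  # hypothetical sizes
runs = [simulate_once(ward_sizes, 0.73, rng) for _ in range(30)]

for d in range(1, 10):
    values = sorted(run[d - 1] for run in runs)
    low, high = values[1], values[-2]  # rough band: drop the most extreme run on each side
    mean = sum(values) / len(values)
    print(f"digit {d}: mean={mean:.3f} band=[{low:.3f}, {high:.3f}]")
```

Because the true probability is an input, the band describes what a fraud-free election with that probability would produce, independent of any earlier year.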

charlesmartin14 commented 3 years ago

My simulation cautions against an erroneous use of Benford's law to try to detect election fraud in Milwaukee, since the observed result is exactly what we would expect if there was no fraud.

The simulations show that the gross shape of the first digit Benford's law distribution crudely reflects the observed frequencies of the vote counts.

There are other statistics to look at, both using Benford's law and this data.

charlesmartin14 commented 3 years ago

My simulation cautions against an erroneous use of Benford's law to try to detect election fraud in Milwa

What you are saying is not to simply look at plots of the first digit distribution and try to apply Benford's Law 'naively'. And that's helpful. There are more things to do.

My simulation is a kind of baseline. Your suggestion to add error bars is great. I thought about adding some by running the simulation many times. I may add that at some point.

Your simulation is not the kind of baseline I mean. It is a finite size estimator using a different prior. And it's not really even that because you did not do any statistical estimation to see if the underlying vote data fits the distributional form you are simulating. So, yes, doing more simulations would help.

When researchers look for fraud or other anomalies, they are looking to see if the data is random, or if there are correlations that cause the data to exhibit fluctuations or other patterns that are statistically unlikely (assuming, say, i.i.d. data). You haven't done that yet. This could be done using, say, a statistical test (like a Kolmogorov–Smirnov test)
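
As a rough illustration of such a test, here is a sketch that swaps in a Pearson chi-square goodness-of-fit statistic in place of Kolmogorov–Smirnov, since it only needs the standard library. The sample counts are made up, and with this few observations some expected cell counts are small, so treat the result as indicative only:

```python
import math
from collections import Counter

def benford_chi_square(counts):
    """Pearson chi-square statistic of observed first digits against Benford's law."""
    digits = Counter(int(str(c)[0]) for c in counts if c > 0)
    n = sum(digits.values())
    stat = 0.0
    for d in range(1, 10):
        expected = n * math.log10(1 + 1 / d)
        stat += (digits[d] - expected) ** 2 / expected
    return stat

CRITICAL_5PCT_DF8 = 15.507  # chi-square critical value, 8 degrees of freedom, alpha = 0.05

# Made-up counts clustered between 300 and 800, repeated to get 60 observations.
sample = [365, 410, 432, 455, 478, 502, 517, 548, 560, 603, 621, 655, 690, 712, 730] * 4
stat = benford_chi_square(sample)
print(f"chi-square = {stat:.1f}, reject Benford at 5%: {stat > CRITICAL_5PCT_DF8}")
```

A rejection here only says the digits are not Benford; it says nothing about why.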

Benford's Law, therefore, can be used to test the correlation structure in the 2020 data, as given. That's not what I am talking about.

I want to create a baseline to examine the prior on the probabilities being estimated from the empirical frequencies in the 2020 data. This would then be used in downstream analysis of the distributions reported.

This is a serious topic with big implications. Instead of downvoting me, please point out my errors or ask for clarification.

frycast commented 3 years ago

Your simulation is not a baseline.

Let me clarify. I'm not suggesting you use the simulation as a baseline for detecting fraud. However, if you did want to do something like that, then perhaps you could use some of the measures you mentioned to improve it, to make it a reliable baseline.

charlesmartin14 commented 3 years ago

Care to explain this? The lower the turnout, the higher the Biden advantage.

That does seem highly suspect. I think it would be helpful to compare to the 2016 data.

chavenor commented 3 years ago

Yes:

  1. Turnout = all votes / registered voters
  2. Vote share = votes for candidate / all votes

@frycast @1e100

I think this makes sense if there are more votes coming from first-time voters. Do these numbers take Independents into consideration?

Knowing how many non-registered voters voted would be pretty interesting. Is there a single data source for all of this information?

charlesmartin14 commented 3 years ago

@frycast

One has to be careful applying Benford's Law to randomly generated data for any kind of simulation, such as creating what I'll call a distributional baseline for Benford's Law.

Here is a simple example of normally distributed data: despite the shape of the data, with a simple statistical test, we see it is clearly not distributed according to Benford's Law.

[Image: Screen Shot 2020-11-08 at 8 41 29 AM]

Any simulation method should first be checked to confirm that the data it generates is itself Benford under normal conditions.
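
A sketch of that kind of check, not the code behind the screenshot. The mean of 500 and standard deviation of 100 are assumed values:

```python
import math
import random
from collections import Counter

rng = random.Random(42)
# Assumed parameters: mean 500, standard deviation 100; keep only positive integers.
values = [int(rng.gauss(500, 100)) for _ in range(10_000)]
values = [v for v in values if v > 0]

digits = Counter(int(str(v)[0]) for v in values)
observed = {d: digits[d] / len(values) for d in range(1, 10)}
benford = {d: math.log10(1 + 1 / d) for d in range(1, 10)}
max_gap = max(abs(observed[d] - benford[d]) for d in range(1, 10))

for d in range(1, 10):
    print(f"{d}: observed={observed[d]:.3f} benford={benford[d]:.3f}")
print(f"largest gap: {max_gap:.3f}")
```

The leading digits pile up around 4 and 5, and a leading 1 almost never occurs, so data generated this way fails the check before any fraud question comes up.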

And I think this is why researchers don't use the method; they may not know the functional form of Benford's Law in these cases

Here, I think we may have better luck doing Bootstrap Resampling on data from 2016 (or some other place)
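
A minimal sketch of what that bootstrap could look like. The counts below are placeholders standing in for real 2016 ward-level data, and the function names are my own:

```python
import random
from collections import Counter

def first_digit_freq(counts, digit):
    """Fraction of positive counts whose leading digit equals `digit`."""
    digits = Counter(int(str(c)[0]) for c in counts if c > 0)
    return digits[digit] / sum(digits.values())

def bootstrap_interval(counts, digit, n_resamples=1000, seed=0):
    """Percentile bootstrap interval (2.5% to 97.5%) for one first-digit frequency."""
    rng = random.Random(seed)
    stats = sorted(
        first_digit_freq(rng.choices(counts, k=len(counts)), digit)
        for _ in range(n_resamples)
    )
    return stats[int(0.025 * n_resamples)], stats[int(0.975 * n_resamples) - 1]

# Placeholder counts (not real data); substitute the 2016 ward counts here.
counts_2016_placeholder = [112, 134, 158, 171, 203, 219, 246, 288, 301, 322,
                           357, 410, 436, 489, 512, 547, 603, 688, 745, 912]

point = first_digit_freq(counts_2016_placeholder, 1)
low, high = bootstrap_interval(counts_2016_placeholder, 1)
print(f"digit 1: observed={point:.2f}, bootstrap interval=[{low:.2f}, {high:.2f}]")
```

Resampling wards with replacement makes no assumption about the functional form of the distribution, which is the appeal over a parametric simulation.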

pkit commented 3 years ago

@charlesmartin14 population distributions are usually "Zipfian" and not Normal

andrewzigerelli commented 3 years ago

Here's Allegheny: [image: allegheny]

Are you implementing the PNAS paper from Klimek et al? https://www.pnas.org/content/pnas/109/41/16469.full.pdf

If so, are you making the repo public?

charlesmartin14 commented 3 years ago

@pkit. I'm referring to the simulations being used to create the distributional baseline, where random binomial trials have been used to simulate the data for this thread.

See also issue #9:

Benford's Law reflects something we know about natural data sets -- they are almost always heavy-tailed (i.e log-normal), not normally distributed. If a naive fraudster were to use a random ballot-stuffing scheme, we would expect the voting data to be normally distributed. And that's the point. Benford's Law tells us where to begin looking for fraud.

chavenor commented 3 years ago

Here's Allegheny: [image: allegheny]

Are you implementing the PNAS paper from Klimek et al? https://www.pnas.org/content/pnas/109/41/16469.full.pdf

If so, are you making the repo public?

Can we run this against the same data set but for the 2016 election? Curious to see if we see the same distribution mirrored for Trump.

I find it odd you can have lower voter turnout within your party and somehow that turns into a huge advantage. Did independents really carry the day for only Biden/Harris?

If these votes turn out to be real then the Democratic party has cracked the code for generating party energy and will essentially win every major election moving forward.

What am I missing in the "Benford" discussion? It looks like heavily B/H-leaning samples per ward would produce the results we are seeing. We saw the same thing in 2016 with Clinton -- obviously, we would have no idea if that election is as clean. My gut says it is due to the establishment reactions on election day.

MechanicalTim commented 3 years ago

The only credible sources of information I have been able to find are articles like this one, which strongly caution against blithely using Benford's Law in the way it has been used in this repo and issues.

I am trying to keep an open mind about the whole thing, and assuming folks are arguing in good faith for the point of view that there is evidence for fraud here. But this repo and its issues are increasingly looking to me to be about a hazy attempt to sow distrust in our election process. I continue to see no compelling argument, theoretical or empirical, that this is evidence for fraud.

EDIT: I just noticed that the creator of this repo joined github this month. This would be in line with the repo being a deliberate disinformation campaign.

EDIT 2: Note that I am not saying that this is a disinformation campaign. I've just noticed an anomaly that is consistent with a disinformation campaign. It's a bit hazily defined. Oh, the sweet sweet irony.

chavenor commented 3 years ago

@MechanicalTim I encourage you to read through every issue on this repo. You will see there is opposition to the case made for Benford's Law, on the grounds that these numbers are out of whack due to the high voter turnout for B/H in the areas in question. The 2016 election showed similar results, again with a very large percentage of the vote breaking towards H. Even if this were not the case, it wouldn't imply fraud off the bat, but it could suggest that a closer look at the operations may be warranted.

If the new graphs above (waiting for code and data sets from @1e100 ) based on Turnout vs Vote Percentage are accurate then this conversation just got a whole lot more interesting.

Trying to understand this historic election is a critical part of our republic and I do not know of a better way to do that than on Github where the data and code are available for others to replicate, test, and discuss.

1e100 commented 3 years ago

I've pushed the code to https://github.com/1e100/2020_benfords. Disclaimer, once again: I do not claim there is any fraud here. I'd like to see an explanation for Biden's "the lower the turnout, the higher the vote" phenomenon. It is quite possible that it is a legit pattern. Also, my Pandas-fu leaves much to be desired; this is the second time I'm using Pandas, so if anyone finds bugs, PRs are welcome, of course.

chavenor commented 3 years ago

Well, precinct level turnout data is technically available, but "available" in this case means a gigantic PDF, from which hundreds of values need to be copied out by hand. I don't have that kind of time or motivation.

Can you send me this file? I might be able to parse it. What format do you want it in?

1e100 commented 3 years ago

https://www.miamidade.gov/elections/library/reports/voter-registration-statistics-precincts.pdf, CSV would be ideal, in the form:

precinct_name, total_voter_registrations

But if you can also parse out registrations by party that could be interesting for analysis as well.

MechanicalTim commented 3 years ago

FYI, there is a free program called Tabula that is very good at extracting tabular data from PDF files.

1e100 commented 3 years ago

Works remarkably well: tabula-voter-registration-statistics-precincts.zip

charlesmartin14 commented 3 years ago

@MechanicalTim

The only credible sources of information I have been able to find are articles like this one, which strongly cautions against blithely using Benford's Law in the way it has been used in this repo and issues.

IMHO, they are a bit deceptive here. They claim that the numbers must span multiple orders of magnitude, such as ranging from 100 to 10,000,000.

Benford's Law applies to data sets that span ranges a lot smaller than that. Data from 0-1000 is sufficient if it is uniformly distributed on a log scale.
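
A quick sketch of that claim. It uses 1 rather than 0 as the lower bound, since a log scale cannot include 0, and draws the exponent uniformly on [0, 3):

```python
import math
import random
from collections import Counter

rng = random.Random(7)
# Log-uniform samples on [1, 1000): the base-10 exponent is uniform on [0, 3).
samples = [10 ** rng.uniform(0, 3) for _ in range(20_000)]

# For x >= 1, the integer part of x has the same leading digit as x.
digits = Counter(int(str(int(x))[0]) for x in samples)
observed = {d: digits[d] / len(samples) for d in range(1, 10)}
benford = {d: math.log10(1 + 1 / d) for d in range(1, 10)}
max_gap = max(abs(observed[d] - benford[d]) for d in range(1, 10))

for d in range(1, 10):
    print(f"{d}: observed={observed[d]:.3f} benford={benford[d]:.3f}")
print(f"largest gap: {max_gap:.3f}")
```

Three decades are enough here only because the samples are spread evenly across them on the log scale; counts bunched into part of one decade do not get the same result.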

1e100 commented 3 years ago

Come to think of it, one thing that could explain my vote/turnout histograms above is the preference of Trump voters for in-person voting. That is, if both sides vote at some fraction by mail (say 70/30 for Biden/Trump in this case) but then still more Trump voters also show up in person, that will, at the same time, boost turnout for Trump, and reduce Biden's percentage of the vote. But, of course, this needs to be verified with data, because what I have here is pure conjecture.
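
To make the arithmetic of that conjecture concrete, here is a toy sketch. All numbers are invented, and it assumes every extra in-person vote goes to Trump, which is a simplification:

```python
def precinct_outcome(registered, mail_votes, biden_mail_share, extra_trump_in_person):
    """Return (turnout, Biden vote share) for one hypothetical precinct."""
    biden = mail_votes * biden_mail_share
    trump = mail_votes * (1 - biden_mail_share) + extra_trump_in_person
    total = biden + trump
    return total / registered, biden / total

# 1000 registered voters, 400 mail votes split 70/30 for Biden/Trump,
# then a varying number of extra in-person Trump votes.
extras = (0, 100, 200, 300)
results = [precinct_outcome(1000, 400, 0.70, extra) for extra in extras]

for extra, (t, share) in zip(extras, results):
    print(f"extra in-person Trump votes={extra}: turnout={t:.2f}, Biden share={share:.2f}")
```

In this toy setup turnout rises while Biden's share falls as in-person Trump votes are added, which reproduces the direction of the pattern but does not show that this is what happened.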