cjph8914 / 2020_benfords

Research suggests Benford is unreliable in Election Fraud Detection #11

Open mlamias opened 3 years ago

mlamias commented 3 years ago

Nice analysis. However, I wanted to point you to a few articles that may be of interest. Essentially, the research suggests that Benford's Law is unreliable when applied to election data:

https://repository.library.georgetown.edu/handle/10822/557850

https://www.jstor.org/stable/23011436?seq=1

https://courses.math.tufts.edu/math19/duchin/dmo.pdf

https://www.cambridge.org/core/journals/political-analysis/article/benfords-law-and-the-detection-of-election-fraud/3B1D64E822371C461AF3C61CE91AAF6D

mlewis1973 commented 3 years ago

Yes, it has been shown to be susceptible to false positives: deviations from Benford's Law when there is no fraud. In this particular case, the total number of voters per precinct does not span many orders of magnitude, so Benford's Law is unlikely to apply. Here's a histogram of the sum of Biden and Trump voters, which is a decent approximation of the number of voters per precinct. It would be interesting to see any scholarship that looks at the distribution of deviations from Benford's Law. Would you expect Trump-leaning districts in PA to deviate from Benford's Law about as often as the areas Biden won?
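
As a rough illustration of the point about orders of magnitude, here is a minimal sketch that computes the first-digit shares Benford's Law predicts, P(d) = log10(1 + 1/d), and measures how many orders of magnitude a set of counts spans. The function names and the example totals are hypothetical and are not taken from the repo or its data.

```python
import math

def benford_first_digit_probs():
    """Expected share of each leading digit 1-9 under Benford's Law."""
    return {d: math.log10(1 + 1 / d) for d in range(1, 10)}

def orders_of_magnitude_spanned(counts):
    """Return log10(max / min) over the positive counts.

    Zero counts are skipped because they have no leading digit.
    """
    positive = [c for c in counts if c > 0]
    if not positive:
        raise ValueError("need at least one positive count")
    return math.log10(max(positive) / min(positive))

# Hypothetical precinct totals (assumption: NOT real data from this repo).
example_totals = [412, 530, 618, 745, 890]

for digit, share in benford_first_digit_probs().items():
    print(digit, round(share, 3))
print("orders of magnitude spanned:",
      round(orders_of_magnitude_spanned(example_totals), 2))
```

A span well below 1, as in these made-up totals, is the situation described above, where a first-digit comparison is unlikely to be meaningful.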

[Screenshot: histogram of combined Biden and Trump votes per precinct]

jimfcarroll commented 3 years ago

It seems this depends on the precinct distribution. In Russia and Iran, for example, some of the papers claim you can't use the first digit because the precincts are too evenly sized. This makes sense, since you won't get numbers that span orders of magnitude.

However, in at least some of the US cities in question, this doesn't appear to be the case by precinct. In Milwaukee, for example, the number of registered voters is as low as 4 and goes as high as several thousand. Granted, most are on the order of 10^2 and 10^3.

Then again, if this is a deal breaker, why does it seem to produce the expected distributions for Trump and Jo Jorgensen, but flag anomalies only for Biden?

nickcorona commented 3 years ago

> It seems this depends on the precinct distribution. In Russia and Iran, for example, some of the papers claim you can't use the first digit because the precincts are too evenly sized. This makes sense, since you won't get numbers that span orders of magnitude.
>
> However, in at least some of the US cities in question, this doesn't appear to be the case by precinct. In Milwaukee, for example, the number of registered voters is as low as 4 and goes as high as several thousand. Granted, most are on the order of 10^2 and 10^3.
>
> Then again, if this is a deal breaker, why does it seem to produce the expected distributions for Trump and Jo Jorgensen, but flag anomalies only for Biden?

Because Biden is the one being accused of election fraud.

SageGaspar commented 3 years ago

> It seems this depends on the precinct distribution. In Russia and Iran, for example, some of the papers claim you can't use the first digit because the precincts are too evenly sized. This makes sense, since you won't get numbers that span orders of magnitude.
>
> However, in at least some of the US cities in question, this doesn't appear to be the case by precinct. In Milwaukee, for example, the number of registered voters is as low as 4 and goes as high as several thousand. Granted, most are on the order of 10^2 and 10^3.
>
> Then again, if this is a deal breaker, why does it seem to produce the expected distributions for Trump and Jo Jorgensen, but flag anomalies only for Biden?

This repo is attempting to apply Benford's Law to the distribution of vote counts, so that is what actually needs to span multiple orders of magnitude. Precinct size distribution is a factor in the vote count distribution, but it doesn't tell the whole story.

I don't see the Milwaukee data in this repo, but take a look at the Chicago data: https://github.com/cjph8914/2020_benfords/blob/main/data/chicago_dataexport.csv

Biden's vote totals are solidly contained within one order of magnitude, the 100-999 range. Trump's vote totals range from single digits into the hundreds, across three orders of magnitude. Jo Jorgensen is mostly in the 0-20 range, across two orders of magnitude.
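
To make the contrast concrete, here is a small sketch that tallies leading digits for two constructed data sets: one confined to part of the 100-999 range, and one that grows geometrically across three orders of magnitude. Both lists are toy data built for illustration (an assumption, not the Chicago export), and the helper names are hypothetical.

```python
from collections import Counter

def leading_digit(n):
    """First decimal digit of a positive integer."""
    return int(str(n)[0])

def first_digit_shares(counts):
    """Share of each leading digit 1-9 among the positive counts."""
    digits = [leading_digit(c) for c in counts if c > 0]
    if not digits:
        raise ValueError("need at least one positive count")
    tally = Counter(digits)
    return {d: tally.get(d, 0) / len(digits) for d in range(1, 10)}

# Toy data (assumption): every count sits between 300 and 599,
# so only the digits 3, 4 and 5 can ever appear first.
confined = list(range(300, 600))

# Toy data (assumption): 10% growth per step from 1 up to about 955,
# which spans roughly three orders of magnitude.
spanning = [int(1.1 ** k) for k in range(73)]

confined_shares = first_digit_shares(confined)
spanning_shares = first_digit_shares(spanning)
for d in range(1, 10):
    print(d, round(confined_shares[d], 3), round(spanning_shares[d], 3))
```

The confined list can never produce a leading 1, regardless of how the counts arose, which is why a first-digit mismatch on such data says little by itself.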

jimfcarroll commented 3 years ago

@SageGaspar good explanation

dshield55 commented 3 years ago

> yes it's been shown to be susceptible to false positives

What are the odds of a false positive? If you look at 6 counties where fraud was suspected and find positives in all 6, should we assume that all 6 are likely to be false positives? And if only 2 positives are found, what are the odds that both are false positives?
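
One way to frame this question is with a binomial calculation. The sketch below assumes that every county is fraud-free, that the tests are independent, and that each has the same per-test false-positive rate. That rate is an input you have to supply; nothing in this thread measures it, and the function name is hypothetical.

```python
from math import comb

def prob_at_least_k_flags(n, k, p_false_positive):
    """P(at least k of n fraud-free datasets are flagged).

    Assumes independent tests that share one false-positive rate.
    """
    return sum(
        comb(n, j) * p_false_positive ** j * (1 - p_false_positive) ** (n - j)
        for j in range(k, n + 1)
    )

# Assumed rates, for illustration only.
print(prob_at_least_k_flags(6, 6, 0.05))  # test that rarely misfires
print(prob_at_least_k_flags(6, 6, 0.9))   # test that misfires on most such data
```

The answer depends almost entirely on the assumed rate. If the vote counts sit within a single order of magnitude, as discussed above for Chicago, the rate for a first-digit test could be close to 1, and six flags out of six would then be unsurprising.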

harrybrwn commented 3 years ago

> I don't see the Milwaukee data in this repo

The Milwaukee data is being scraped from this site. The vote counts in that data go up to 2800 for Biden and 2000 for Trump.

Telofy commented 3 years ago

I wonder if you could reanalyze the data from those very evenly sized districts, which have similar numbers of votes for Biden, in base 5 or so. You would get more orders of magnitude at the cost of fewer distinct digits. Maybe the smaller differences would then span enough orders of magnitude for the Benford-like distribution to appear again.
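
A minimal sketch of what a base-5 reanalysis could look like follows. Benford's Law generalizes to base b as P(d) = log_b(1 + 1/d) for d = 1..b-1; the helper names here are hypothetical, and whether the resulting span is wide enough for the law to hold is not something this sketch settles.

```python
import math

def leading_digit_in_base(n, base):
    """Leading digit of a positive integer written in the given base."""
    if n <= 0:
        raise ValueError("n must be positive")
    while n >= base:
        n //= base
    return n

def digits_in_base(n, base):
    """Number of digits needed to write a positive integer in the given base."""
    count = 0
    while n > 0:
        n //= base
        count += 1
    return count

def benford_probs_in_base(base):
    """Expected leading-digit shares under Benford's Law in the given base."""
    return {d: math.log(1 + 1 / d, base) for d in range(1, base)}

# The 100-999 range is one order of magnitude in base 10,
# but runs from 3-digit to 5-digit numbers in base 5.
print(digits_in_base(100, 5), digits_in_base(999, 5))
print(benford_probs_in_base(5))
```

Changing the base does not change the ratio between the largest and smallest counts, so the data still covers only log5(9.99), about 1.4, base-5 orders of magnitude.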

charlesmartin14 commented 3 years ago

The whole point of the first-digit indicator is to find unusual situations. We know that Benford's Law cannot be applied to datasets with upper and lower bounds. The question is, why do Biden's Chicago vote counts cluster in the 100-999 range? Is this a natural voting pattern, or is it due to something else? Researchers in election forensics see this behavior all over, and they generally explain it as election strategy: get-out-the-vote efforts and so on. But they don't really know, because they are not on the ground in Chicago. This could be something else, like vote buying, voter intimidation, ballot stuffing, etc. We don't know.

It is exactly this anomalous behavior that is in question.

This is why researchers look for other patterns in the voting data, such as second-digit and last-digit patterns, as well as the distributional patterns of the vote counts themselves.
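
For reference, here is a sketch of the expected distributions behind those two checks. The second-digit probabilities follow from Benford's Law by summing over all possible first digits; for the last digit, the usual reference point is a roughly uniform distribution. The function names are hypothetical, and this is not the specific procedure used in the talk linked below.

```python
import math
from collections import Counter

def benford_second_digit_probs():
    """Expected share of each second digit 0-9 under Benford's Law."""
    return {
        d2: sum(math.log10(1 + 1 / (10 * d1 + d2)) for d1 in range(1, 10))
        for d2 in range(10)
    }

def second_digit_shares(counts):
    """Observed share of each second digit among counts of 10 or more."""
    digits = [int(str(c)[1]) for c in counts if c >= 10]
    if not digits:
        raise ValueError("need at least one count with two or more digits")
    tally = Counter(digits)
    return {d: tally.get(d, 0) / len(digits) for d in range(10)}

def last_digit_shares(counts):
    """Observed share of each last digit among non-negative counts."""
    digits = [c % 10 for c in counts if c >= 0]
    if not digits:
        raise ValueError("need at least one non-negative count")
    tally = Counter(digits)
    return {d: tally.get(d, 0) / len(digits) for d in range(10)}

for digit, share in benford_second_digit_probs().items():
    print(digit, round(share, 4))
```

The expected second-digit shares are much flatter than the first-digit ones, falling from about 12% for 0 to about 8.5% for 9.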

Here is a nice talk on the subject from one of the leaders of the field https://www.youtube.com/watch?v=zkx_eO0PvXU

Notice that they never look at the distribution of digits itself, but, rather, are looking for statistical indicators that characterize the distributions, such as a good mean value and reliable upper and lower bounds.

In the situations we are seeing in many cities, the voting patterns are, on the surface, seemingly so odd as to qualify as extreme fraud in the Klimek model (go to t=3522s), but to apply the model one may need to examine the data in more detail.