cjph8914 / 2020_benfords

Understanding the plot #20

Open krist-jin opened 3 years ago

krist-jin commented 3 years ago

Can someone help me to understand the plots? I understand Benford law but what does the frequency mean in the plots? I know it's the frequency of some numbers/data of the vote but what exactly are these numbers? Where do they come from? Thanks!

1e100 commented 3 years ago

It's a model of the distribution of first-digit frequencies (that is, "1" in "100" or "10", "2" in "21", "223", or "2563", etc.) typically observed in naturally occurring datasets. It's empirical; there's no theoretical explanation for it, as far as I know. It says that in an untampered, naturally occurring distribution of numbers, you expect to see "1" as the first digit a certain percentage of the time, "2" a certain lower percentage of the time, and so on. Notably, you should not see any pronounced "humps" like the ones on these histograms. This is not intended to be irrefutable evidence, merely a signal for further investigation of potential tampering.
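For reference, the "certain percentage" for each digit is usually written as log10(1 + 1/d) for a leading digit d. A minimal sketch that prints those expected shares (the name `benford` is only illustrative and does not come from the repo's notebook):

```python
import math

# Expected share of leading digit d under Benford's law: log10(1 + 1/d).
benford = {d: math.log10(1 + 1 / d) for d in range(1, 10)}

for d, p in benford.items():
    # Shares fall steadily from about 30% for "1" to under 5% for "9".
    print(f"{d}: {p:.1%}")
```

The nine shares sum to 1, and each digit is expected less often than the one before it, which is why a "hump" in the middle digits stands out.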

krist-jin commented 3 years ago

@1e100 thanks for the reply! However, my doubts are about the source data/numbers used here. In your example, what do 21, 223, and 2563 mean for the vote? Where do these numbers come from? Basically, I don't see the connection between the vote and the data being plotted here. Thanks!

1e100 commented 3 years ago

The numbers are vote counts by precinct. If they are not manipulated, the first digit in those vote counts usually follows Benford's distribution. The code in the notebook makes the derivation of these histograms pretty clear:

  1. Get vote counts broken down by precinct, for each candidate
  2. Extract the first digit of the counts
  3. See how often you see that digit, and compare with how often Benford's law says it should occur.
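The three steps above can be sketched roughly as follows. This is not the notebook's code: the function names and the `precinct_votes` list are hypothetical, and the numbers are made up purely to show the mechanics.

```python
import math
from collections import Counter

def first_digit(n):
    """Leading decimal digit of a positive integer."""
    return int(str(n)[0])

def digit_frequencies(counts):
    """Observed share of each leading digit 1-9; zero counts are skipped.

    Assumes at least one positive count is present.
    """
    digits = [first_digit(c) for c in counts if c > 0]
    tally = Counter(digits)
    return {d: tally[d] / len(digits) for d in range(1, 10)}

# Step 1: hypothetical precinct-level vote counts for one candidate (made up).
precinct_votes = [21, 223, 2563, 100, 10, 187, 342, 95, 1204, 0]

# Step 2: extract first digits and tally them.
observed = digit_frequencies(precinct_votes)

# Step 3: compare with the Benford expectation log10(1 + 1/d).
expected = {d: math.log10(1 + 1 / d) for d in range(1, 10)}

for d in range(1, 10):
    print(f"{d}: observed {observed[d]:.2f}, Benford {expected[d]:.2f}")
```

With only ten made-up values the observed shares are very noisy; a real comparison would need many precincts, and even then a mismatch is only a prompt to look closer.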

Let me reiterate though, that a discrepancy in distributions by itself does not indicate fraud. It is only a sign that further audit may be worthwhile.

MechanicalTim commented 3 years ago

@1e100 , can you please provide a source for your statement, "If they are not manipulated, the first digit in those vote counts usually follows Benford's distribution."

1e100 commented 3 years ago

I’m sure you can find it yourself if you can use Google and genuinely want to know.

On Nov 8, 2020, at 08:23, MechanicalTim notifications@github.com wrote:

 @1e100 , can you please provide a source for your statement, "If they are not manipulated, the first digit in those vote counts usually follows Benford's distribution."


MechanicalTim commented 3 years ago

Actually, I am trying to give you the benefit of the doubt here. The only credible sources of information I have been able to find are articles like this one, which strongly caution against blithely using Benford's Law in the way it has been used here.

If you have actually found something equally credible, I am trying to keep an open mind about it.

I am assuming you are arguing in good faith for your point of view. But this repo and its issues are increasingly looking to me to be about a hazy attempt to sow distrust in our election process.

1e100 commented 3 years ago

I’ve gone out of my way several times to underscore that the lack of conformance to Benford’s by itself is not evidence of fraud (nobody but the illiterate Twitterati would claim that), but merely an indicator that a further investigation might be worthwhile. And I don’t need your “benefit of the doubt”. :-)

On Nov 8, 2020, at 14:11, MechanicalTim notifications@github.com wrote:

 Actually, I am trying to give you the benefit of the doubt here. The only credible sources of information I have been able to find are articles like this one, which strongly caution against blithely using Benford's Law in the way it has been used here.

If you have actually found something equally credible, I am trying to keep an open mind about it.

I am assuming you are arguing in good faith for your point of view. But this repo and its issues are increasingly looking to me to be about a hazy attempt to sow distrust in our election process.


charlesmartin14 commented 3 years ago

@1e100 Frequently, real-world data is heavy-tailed and, therefore, may span a large range, say, from 0 to 10,000. The vote data here is not always so well distributed, but sometimes it is pretty good.

When you see 'humps' in the data, this may indicate some deviation from this expectation. Here, we may find that the underlying data is not heavy-tailed and/or is bounded from below. This could be an indicator of potential fraud, or it could just reflect reality.

For example, see #31
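A rough illustration of the range point, using synthetic numbers only (nothing here is vote data). As an assumption for the sketch, a log-uniform sample over 1 to 10,000 stands in for "wide-ranging" data, and a uniform sample between 400 and 900 stands in for data that is narrow and bounded from below:

```python
import math
import random
from collections import Counter

def leading_digit_shares(values):
    """Share of each leading digit 1-9 among the positive integers in values."""
    digits = [int(str(v)[0]) for v in values if v > 0]
    tally = Counter(digits)
    return {d: tally[d] / len(digits) for d in range(1, 10)}

rng = random.Random(0)  # fixed seed so the sketch is repeatable
n = 20000

# Wide: spans four orders of magnitude (log-uniform, an assumed stand-in).
wide = [int(10 ** rng.uniform(0, 4)) for _ in range(n)]
# Narrow: every value sits between 400 and 900, so digits 1-3 never lead.
narrow = [rng.randint(400, 900) for _ in range(n)]

wide_shares = leading_digit_shares(wide)
narrow_shares = leading_digit_shares(narrow)

for d in range(1, 10):
    print(f"{d}: wide {wide_shares[d]:.3f}, narrow {narrow_shares[d]:.3f}, "
          f"Benford {math.log10(1 + 1 / d):.3f}")
```

The wide sample lands close to the Benford shares, while the narrow one shows a hump over the middle digits with no manipulation involved, which is the "could just reflect reality" case.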

1e100 commented 3 years ago

Interesting observation. I wonder what's up with that distribution as well, although the most curious thing I have seen so far is #28. As I said, I'm not a huge fan of Benford's: there is, by definition, no mathematical explanation for it, as it relies on "real world" distributions, which is a concept that can't be mathematically replicated. IMO if there was fraud (which I'm not claiming there was), it'll be easier to find the red flags in time series data, especially data arriving very late in the game.