cjph8914 / 2020_benfords


Analyzing the second leading digit makes all the results look non-conforming #10

Open dogweather opened 3 years ago

dogweather commented 3 years ago

This is really astonishing, and I want to make sure I didn't make some kind of simple mistake. I'm new to Pandas.

I took your notebook & data, and made the changes on Kaggle: https://www.kaggle.com/dogweather/allegheny-cty-benford-s

I suspect I'm not applying Benford's Law correctly. I.e., it doesn't apply to simply the second digit being a 2, but rather e.g. the number starting with 12.
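For reference, here is a minimal sketch of the two readings, using the standard generalized Benford formula rather than anything from the notebook. The function names are made up for illustration. The first gives the probability that a number starts with a given two-digit pair (e.g. 12); the second sums over all possible first digits to get the probability of a given second digit.

```python
import math


def first_two_digits_prob(n: int) -> float:
    """Benford probability that a number's first two digits are n (10..99)."""
    if not 10 <= n <= 99:
        raise ValueError("n must be a two-digit number")
    return math.log10(1 + 1 / n)


def second_digit_prob(d: int) -> float:
    """Benford probability that the second digit is d (0..9).

    Sums the two-digit probabilities over every possible first digit 1..9.
    """
    if not 0 <= d <= 9:
        raise ValueError("d must be a single digit")
    return sum(first_two_digits_prob(10 * d1 + d) for d1 in range(1, 10))


if __name__ == "__main__":
    print("P(starts with 12) =", round(first_two_digits_prob(12), 4))
    for d in range(10):
        print("P(second digit = %d) = %.4f" % (d, second_digit_prob(d)))
```

Under this formula the expected second-digit curve is much flatter than the first-digit one, running from roughly 12% for 0 down to roughly 8.5% for 9, so it is neither the first-digit curve nor perfectly uniform.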

zigster64 commented 3 years ago

What you should see when looking at anything but the 1st digit is a relatively flat line, because the 2nd digit is distributed between 0 and 9, each with equal probability.

Benford's Law only works with the 1st digit, and only for numbers that should be evenly distributed across a wide range.

Consider a set of numbers between 0 - 200

From 1-9, the leading digits are evenly distributed. From 10-19, every number begins with 1, so by 19 more than half begin with 1. By the time it gets to 100, you are starting to get a curve. Once it clicks over 100, the next 100 numbers all start with 1, boosting the probability of a 1 ... etc.
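The counting argument above can be checked directly. This is only a sketch over plain integer ranges, not over the vote data, and `leading_digit_counts` is a name invented here.

```python
from collections import Counter


def leading_digit_counts(lo: int, hi: int) -> Counter:
    """Count how often each leading digit occurs among the integers lo..hi."""
    return Counter(int(str(n)[0]) for n in range(lo, hi + 1))


if __name__ == "__main__":
    for upper in (9, 19, 99, 199):
        counts = leading_digit_counts(1, upper)
        share_of_ones = counts[1] / upper
        print("1..%d: %d of %d start with 1 (%.1f%%)"
              % (upper, counts[1], upper, 100 * share_of_ones))
```

One thing this makes visible: for a plain run of consecutive integers, the share of leading 1s swings depending on where the range stops (11 of 19, but only 11 of 99, then 111 of 199), so the upper end of the range matters a lot.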

As the range of numbers gets big, the curve gets more certain

If you try to dodgy up the numbers for a vote by just making them up, changing 200 votes to 500 votes, that sort of thing, then plotting the "Benford's Law" graph will typically show a big bump in the middle of the graph where it shouldn't be.

Keep in mind, we are just tossing numbers and probabilities around here. The small data sets we are looking at here more than likely show that something is amiss with some of the numbers. There is no calculation of the % chance of fraud at this stage.

It's just pretty suss though that the only things that show these "tampering patterns" on the 1st digit just happen to have several other unrelated factors in common. (Leaving that as an exercise for the reader.)

More data, more analysis needed before we reach for the rope :)

Keep in mind that if you go looking for conclusions in any data set, there are ways of finding them to confirm any preconceived notion. So consider both proofs and disproofs when looking at results.

Given the stakes here, it's good to remain detached from the outcome for a while, and let the cold hard numbers tell the story. My 2c

dogweather commented 3 years ago

Given the stakes here, it's good to remain detached from the outcome for a while, and let the cold hard numbers tell the story.

I agree 100%. When I saw my results, I assumed I made a mistake somewhere. FYI, Benford's does also apply to the second and other digits, but likely not in the way I coded it:

zigster64 commented 3 years ago

Cool, that's interesting about digits other than the 1st. Will have to read up more on that.

I note that it works in bases other than 10 as well, so I'm interested in slicing and dicing the data to look at the same values using different bases (like base 2 through base 16) and seeing if there is a consistent trend.
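A rough sketch of what the base-b version might look like, assuming the usual generalization P(d) = log_b(1 + 1/d) for leading digits d = 1..b-1. Both helper names are hypothetical and nothing here is taken from the notebook.

```python
import math


def benford_prob(d: int, base: int = 10) -> float:
    """Expected share of leading digit d (1..base-1) in the given base."""
    if base < 2 or not 1 <= d < base:
        raise ValueError("need base >= 2 and 1 <= d < base")
    return math.log(1 + 1 / d, base)


def leading_digit(n: int, base: int = 10) -> int:
    """Leading digit of a positive integer n when written in the given base."""
    if n <= 0:
        raise ValueError("n must be positive")
    while n >= base:
        n //= base
    return n


if __name__ == "__main__":
    for base in (8, 10, 16):
        expected = [round(benford_prob(d, base), 3) for d in range(1, base)]
        print("base", base, expected)
```

Note that base 2 is a degenerate case for a first-digit test: every positive number starts with 1 in binary, so the expected share of 1s is 100% and there is nothing to compare.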

There is some good discussion in another issue here that suggests the Benford number isn't entirely valid for some smaller sets because, say, Dems are getting 60% in a small collection of counties with few people, so the numbers are skewed heavily into the 200-300 range.

Another approach is to just multiply the raw values by some constant to put them into a larger range - whatever the value is, the Benford graph should look similar.
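A quick way to sanity-check the rescaling idea is to try it on a sequence that is known to follow Benford's Law. The sketch below uses the first 1000 powers of 2 as stand-in data (an assumption made purely for illustration, these are not vote counts) and multiplies them by an arbitrary constant of 3.

```python
import math
from collections import Counter


def first_digit_freqs(values):
    """Return {digit: share} for the leading decimal digit of each value."""
    counts = Counter(int(str(v)[0]) for v in values)
    total = sum(counts.values())
    return {d: counts[d] / total for d in range(1, 10)}


# Stand-in data: powers of 2 are a classic Benford-conforming sequence.
values = [2 ** n for n in range(1000)]
scaled = [3 * v for v in values]  # the constant 3 is arbitrary

original_freqs = first_digit_freqs(values)
scaled_freqs = first_digit_freqs(scaled)

if __name__ == "__main__":
    for d in range(1, 10):
        print("%d  expected %.3f  original %.3f  scaled %.3f"
              % (d, math.log10(1 + 1 / d), original_freqs[d], scaled_freqs[d]))
```

The caveat is that this invariance holds for data that already spans many orders of magnitude. Values bunched in a narrow band such as 200-300 would simply move to 600-900 when multiplied by 3, so rescaling shifts the bump rather than producing a Benford curve.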