berzerk0 / Probable-Wordlists

Version 2 is live! Wordlists sorted by probability originally created for password generation and testing - make sure your passwords aren't popular!
Creative Commons Attribution Share Alike 4.0 International

Suggestion: Statistics about popularity. #32

Closed Ho52198 closed 6 years ago

Ho52198 commented 6 years ago

Hello. Maybe I am wrong, but I have a feeling that a large share of all passwords are "seen" only once across the different sources (assuming the sources are not just copies or upgrades of each other). It would be useful to have some general guidance like:

- first million - words seen between 200 and 20 times
- from 1000k to 10000k - words seen between 19 and 4 times
- from 10000k to 1000000k - words seen between 3 and 2 times
- from 100000k to the end - words seen 1 time only

This would give a better understanding of where the probability ordering stops and random or alphabetical order starts.

For example, even in the 120m wordlist I saw many passwords that obviously came from a random generator, and the chance that they are used by many people or in many places is close to zero.

berzerk0 commented 6 years ago

This is already in place, and it is how the list sizes are determined. I'll make this information more prominent.

From the ReadMe at https://github.com/berzerk0/Probable-Wordlists/tree/master/Real-Passwords

- I generated files by the number of times each line appeared in my analysis. Files are available for 75, 50, 25, 10, and 5 appearances.
- Top 196 - appeared at least 75 times - these are the MOST common passwords
- Top 3575 - appeared at least 50 times
- Top 95 Thousand - appeared at least 25 times 
- Top 32 Million - appeared at least 10 times
- Top 258 Million - appeared at least 5 times
- Top 2 Billion - appeared at least 2 times
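
As a rough illustration of how thresholded lists like these could be produced, here is a minimal sketch. It assumes the sources are plain text with one password per line. The names `count_appearances` and `lists_by_threshold` and the sample data are hypothetical and are not taken from the repository's actual tooling.

```python
from collections import Counter
from typing import Dict, Iterable, List, Sequence

# Minimum-appearance cutoffs mirroring the list above.
THRESHOLDS = (75, 50, 25, 10, 5, 2)


def count_appearances(lines: Iterable[str]) -> Counter:
    """Count how often each non-empty line occurs across all sources."""
    counts: Counter = Counter()
    for line in lines:
        word = line.rstrip("\n")
        if word:
            counts[word] += 1
    return counts


def lists_by_threshold(
    counts: Counter, thresholds: Sequence[int] = THRESHOLDS
) -> Dict[int, List[str]]:
    """Map each cutoff to the words seen at least that often, most common first."""
    # Sort by descending count; break ties alphabetically so output is stable.
    ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return {t: [w for w, c in ranked if c >= t] for t in thresholds}


if __name__ == "__main__":
    # Made-up sample data, for illustration only.
    sample = ["123456"] * 80 + ["password"] * 55 + ["qwerty"] * 12 + ["x9!kQ2"]
    result = lists_by_threshold(count_appearances(sample))
    for cutoff in THRESHOLDS:
        print(cutoff, result[cutoff])
```

Each list is a superset of the one before it, which is why the file sizes grow as the cutoff drops. For inputs at the scale of billions of lines, an in-memory `Counter` would likely not be practical, and an external sort-and-count approach would be needed instead.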

From the source files used to make Rev 1, only 1/3 of the passwords appeared more than once. Lines that appeared only once don't make it onto this list. If a line shows up only once, I can hardly call it "Probable".
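
A figure like the one-third share could be checked with a small sketch like the following. It assumes the share is measured over distinct lines, which the comment does not state explicitly. The names `repeat_fraction` and `keep_repeated` are hypothetical, and `min_appearances` is an illustrative parameter for the inclusion cutoff.

```python
from collections import Counter
from typing import List


def repeat_fraction(counts: Counter) -> float:
    """Share of distinct lines that were seen more than once."""
    if not counts:
        return 0.0
    repeated = sum(1 for c in counts.values() if c > 1)
    return repeated / len(counts)


def keep_repeated(counts: Counter, min_appearances: int = 2) -> List[str]:
    """Keep lines seen at least `min_appearances` times, most common first."""
    ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return [w for w, c in ranked if c >= min_appearances]


if __name__ == "__main__":
    # Made-up counts, for illustration only.
    counts = Counter({"123456": 3, "hunter2": 1, "x9!kQ2": 1})
    print(repeat_fraction(counts))
    print(keep_repeated(counts))
```

Raising `min_appearances` to 3 or 5 would shrink the output in the same way a stricter inclusion rule would.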

Some passwords may look random and seem very unlikely to be used. However, if a line appeared in the source files more than once, it ended up in the lists. It's quite difficult, if not impossible, to reverse engineer the giant encyclopedic wordlists that form some of the source material. Odds are the random-looking lines near the bottom of the 2 billion list only appeared in one leak, but there isn't any way for me to know that.

I erred on the side of inclusivity this time. For Rev 3, I may raise the minimum number of appearances needed for inclusion to three or five.