berzerk0 / Probable-Wordlists

Version 2 is live! Wordlists sorted by probability originally created for password generation and testing - make sure your passwords aren't popular!
Creative Commons Attribution Share Alike 4.0 International

De-duplicate items #2

Closed Miserlou closed 7 years ago

Miserlou commented 7 years ago

Looks like there could be quite a few dupes in here. For instance, "password" is at lines 1 and 19: https://github.com/berzerk0/Probable-Wordlists/blob/master/Real-Passwords/WPA-Length/Top76-probable-WPA.txt

Miserlou commented 7 years ago

Good project though!

Would love to see a list of WPA-formatted passwords that come just from router/wifi sources, not user-passwords.

berzerk0 commented 7 years ago

Duplication - this is me getting caught by the classic invisible line-ending difference between Windows and Linux. Rev 1.1 will have this fixed in the main files; the Chunk files will take longer.
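To illustrate the line-ending problem, here is a minimal sketch using a made-up two-line sample file (not one of the project's lists). A Windows-style line ends in CR LF, so "password" followed by a hidden carriage return compares as a different string from plain "password", even though both look identical on screen. The variable names are hypothetical.

```shell
# Hypothetical sample: the same word with Windows (CRLF) and Unix (LF) endings.
sample=$(mktemp)
printf 'password\r\npassword\n' > "$sample"

# Byte-wise comparison sees two distinct lines, because one carries a hidden \r.
before=$(LC_ALL=C sort -u "$sample" | wc -l | tr -d ' ')

# Deleting carriage returns first makes the two lines identical.
after=$(tr -d '\r' < "$sample" | LC_ALL=C sort -u | wc -l | tr -d ' ')

echo "distinct lines before: $before, after: $after"
rm -f "$sample"
```

Note that `tr -d '\r'` removes every carriage return, including any in the middle of a line, which is assumed to be acceptable for a wordlist.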

WPA-formatted sources - I have found wordlists that include "WPA" in the title, but that isn't much of a guarantee that they come exclusively from router/wifi sources.

It is also possible (and equally not possible, as I am asserting this with zero evidence) that the trends for common passwords do not change dramatically whether they are used for a router or for an email account. It seems just as likely to me that people see it as a generic "password" rather than "the Wifi password."

I'll see if I can find some sources with more background, but I have doubts.

EDIT: Of course, today I went somewhere where the guest Wifi password was "wireless guest".

WiseNerd commented 7 years ago

An easy fix for the dupes that worked for me was issuing :%s/^M\+// in vim (typing ^M as Ctrl-V Ctrl-M) to kill the trailing carriage-return artifacts from Windows, and then running uniq -u passfile.txt > cleanpassfile.txt. Cool project.

ghost commented 7 years ago

@WiseNerd So if you already fixed it, why not make a PR?

iancnorden commented 7 years ago

PR from me shortly for de-dupe. Great work.

berzerk0 commented 7 years ago

@iancnorden You're gonna beat me to the punch! I have the desktop chugging away, but won't be back to upload changes for a half day or so

iancnorden commented 7 years ago

Now it's a race! I had not realized the size; git clone is still chugging away!

WiseNerd commented 7 years ago

@blobgo Well, my MacBook's limited DDR2 memory would be neutered by sanitizing that entire thing. I fixed a small part, mostly out of curiosity, but was hoping to save somebody some time nonetheless :)

iancnorden commented 7 years ago

De-dupes still running.

berzerk0 commented 7 years ago

Initial De-Dupes (up to ~30 Million Non-Spec and WPA) are done. Looks like I can't do the big ones in parallel, so they will probably be done by tomorrow.

Or so I thought; they didn't come out right.

@WiseNerd I was using

awk '!seen[$0]++' hasDupes > doesntHaveDupes 

which I assumed started at the top and worked its way down, but then for one of the files it popped "password" out of the 2nd slot. No way.

uniq 

only works if two lines are next to one another, unfortunately.

I might just have to compile again from sources, unless @iancnorden's experience comes up with a solid de-duping method.
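A minimal sketch of the difference between the two commands mentioned above, run on a made-up five-line list in which the repeats are not adjacent. The file contents and variable names are assumptions for illustration only.

```shell
list=$(mktemp)
printf 'password\n12345678\npassword\nqwertyui\n12345678\n' > "$list"

# uniq only merges adjacent repeats, so all five lines survive here.
uniq_out=$(uniq "$list")

# awk prints a line only the first time it is seen, keeping the original order.
awk_out=$(awk '!seen[$0]++' "$list")

printf 'uniq:\n%s\n\nawk:\n%s\n' "$uniq_out" "$awk_out"
rm -f "$list"
```

The awk form compares whole lines byte for byte, so a line with a hidden trailing carriage return still counts as different from the same line without one.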

iancnorden commented 7 years ago

Chewing on the folder with Top2Bill*

164/958 completed, started around 14:00 Eastern.

If curious, thanks to https://github.com/ltdenard ... and this will have to continue overnight at this rate.

for f in ./*; do sort -u "${f}" > /tmp/tmp1 && mv /tmp/tmp1 "${f}"; done

palexhorse commented 7 years ago

Can all unique combinations be put into a new file, or do you just want the duplicates removed?

berzerk0 commented 7 years ago

For Rev 1.1 we aim to just remove the duplicates while otherwise preserving order. The "duplicates" are likely illusory: invisible line-ending characters probably make otherwise identical entries look different. Removing them has some effect on overall accuracy.
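One way the stated goal (remove duplicates, keep the order) could be sketched for a whole folder is shown below. This is an illustration under assumptions, not the procedure actually used for Rev 1.1: the helper name dedupe_dir and the sample file are hypothetical, and carriage returns are assumed to be the only invisible characters involved.

```shell
# Hypothetical helper: de-duplicate every regular file in a directory in place,
# stripping carriage returns first and keeping first occurrences in order.
dedupe_dir() {
  dir=$1
  for f in "$dir"/*; do
    [ -f "$f" ] || continue
    tmp=$(mktemp) || return 1
    # Replace the original only if the last pipeline stage (awk) succeeded.
    if tr -d '\r' < "$f" | awk '!seen[$0]++' > "$tmp"; then
      mv "$tmp" "$f"
    else
      rm -f "$tmp"
    fi
  done
}

# Example run on a throwaway directory with made-up contents.
demo=$(mktemp -d)
printf 'password\r\n12345678\npassword\nabc12345\n' > "$demo/sample.txt"
dedupe_dir "$demo"
result=$(cat "$demo/sample.txt")
printf '%s\n' "$result"
rm -rf "$demo"
```

Unlike sort -u, this keeps the probability ordering intact. The trade-off is that awk holds every distinct line in memory, so the largest files may need a machine with plenty of RAM.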

Rev 2.0 will have the newlines weeded out at the source, so this problem will not carry over.

berzerk0 commented 7 years ago

De-Duped Rev 1.1 is live now, but does not contain the largest files.

Rev 1.2 will, in torrents with compression.

Closing this in light of the release of 1.1 and the impending release of 1.2