berzerk0 / Probable-Wordlists

Version 2 is live! Wordlists sorted by probability originally created for password generation and testing - make sure your passwords aren't popular!
Creative Commons Attribution Share Alike 4.0 International

List not filtered properly #29

Open interlocuteur opened 7 years ago

interlocuteur commented 7 years ago

The 258Million list has not been filtered properly. It contains a lot of HTML tags.

berzerk0 commented 6 years ago

Guess this one slipped by me, do you have a specific example?

It's possible these are legitimately being used as passwords - but that's very unlikely.

interlocuteur commented 6 years ago

I don't have the file anymore but you can search for angled brackets "<" and ">"
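A minimal sketch of that search, assuming the list is a plain text file with one entry per line. The regular expression and the function name are illustrative assumptions, not part of the project's tooling; the pattern only flags entries that contain something shaped like an opening or closing tag, so a lone "<" or ">" is not reported.

```python
import re
from typing import Iterable, Iterator

# Assumed pattern: an opening or closing tag such as <br>, </div> or <a href="x">.
TAG_LIKE = re.compile(r"</?[A-Za-z][^<>]*>")


def find_tag_like_lines(lines: Iterable[str]) -> Iterator[str]:
    """Yield entries that contain something shaped like an HTML tag."""
    for line in lines:
        entry = line.rstrip("\r\n")
        if TAG_LIKE.search(entry):
            yield entry


# Small in-memory sample standing in for the real wordlist.
sample = ["password1", "<br>", "hunter2</div>", "a<b", "1>2<3", '<a href="x">']
flagged = list(find_tag_like_lines(sample))
print(flagged)
```

For a real file, the same function can be fed an open file handle, for example `open(path, encoding="utf-8", errors="replace")`, where `path` is wherever the list was saved locally.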

berzerk0 commented 6 years ago

This is tricky. I can't be sure of the origin of those lines - they might be both html tags and passwords.

berzerk0 commented 6 years ago

For Release 2.0, I erred on the side of inclusivity.

There are lines that look a lot like code, specifically HTML tags. The same is true for some email addresses. In many cases, these lines appeared in over 15 files in analysis, suggesting they are in fact passwords. This logic is not definitive, however.
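The "appeared in over 15 files" reasoning can be sketched roughly as follows. This is only an illustration of the idea, not the analysis actually used for the release: the function names, the toy file names, and the way sources are passed in are all assumptions.

```python
from collections import Counter
from typing import Dict, Iterable, Set


def count_source_files(sources: Dict[str, Iterable[str]]) -> Counter:
    """Count, for each entry, how many distinct source files contain it."""
    counts: Counter = Counter()
    for entries in sources.values():
        # Use a set so repeats inside a single file count only once.
        counts.update({entry.rstrip("\r\n") for entry in entries})
    return counts


def seen_in_more_than(counts: Counter, threshold: int = 15) -> Set[str]:
    """Return entries found in strictly more than `threshold` files."""
    return {entry for entry, n in counts.items() if n > threshold}


# Toy data: 16 hypothetical files share "<br>" and "123456"; one more has "<html>".
sources = {f"leak_{i}.txt": ["<br>", "123456"] for i in range(16)}
sources["leak_16.txt"] = ["<html>", "123456", "123456"]

counts = count_source_files(sources)
kept = seen_in_more_than(counts)
print(sorted(kept))
```

An entry that clears the threshold is more plausibly a real password than a one-off parsing artifact, but as noted above, this is suggestive rather than conclusive.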

All of the source files on the list were already published, so this information is already available to the internet. With this in mind, I opted to include these lines. Most questionable lines do not appear until the list is already quite large.

This issue will remain open and we'll meditate upon it.

berzerk0 commented 6 years ago

Troy Hunt's take on the problem.

Of course, it's possible people actually used these strings as passwords but applying a bit of Occam's Razor suggests that it's simply parsing issues upstream of this data set.

Frankly though, there's little point in removing a few million junk strings. It reduced the overall data size of [Troy's Pwned Passwords V2] by 0.69% and other than the tiny fraction of extra bytes added to the set, it makes no practical difference to how the data is used.

While it is highly likely that these aren't passwords, the very idea that they are not is based on the assumption that we have a good handle on what passwords are. This assumption, for the most part, is true.

However, INTENTIONALLY making passwords that don't look like passwords isn't without merit. I once worked at a company where we had reason to believe that keyloggers were installed on our systems. I had no idea what to do with this information, but it really bothered me. To cope with this, I came up with an idea to use the on-screen keyboard to create a password that looked like a URL.

Certainly, I can't be the only one to come up with the idea of making a password that contains some sort of camouflage. It is still more likely that these are simple "upstream parsing" issues, but including them has such a small impact on list performance that I say they are worth keeping.