Regularize the dictionary sizes for each language

bbusschots / hsxkpasswd

A Perl module and terminal command for generating secure memorable passwords inspired by the fabulous XKCD web comic and Steve Gibson's Password Hay Stacks. This is the library that powers www.xkpasswd.net

http://www.bartb.ie/xkpasswd

BSD 2-Clause "Simplified" License


Regularize the dictionary sizes for each language #13

Open bknowles opened 9 years ago

bknowles commented 9 years ago

So, I cloned the repo and checked out the code.

787853 lines for PT.pm? Seriously? Your program fails to do its job if the dictionary it chooses from is so large that people can't immediately recognize and understand most of the words. Most people have a working vocabulary of about five to ten thousand words, so a dictionary of much more than ten thousand words is already a stretch, though not an excessive one.

But three quarters of a million words?!? Even "huge" dictionaries only have on the order of ninety to a hundred thousand words. I can't imagine a dictionary that would have 750,000 words.

bbusschots commented 9 years ago

This is something I need help with from native speakers of the various languages.

I used the best (or is that least-bad) free and open source dictionary files I found online.

I'm not sure if it would be easier to start over with a different dictionary for each of the non-English languages, or if a native speaker could trim these existing dictionaries down to a more sane size.

Bottom line - this is very much on my radar, but, not something I can do without help from the community.

tflo commented 8 years ago

I’ve started working on a German word list, based mainly on the frami Hunspell dictionary, maybe with some additions from the WinEdit dictionary (the one that comes with HSXKPasswd). I’m aiming at something between 20,000 and 80,000 words.

Just one question: As far as I can tell the minimum word size of 4 chars is hardcoded. Are there plans to make this user-configurable in the future? If not, I’ll discard the shorter words.

Tom

bbusschots commented 8 years ago

@tflo fantastic - thanks!

There are no plans to allow words shorter than 4 letters, so you can safely ignore them.

tflo commented 8 years ago

I couldn’t spare too much time recently but I already filtered the four-, five- and six-letter words from the Hunspell dictionary. I’ll continue this way up to eight-letter words and I’ll add a reasonable amount of longer words, too. (Up to 12 letters or a bit more.) I will also add words from the WinEdit dictionary.

So far the results are not too shabby. (“Not too shabby” = easy to memorize.)

For example, what I just got with my 6-word (diceware-like) setting:

Urin:hupen:beste:Putin:Bombe:Toxin

I like that one ;-)

You can download my draft lists from this directory. They still contain the commented-out words. The current list has 7057 active words (only 4-, 5- and 6-letter words so far).

cmrd-senya commented 7 years ago

Isn't your English dictionary too small? It has about 1000 words. Is that enough to give a number of combinations comparable to an 8-character password made of Latin letters, digits and special symbols?

bbusschots commented 7 years ago

@cmrd-senya it could definitely do with being bigger. I'd be delighted to accept a pull request with a bigger one (preferably free of 'naughty' words of course).
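
To put rough numbers on the question above, here is a minimal sketch in Python. It rests on assumptions that are not from this thread: the "8-character" password draws uniformly from the 95 printable ASCII characters, words are picked uniformly and independently, and any separators, padding or case changes the tool may add are ignored. The function names are made up for illustration.

```python
import math

def bits_for_chars(alphabet_size: int, length: int) -> float:
    """Entropy in bits of a random string of `length` symbols."""
    return length * math.log2(alphabet_size)

def words_needed(dictionary_size: int, target_bits: float) -> int:
    """Smallest number of randomly chosen words reaching `target_bits`."""
    return math.ceil(target_bits / math.log2(dictionary_size))

# Assumption: 95 printable ASCII characters, 8 of them.
target = bits_for_chars(95, 8)

for size in (1000, 10000):
    print(f"{size}-word dictionary: {words_needed(size, target)} words "
          f"to match {target:.1f} bits")
```

Under these assumptions a 1000-word dictionary needs 6 words to match the roughly 52.6 bits of the 8-character password, while a 10,000-word dictionary needs 4.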

mshulman commented 7 years ago

Maybe this? http://gcide.gnu.org.ua/download

But I glanced at the resulting dictionary, and it will take some work to clean this up to be a word list.

Or this: https://github.com/first20hours/google-10000-english (the 10k most common English words). Going from about 1,000 to 10,000 words adds log2(10) ≈ 3.3 bits of entropy per word (if my math is right). And this list removes swear words: https://github.com/first20hours/google-10000-english/blob/master/google-10000-english-no-swears.txt

That list does include 1, 2 and 3 letter words, but if you remove them, there are still 8,229 words. I'll extract that list and send you a PR.
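
The extraction step could look something like the sketch below. It is only an illustration, not the script actually used for the PR: `filter_words` is a hypothetical helper, the sample words are invented, and the 8,229 figure has not been checked against it. It keeps purely alphabetic words of at least 4 letters (the minimum mentioned earlier in the thread) and drops duplicates.

```python
def filter_words(lines, min_len=4):
    """Keep alphabetic words with at least `min_len` letters, without duplicates."""
    seen = set()
    kept = []
    for line in lines:
        word = line.strip()
        # str.isalpha() also accepts non-ASCII letters such as "ü".
        if len(word) >= min_len and word.isalpha() and word not in seen:
            seen.add(word)
            kept.append(word)
    return kept

# Invented sample; a real run would iterate over the lines of the word-list file.
sample = ["the", "of", "word", "list", "a", "word", "don't", "Bombe"]
print(filter_words(sample))  # ['word', 'list', 'Bombe']
```

To process a real list, the lines of an open file can be passed in place of `sample`.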