freedomofpress / securedrop

GitHub repository for the SecureDrop whistleblower platform. Do not submit tips here!
https://securedrop.org/

Support Diceware wordlists in multiple languages as part of i18n efforts #999

Open toholdaquill opened 9 years ago

toholdaquill commented 9 years ago

In addition to translating the SecureDrop interface (see issue #753), it would also be ideal to support Diceware wordlists in multiple languages. Since sources should memorize their codenames for maximum security, this will make it easier for non-English speakers to use SecureDrop. Currently there are Diceware wordlists available in a dozen or so languages, see:

http://world.std.com/~reinhold/diceware.html

Since a journalist never sees a source's codename, it would be ideal to allow a source to select a different language than the journalist's. For instance, a Turkish source could use SecureDrop in Turkish, receive a Turkish codename, but the English-speaking (say) journalist would use an English-language interface.

garrettr commented 9 years ago

Potentially useful: @micahflee has started translating Diceware wordlists as part of his Passphrases project.

toholdaquill commented 9 years ago

Garrett Robinson:

> Potentially useful: @micahflee has started translating Diceware wordlists as part of his Passphrases project.

nice. :)

Question...

I notice that these wordlists are all basically ASCII:

    $ file *
    catalan-diceware.wordlist:  ASCII text
    dutch-diceware.wordlist:    ASCII text
    english-diceware.wordlist:  C++ source, ASCII text
    french-diceware.wordlist:   ISO-8859 text
    german-diceware.wordlist:   C source, Non-ISO extended-ASCII text
    italian-diceware.wordlist:  ASCII text
    japanese-diceware.wordlist: ASCII text, with CRLF line terminators
    polish-diceware.wordlist:   ASCII text
    securedrop.wordlist:        C++ source, ASCII text
    swedish-diceware.wordlist:  C source, ASCII text

If the goal is to help non-English speakers create strong passphrases that are easy to memorize, it is important that the words be orthographically correct. Accents and umlauts aren't decoration (although they sometimes seem like it to us English speakers), but are essential parts of the meaning. A Swede, for instance, might choose to remember the word "smörgåsbord"; but the ASCII equivalent "smorgasbord" simply isn't a word in Swedish.

Note also that most non-English users have keyboards with locale-specific layouts. For instance, I'm typing this on a Spanish-language keyboard with keys like ñ and so forth (which is actually a PITA for me as an English-speaker, but I digress). Standard locale-specific hotkeys (e.g. ' + e = é) make it easy to enter chars like á, é, í, ó, ú, etc. This aids memorability, but also adds a little bit of entropy--instead of just the 27 chars of the alphabet [[en_*] + ñ], you actually have thirty-odd utf-8 chars once you include the various accents and diereses.

So I think there's a choice to be made, how you'd like to proceed adding multiple language support. You can definitely use these lists now, knowing that suboptimal, at least in this case, is quite a bit better than nothing. ("You vant me to memorize a passphrase en zee Engleesh? Zut alors!")

Long term, though, I think the Western European lists should be converted to orthographically-correct utf-8, with unicode on the horizon for Asian language support.
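
A minimal sketch of the re-encoding half of such a conversion, assuming a list that `file` reports as ISO-8859 text is really ISO-8859-1 (an assumption; the exact encoding of each list would need checking). The `to_utf8_nfc` helper is hypothetical. Note that it cannot restore accents that were already stripped down to ASCII; that part still needs a native speaker.

```python
import unicodedata


def to_utf8_nfc(raw: bytes, legacy_encoding: str = "iso-8859-1") -> bytes:
    """Return wordlist bytes re-encoded as NFC-normalized UTF-8.

    legacy_encoding is an assumption about the source file, not a detected fact.
    """
    try:
        # Pure ASCII and existing UTF-8 files decode cleanly here.
        text = raw.decode("utf-8")
    except UnicodeDecodeError:
        # Otherwise fall back to the assumed legacy encoding.
        text = raw.decode(legacy_encoding)
    # NFC stores an accented letter as one code point where possible, so a
    # word typed with a dead key and a word stored as letter + combining
    # accent end up as the same byte sequence.
    return unicodedata.normalize("NFC", text).encode("utf-8")


legacy_bytes = "smörgåsbord\n".encode("iso-8859-1")
converted = to_utf8_nfc(legacy_bytes)
print(converted)
```

Normalizing to one form (NFC here) is a design choice: whatever form the stored list uses, the passphrase a source types would have to be normalized the same way before comparison.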

I note that the Diceware Kit for other Languages includes this suggestion:

  1. If you wish to add letter combinations in your language that are not in the 26-character Roman alphabet, you of course may do so, but consider whether they will be available on all keyboards that your users will have.

I think this is well-intentioned but incorrect. Since the goal here is to offer users a dropdown ("Select your language"), each language choice should be optimized for users of that language.

Reviewing the Diceware Kit, I'm not seeing any programmatic way to generate these lists. Suck in a whole dictionary, hacking and slicing for string length and other regex requirements? Maybe. But that sounds like more work than building the list by hand, especially since a local speaker would need to review the list before use, anyway.
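
For what it's worth, the "suck in a dictionary and filter it" approach could look roughly like the sketch below. The function names, the 3-to-8-letter limits, and the "prefer shorter words" rule are all assumptions made for illustration, not anything taken from the Diceware Kit, and the output would still need review by a local speaker.

```python
import itertools
import unicodedata


def select_words(dictionary_words, size=6**5, min_len=3, max_len=8):
    """Pick `size` unique, letters-only words, preferring shorter ones."""
    candidates = set()
    for word in dictionary_words:
        word = unicodedata.normalize("NFC", word.strip().lower())
        # str.isalpha() is Unicode-aware, so accented letters pass while
        # digits, apostrophes and hyphens are rejected.
        if min_len <= len(word) <= max_len and word.isalpha():
            candidates.add(word)
    if len(candidates) < size:
        raise ValueError(f"only {len(candidates)} usable words, need {size}")
    shortest_first = sorted(candidates, key=lambda w: (len(w), w))
    # Final order is by code point, not by any language's collation rules.
    return sorted(shortest_first[:size])


def dice_table(words, dice=5):
    """Map dice rolls such as "11111" to words, in list order."""
    if len(words) != 6**dice:
        raise ValueError(f"need exactly {6**dice} words for {dice} dice")
    rolls = ("".join(r) for r in itertools.product("123456", repeat=dice))
    return dict(zip(rolls, words))
```

With the default of five dice, `select_words` returns 7776 words and `dice_table` keys them from "11111" to "66666". Filtering out offensive, obscure, or confusable words is deliberately left out here, since that is the step that needs a human.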

Let me know if I can be of further help with this.

tildelowengrimm commented 8 years ago

How hard is it to type diacritical marks on Tails?

toholdaquill commented 8 years ago

On Wed, Nov 18, 2015 at 03:31:13PM -0800, Tom Lowenthal wrote:

> How hard is it to type diacritical marks on Tails?

That would depend on the keyboard the user has. A Spanish-speaker would likely have a Spanish-language keyboard, other languages would have locale-specific layouts, etc.

tildelowengrimm commented 8 years ago

Have you tested that, or are you supposing? I've never tried using a non en-us layout with Tails.

toholdaquill commented 8 years ago

On Fri, Nov 20, 2015 at 06:14:57PM -0800, Tom Lowenthal wrote:

> Have you tested that, or are you supposing? I've never tried using a non en-us layout with Tails.

I own a laptop with a Spanish-language keyboard.

To replicate in Tails, go to Applications --> System Tools --> Preferences --> System Settings --> Region and Language --> Layouts --> click the '+' button --> select the new keyboard layout you'd like to use.

Tails only supports five display languages, but the keyboard can be configured to any layout you desire.

tildelowengrimm commented 8 years ago

:+1:

philou-felin commented 7 years ago

I agree with the original poster. I had a look at the “Radio-Canada” Secure Box (French Canada) just out of curiosity and noticed that the passphrase was all in English. I think I understand the rationale for SecureDrop creating the passphrase for the user, but it has to be in his/her native tongue.

KwadroNaut commented 7 years ago

Some of the languages on that Diceware page contain too many problematic words, non-words, etc. For Dutch there's been some nice effort by @remko: https://el-tramo.be/blog/diceware-nl/ https://github.com/remko/dicewords/ It could/should be combined with the tests run by the University of Ghent (http://woordentest.ugent.be/ and datasets here: http://crr.ugent.be/programs-data/word-prevalence-values). If it's better to split the localization of Diceware lists into one issue per language, please just move this comment to a separate one.

ghost commented 6 years ago

Note that there is now support for internationalized word lists (currently only French is supported). For Arabic, it would be enough to add an ar.txt file in https://github.com/freedomofpress/securedrop/tree/develop/securedrop/wordlists . However, the code must also be modified to support non-ASCII words, and that is a non-trivial change.
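
As a rough illustration of the per-language file idea, here is a sketch of picking a wordlist by locale with a fallback. The `pick_wordlist` helper and the `en` default are assumptions for the example, not SecureDrop's actual code; only the `<language>.txt` naming comes from the ar.txt example above.

```python
import tempfile
from pathlib import Path


def pick_wordlist(locale: str, wordlist_dir: Path, default: str = "en") -> Path:
    """Return the wordlist file for a locale, falling back to the default.

    Hypothetical helper: tries "fr_FR.txt", then "fr.txt", then "en.txt".
    """
    for name in (locale, locale.split("_")[0], default):
        # Only plain names like "fr" or "pt_BR" are considered, so a
        # locale string can never point outside wordlist_dir.
        if not name.replace("_", "").isalnum():
            continue
        path = wordlist_dir / f"{name}.txt"
        if path.is_file():
            return path
    raise FileNotFoundError(f"no wordlist for {locale!r} or {default!r}")


# Usage with a throwaway directory holding English and French lists.
with tempfile.TemporaryDirectory() as tmp:
    directory = Path(tmp)
    (directory / "en.txt").write_text("apple\n", encoding="utf-8")
    (directory / "fr.txt").write_text("pomme\n", encoding="utf-8")
    french = pick_wordlist("fr_FR", directory).name
    fallback = pick_wordlist("tr", directory).name
```

Here a French locale resolves to fr.txt, while a locale with no list yet (Turkish in this example) falls back to the default list.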

eloquence commented 3 years ago

Note that curating and expanding these word lists is still desirable. It may also be useful to allow admins to configure the preferred language for newly generated journalist designations (which are drawn from a different set of wordlists, currently monolingual).

KwadroNaut commented 3 years ago

Good reminder. @remko updated his tools and wordlists; they can be reused for other languages too if there's a need for it. To my understanding, what he produced (and updates) is MIT-licensed (https://github.com/remko/dicewords/blob/master/LICENSE), and the generation and collection of the list is based on the 'open taal' initiative. tl;dr: if you're fine with it, I'll create a pull request to either include https://el-tramo.be/diceware/diceware-wordlist-8k-composites-nl.txt or redo remko's work to generate another Dutch one.

nabla-c0d3 commented 3 years ago

Currently the PassphraseGenerator explicitly rejects non-ASCII words (to maintain existing behavior):

https://github.com/freedomofpress/securedrop/blob/develop/securedrop/passphrases.py#L59

However, this check can probably be replaced with a check for encode("utf-8") without any problem.
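
A sketch of what such a replacement check might look like. This is not the code in passphrases.py; `is_acceptable_word` is a hypothetical name, and the empty-word and whitespace checks are extra assumptions added for the example.

```python
def is_acceptable_word(word: str) -> bool:
    """Hypothetical check: accept any non-empty, whitespace-free word
    that can be encoded as UTF-8."""
    if not word or any(ch.isspace() for ch in word):
        return False
    try:
        word.encode("utf-8")
    except UnicodeEncodeError:
        # In Python 3 this mainly catches lone surrogates such as "\udcff".
        return False
    return True
```

Since almost every Python 3 string encodes to UTF-8, this check is far more permissive than an ASCII-only one, so any length or character-class limits would have to be enforced separately.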

eloquence commented 3 years ago

For the sprint starting 4/15, @rmol has committed to sharing a first set of wordlists generated using machine translation, so we can begin evaluating the quality of the results and potentially prepare integration.