Add NFKD normalization + Fix for Japanese

dabura667 commented 8 years ago

This library was not using normalization at all, which for Spanish and Japanese would have produced invalid seeds. (So anyone using this library to generate wallets from Spanish or Japanese phrases must use the old non-normalizing version to recover their funds. I doubt anyone is there... but a warning might be necessary? or maybe create a new function for generating non-normalized seed?)

Chinese and English were ok, as their wordlists were pre-normalized (so unorm.nfkd(words) == words) so they should be fine as is.

Also, I added in one change for Japanese, mentioned on the BIP39: Japanese must be shown to the user being separated by an ideographic space. This is crucial to ensure users don't accidentally view 2 words as 1 word.

Ex.

"a part" // this is a normal ASCII space
"a　part" // this is an ideographic space

It doesn't seem that necessary when letters are so small, but looking at Japanese.

"あさ ごはん" // this is a normal ASCII space
"あさ　ごはん" // this is an ideographic space

It makes a huge difference, and the latter must be shown to the user.

Also notice that ideographic space will be replaced by ASCII space when NFKD normalized, so while the words themselves are NFKD in the wordlist, because Japanese requires non-NFKD ideographic spaces for the "phrase" string, I have placed a catch-all NFKD in the call to pbkdf2 around this.phrase

Users in Japan will also likely input the phrase using ideographic spaces to input it, so I NFKD the mnemonic input to outward facing functions.

dabura667 commented 8 years ago

My main motivation for this fix is because it seems like Copay might use BIP39 in backups some way, and I was checking to see the ideographic space usage and luckily I did, as Spanish would have been generating bad wallets (non-BIP39-standard).

Also note that Japanese phrases are unique in that ALL Japanese characters are "breakable" and thus textwrap will break a word in the middle on a line break. Care must be taken to ensure a word is not broken on the line break and shown to the user.

Reference: (breadwallet is still trying to get it right... it is difficult) https://github.com/voisine/breadwallet/issues/231 https://github.com/voisine/breadwallet/commit/dd1bfae75409009f087a72c3d7bd7a10fa845570

dabura667 commented 8 years ago

Wait a sec, I will add Japanese test vectors.

dabura667 commented 8 years ago

I tried to alter the test vectors to also use a Japanese passphrase with pbkdf2 instead of just "TREZOR" so that it could test normalizing of passphrase as well.

I realized how long it will take, and I don't have enough time, so I will just submit this PR as is.

Here are the Japanese test vectors I have prepared: non-normalized strings: https://raw.githubusercontent.com/bip32JP/bip32JP.github.io/6f6090b49bb718711904468bce99a73770e09071/test_JP_BIP39.json normalized (except for spaces) strings: https://raw.githubusercontent.com/bip32JP/bip32JP.github.io/377f72c5087533c34c79ba02335d1fbc5509dfa5/test_JP_BIP39.json

matiu commented 8 years ago

LGTM, great work.

pnagurny commented 8 years ago

LGTM

bitpay / bitcore-mnemonic

Add NFKD normalization + Fix for Japanese #27