brix / crypto-js

JavaScript library of crypto standards.

Feature request: Add a unicode encode option #289

Open yhojann-cl opened 4 years ago

yhojann-cl commented 4 years ago

For example, converting a hex representation to a Unicode string fails:

CryptoJS.enc.Hex.parse('80').toString(CryptoJS.enc.Utf8);
// Error: Malformed UTF-8 data

But in native JavaScript I can represent \x80 as a Unicode string:

console.log('\x80');
// "\u0080"
console.log(JSON.parse(JSON.stringify({a:'\x80'})))
// Object { a: "\u0080" }

UTF-8 supports 00 to FF as Unicode values, but CryptoJS.enc.Utf8 supports only valid ASCII characters from 00 to 7F.

Please add a Unicode option for decoding plain strings, such as:

CryptoJS.enc.Hex.parse('80').toString(CryptoJS.enc.Utf8Unicode);
// "\u0080"

yhojann-cl commented 4 years ago

I debugged the source code of CryptoJS.enc.Utf8 in core.js:

var Utf8 = C_enc.Utf8 = {
    stringify: function (wordArray) {
        try {
            return decodeURIComponent(escape(Latin1.stringify(wordArray)));

OK, let's trace it using the Debugger tab in the Firefox developer tools:

stringify: function (wordArray) {
...
        return latin1Chars.join('');

The value of latin1Chars is \u0080, which is fine. The problem is the conversion to UTF-8 using decodeURIComponent:

decodeURIComponent('%7F')
// "\u007f"
decodeURIComponent('%80')
// URIError: malformed URI sequence
encodeURIComponent('\x80')
// "%C2%80"

Could you add a function that returns the plain text without the decodeURIComponent(escape()) step?
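To illustrate why the decodeURIComponent(escape(...)) expression rejects a lone 80 byte, here is a minimal sketch in plain JavaScript. The helper name binaryStringToUtf8 is hypothetical and not part of CryptoJS; it only mirrors the expression quoted above.

```javascript
// Hypothetical helper (not part of CryptoJS): treats each character of
// `binaryString` as one byte and decodes those bytes as UTF-8.
function binaryStringToUtf8(binaryString) {
    // escape() turns every char in 0x80..0xFF into "%XX";
    // decodeURIComponent() then reads the %XX sequences as UTF-8 bytes.
    return decodeURIComponent(escape(binaryString));
}

// The two bytes C2 80 are the UTF-8 encoding of U+0080, so this succeeds.
const ok = binaryStringToUtf8('\xC2\x80');
console.log(ok === '\u0080'); // true

// A lone 80 byte is a continuation byte with no lead byte, so decoding throws.
let failed = false;
try {
    binaryStringToUtf8('\x80');
} catch (e) {
    failed = e instanceof URIError;
}
console.log(failed); // true
```

In other words, the expression behaves as a strict UTF-8 decoder, so any replacement that accepts a bare 80 byte would be decoding some other encoding (for instance Latin1), not UTF-8.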

yhojann-cl commented 4 years ago

I solved this using this.CryptoJS.enc.Latin1, but Latin1 is not Unicode: Latin1 maps each character to exactly one byte. You could still add an equivalent alias to the library, such as var Unicode = C_enc.Unicode = C_enc.Latin1;. For example, Latin1 cannot encode and decode the character with decimal code 300, Ĭ (0xc4ac in UTF-8, or 0x012c as a code point):

escape(String.fromCharCode(300));
// "%u012C"
CryptoJS.enc.Hex.stringify(CryptoJS.enc.Latin1.parse('Ĭ'));
// 2c
CryptoJS.enc.Hex.stringify(CryptoJS.enc.Latin1.parse(String.fromCharCode(300)));
// 2c
console.log(this.CryptoJS.enc.Hex.parse('2c').toString(this.CryptoJS.enc.Latin1).charCodeAt());
// 44 (,)
console.log('\u012c');
// Ĭ
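The '2c' result above is what one would expect if the encoder keeps only the low 8 bits of each character code. The following sketch reproduces that in plain JavaScript and compares it with a real UTF-8 encoding. The helpers latin1Bytes and toHex are hypothetical, and the masking is an assumption inferred from the output above, not a quote of the library source.

```javascript
// Hypothetical stand-in for a Latin1 encoder: keeps only the low 8 bits
// of each UTF-16 code unit (assumed behavior, consistent with '2c' above).
function latin1Bytes(str) {
    const bytes = [];
    for (let i = 0; i < str.length; i++) {
        bytes.push(str.charCodeAt(i) & 0xff);
    }
    return bytes;
}

// Formats a sequence of byte values as a lowercase hex string.
function toHex(bytes) {
    return Array.from(bytes, (b) => b.toString(16).padStart(2, '0')).join('');
}

const ch = String.fromCharCode(300); // 'Ĭ', U+012C
const latin1Hex = toHex(latin1Bytes(ch));
const utf8Bytes = new TextEncoder().encode(ch);
const utf8Hex = toHex(utf8Bytes);

console.log(latin1Hex); // "2c": 0x012c & 0xff, the high byte is lost
console.log(utf8Hex);   // "c4ac": two bytes that decode back to 'Ĭ'
console.log(new TextDecoder().decode(utf8Bytes) === ch); // true
```

Because the high byte is discarded, a Latin1-based alias can only round-trip code points 0 to 255.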

AlttiRi commented 4 years ago

UTF-8 supports 00 to FF as Unicode values, but CryptoJS.enc.Utf8 supports only valid ASCII characters from 00 to 7F.

FF is 255.

The first 128 characters of Unicode, which correspond one-to-one with ASCII, are encoded using a single byte with the same binary value as ASCII, so that valid ASCII text is valid UTF-8-encoded Unicode as well.

https://en.wikipedia.org/wiki/UTF-8

The Unicode character with code 128 takes 2 bytes in UTF-8: 194, 128. A lone 128 byte is not valid UTF-8.

new TextEncoder("utf8").encode(String.fromCharCode(128)) // Uint8Array(2) [194, 128]
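The byte values 194 and 128 follow from the two-byte UTF-8 layout 110xxxxx 10xxxxxx. As a small sketch, the hypothetical helper encodeTwoByteUtf8 below computes them by hand, and a strict TextDecoder confirms that the pair is valid while a lone 128 is not.

```javascript
// Hypothetical helper: encodes a code point in U+0080..U+07FF as two
// UTF-8 bytes using the layout 110xxxxx 10xxxxxx.
function encodeTwoByteUtf8(codePoint) {
    if (codePoint < 0x80 || codePoint > 0x7ff) {
        throw new RangeError('expected a code point in U+0080..U+07FF');
    }
    return [0xc0 | (codePoint >> 6), 0x80 | (codePoint & 0x3f)];
}

const manual = encodeTwoByteUtf8(128);
console.log(manual); // [ 194, 128 ]

// With fatal: true the decoder throws on invalid input instead of
// substituting U+FFFD.
const strict = new TextDecoder('utf-8', { fatal: true });
const decoded = strict.decode(new Uint8Array(manual));
console.log(decoded.charCodeAt(0)); // 128

let rejected = false;
try {
    strict.decode(new Uint8Array([0x80]));
} catch (e) {
    rejected = e instanceof TypeError;
}
console.log(rejected); // true
```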

let a = CryptoJS.enc.Hex.parse("0080")
console.log(a); // WordArray { words: [ 8388608 ], sigBytes: 2 }
a = CryptoJS.enc.Utf16.stringify(a)
console.log(a); // "\u0080" (an invisible control character)
console.log(a.charCodeAt(0));  // 128
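To see why the word 8388608 with sigBytes 2 yields character code 128, the following sketch reads 32-bit big-endian words as 16-bit code units. The function wordsToUtf16String is a hypothetical illustration of that interpretation, written on the assumption that the significant bytes are read two at a time from the most significant end. It is not the library's code.

```javascript
// Hypothetical illustration: reads `sigBytes` bytes from an array of
// 32-bit big-endian words, two bytes (one UTF-16 code unit) at a time.
function wordsToUtf16String(words, sigBytes) {
    let out = '';
    for (let i = 0; i < sigBytes; i += 2) {
        // i % 4 is 0 (upper half of the word) or 2 (lower half).
        const unit = (words[i >>> 2] >>> (16 - (i % 4) * 8)) & 0xffff;
        out += String.fromCharCode(unit);
    }
    return out;
}

// 8388608 is 0x00800000, so its upper 16 bits are 0x0080.
const s = wordsToUtf16String([8388608], 2);
console.log(s.charCodeAt(0)); // 128
```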