ashtuchkin / iconv-lite

Convert character encodings in pure javascript.
MIT License

EUC-JP Yen sign (¥) and overline (‾) decoded incorrectly #218

Open kshetline opened 5 years ago

kshetline commented 5 years ago

I see there's an earlier (much earlier, 2014!) bug about these two characters not being encoded correctly for Shift_JIS. Now it seems there's a problem with them not being decoded correctly in EUC-JP, Shift_JIS, and Big5.

I changed shiftjis-test.js to add these two characters to the test:

    it("ShiftJIS correctly encoded/decoded", function() {
        var testString = "¥‾中文abc", //unicode contains ShiftJIS-code and ascii
            testStringBig5Buffer = new Buffer([0x5C, 0x7E, 0x92, 0x86, 0x95, 0xb6, 0x61, 0x62, 0x63]),
            testString2 = '測試',
            testStringBig5Buffer2 = new Buffer([0x91, 0xaa, 0x8e, 0x8e]);

        assert.strictEqual(iconv.encode(testString, "shiftjis").toString('hex'), testStringBig5Buffer.toString('hex'));
        assert.strictEqual(iconv.decode(testStringBig5Buffer, "shiftjis"), testString);
        assert.strictEqual(iconv.encode(testString2, 'shiftjis').toString('hex'), testStringBig5Buffer2.toString('hex'));
        assert.strictEqual(iconv.decode(testStringBig5Buffer2, 'shiftjis'), testString2);
    });

...and the second assert fails. The Yen sign and overline get decoded as if they were ASCII, as backslash and tilde. A similar failure occurs in big5-test.js when I add ¥ and ‾ to testString.

I wasn't sure where EUC-JP was tested specifically.

kshetline commented 5 years ago

I've looked into Shift_JIS more, and it's left me uncertain about how best to handle these conflicting characters.

While there's plenty of clear information about how ¥ and ‾ take over for \ and ~, I haven't been able to find any clear statement about whether \ and ~ simply don't exist in Shift_JIS, or if there are alternate (probably multi-byte) encodings to handle these two displaced ASCII characters.

When I try to encode \ or ~ using node-iconv it throws an error.

Your iconv-lite encodes both ¥ and \ as 0x5C, and both ‾ and ~ as 0x7E.

Perhaps that is the best thing to do on the encoding side if there aren't proper encodings for \ and ~. Users of Shift_JIS are apparently accustomed to these particular confusions, and these substitutions might provide more info to the user than treating \ and ~ as unknown characters. In that case, though, the decoding side should favor ¥ over \, and ‾ over ~, if \ and ~ don't have their own unique encodings.
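
To make the suggestion concrete, here is a minimal sketch of that policy for the two bytes in question only. The helper names (`encodeSingleByte`, `decodeSingleByte`) and the lookup tables are hypothetical and are not part of iconv-lite's API or internals; the sketch also assumes that \ and ~ really have no separate encodings of their own, which is the open question above.

```javascript
// Hypothetical helpers illustrating the suggested policy.
// They are NOT iconv-lite functions and only cover the single-byte range below 0x80.
const YEN = "\u00A5";      // ¥
const OVERLINE = "\u203E"; // ‾

// Encoding side: ¥ and \ share byte 0x5C, ‾ and ~ share byte 0x7E.
const ENCODE_MAP = new Map([
  [YEN, 0x5c],
  ["\\", 0x5c],
  [OVERLINE, 0x7e],
  ["~", 0x7e],
]);

// Decoding side: the shared bytes are read back as ¥ and ‾, not as ASCII.
const DECODE_MAP = new Map([
  [0x5c, YEN],
  [0x7e, OVERLINE],
]);

function encodeSingleByte(ch) {
  if (ENCODE_MAP.has(ch)) return ENCODE_MAP.get(ch);
  const code = ch.codePointAt(0);
  if (code < 0x80) return code; // remaining ASCII passes through unchanged
  throw new RangeError("character is outside the scope of this sketch");
}

function decodeSingleByte(byte) {
  if (DECODE_MAP.has(byte)) return DECODE_MAP.get(byte);
  if (byte < 0x80) return String.fromCharCode(byte);
  throw new RangeError("byte is outside the scope of this sketch");
}
```

With this policy, ¥ and ‾ survive a round trip, while \ and ~ come back as ¥ and ‾. That loss is the trade-off being proposed, not a bug in the sketch.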