ashtuchkin / iconv-lite

Convert character encodings in pure javascript.
MIT License

[EUC-JP] U+4FFF(俿) is encoded to an IBM extended character (IBM拡張文字) (8FB1C8) instead of EUC-JP(F9BB) #285

Closed: mercury233 closed this issue 3 years ago

mercury233 commented 3 years ago
const iconvLite = require("iconv-lite");
const theChar = String.fromCharCode(0x4FFF);
const theEncodeResult = iconvLite.encode(theChar, 'EUC-JP');
const theDecodeResult1 = iconvLite.decode(Buffer.from([0x8F, 0xB1, 0xC8]), 'EUC-JP');
const theDecodeResult2 = iconvLite.decode(Buffer.from([0xF9, 0xBB]), 'EUC-JP');

console.log(theChar);
console.log(theEncodeResult);
console.log(theDecodeResult1);
console.log('------');
console.log(theDecodeResult2);
console.log(theDecodeResult1 === theDecodeResult2);

https://runkit.com/mercury233/6177adadef03d40008209995

As you can see, both 8FB1C8 and F9BB can be decoded, but the character is encoded as 8FB1C8 rather than the F9BB I expected.

ashtuchkin commented 3 years ago

Thanks for the runkit link! I see "俿" is encoded as <8F, B1, C8> (theEncodeResult). What do you mean by "it can't be encoded correctly"? Is this encoding incorrect?

mercury233 commented 3 years ago

I know very little about character encoding, but I found that the EUC-JP code for "俿" may be F9BB, and iconv-lite can indeed decode it.

ashtuchkin commented 3 years ago

Well, honestly, I don't know much about EUC-JP either :) Current behavior seems reasonable, so I'm not sure what to do here. Let me know if you learn anything more specific (ideally with a link to some kind of standard), I can then reopen the issue. Thanks!
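
For anyone picking this up later, here is a rough sketch that might help frame the question. It looks only at the byte structure of the two sequences from this thread, using the usual EUC-JP layout: a 0x8F prefix introduces a three-byte code set 3 character (JIS X 0212), and two bytes in the range A1-FE form a code set 1 character (the JIS X 0208 plane, where vendors have also placed extensions). `classifyEucJp` is a hypothetical helper, not part of iconv-lite. It does not consult any mapping table, so it cannot say which byte sequence is the "correct" one for "俿".

```javascript
// Hypothetical helper (not an iconv-lite API): classify a single EUC-JP
// byte sequence by code set, based only on its byte structure.
function classifyEucJp(bytes) {
  // Row/cell bytes of the multi-byte code sets must lie in 0xA1..0xFE.
  const inGR = (b) => b >= 0xA1 && b <= 0xFE;
  const [b0, b1, b2] = bytes;

  if (bytes.length === 1 && b0 <= 0x7F) {
    return { codeSet: 0, charset: 'ASCII' };
  }
  if (bytes.length === 2 && b0 === 0x8E && b1 >= 0xA1 && b1 <= 0xDF) {
    return { codeSet: 2, charset: 'half-width katakana' };
  }
  if (bytes.length === 3 && b0 === 0x8F && inGR(b1) && inGR(b2)) {
    // 0x8F (SS3) prefix: three-byte form, row/cell taken from bytes 2 and 3.
    return { codeSet: 3, charset: 'JIS X 0212', row: b1 - 0xA0, cell: b2 - 0xA0 };
  }
  if (bytes.length === 2 && inGR(b0) && inGR(b1)) {
    // Two-byte form: JIS X 0208 plane (possibly a vendor extension row).
    return { codeSet: 1, charset: 'JIS X 0208 plane', row: b0 - 0xA0, cell: b1 - 0xA0 };
  }
  return null; // not a structurally valid single EUC-JP character
}

console.log(classifyEucJp([0x8F, 0xB1, 0xC8])); // code set 3, row 17, cell 40
console.log(classifyEucJp([0xF9, 0xBB]));       // code set 1, row 89, cell 27
```

So the two byte sequences sit in different code sets. A link to a mapping table or standard that says which of the two an encoder should prefer for U+4FFF would be the kind of reference requested above.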