ashtuchkin / iconv-lite

Convert character encodings in pure javascript.
MIT License

U+301C and U+FF5E are not correctly mapped in EUC-JP/Shift_JIS/CP932 #145

Open raccy opened 7 years ago

raccy commented 7 years ago

WAVE DASH U+301C and FULLWIDTH TILDE U+FF5E have almost the same glyph but different code points. WAVE DASH (1-33 in JIS X 0208) should be mapped to U+301C, but iconv-lite maps it to U+FF5E. The mappings are incorrect in EUC-JP, Shift_JIS, and CP932.

Convert with iconv-lite

Unicode -> EUC-JP -> Unicode
U+301C -> 3F(no map)
U+FF5E -> 8F A2 B7 -> U+FF5E
A1 C1 -> U+FF5E
Unicode -> Shift_JIS/CP932 -> Unicode
U+301C -> 3F(no map)
U+FF5E -> 81 60 -> U+FF5E

Convert with libiconv

Unicode -> EUC-JP -> Unicode
U+301C -> A1 C1 -> U+301C
U+FF5E -> 8F A2 B7 -> U+FF5E
Unicode -> Shift_JIS -> Unicode
U+301C -> 81 60 -> U+301C
U+FF5E -> (no map)
Unicode -> CP932 -> Unicode
U+301C -> 81 60 -> U+301C
U+FF5E -> 81 60

ashtuchkin commented 7 years ago

Hey raccy, thanks for filing this issue.

In multibyte encodings, iconv-lite tries its best to mirror the WHATWG Encoding Standard. I just checked it out and it maps symbol 1-33 to U+FF5E, see this and this.

Do you have sources other than libiconv that map 1-33 to U+301C? You might want to file an issue on the encoding standard issue tracker. I see there's some minor discussion there about it.

I can probably add the encoding pair U+301C -> 81 60 for Shift_JIS and CP932 to be more flexible, but for the decoding part I currently aim to follow the encoding standard.

What do you think?

ikedas commented 7 years ago

Hi ashtuchkin,

raccy is right. U+FF5E is the mapping according to the Microsoft code page (cp932), which is not authorized by a public standards body. U+301C is the mapping according to the Japanese Industrial Standard (JIS X 0208).

Even more characters are given incompatible mappings between the two mappings above. It is quite a mess for Japanese users. If you prefer, I'd like to provide changes.

ashtuchkin commented 7 years ago

Thanks for chiming in, Ikedas. What do you think of the discussion of the same issue at the encoding standard tracker: https://github.com/whatwg/encoding/issues/47 ?

Note to self: Ambiguities can be seen here: https://www.w3.org/TR/2000/NOTE-japanese-xml-20000414/#ambiguity_of_yen

ikedas commented 7 years ago

takahashim's suggestion looks reasonable to me. The current index-jis0208.txt would be renamed to index-windows31j.txt or similar. Appropriate names would be assigned to the appropriate mappings.

(The problem of indices beyond 8836 (94 × 94) would be a separate matter. They are simply beyond the domain of definition for the CCS by ISO/IEC, i.e. in the domain of vendor extensions.)
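
For readers following the row-cell numbers (such as 1-33) and the 94 × 94 bound, the arithmetic can be sketched as below. This is only an illustration under the usual EUC-JP and Shift_JIS layouts for a 94 × 94 plane; the function names are made up for this note and are not part of iconv-lite or any other library.

```javascript
// Illustrative helpers only; these names are hypothetical, not a library API.

// Zero-based position in a 94 x 94 grid, so valid results are 0..8835.
function kutenToIndex(row, cell) {
  if (row < 1 || row > 94 || cell < 1 || cell > 94) {
    throw new RangeError('row and cell must be in 1..94');
  }
  return (row - 1) * 94 + (cell - 1);
}

// EUC-JP: row and cell are each offset by 0xA0.
function kutenToEucJp(row, cell) {
  kutenToIndex(row, cell); // range check only
  return [row + 0xA0, cell + 0xA0];
}

// Shift_JIS: two rows share one lead byte; the trail byte skips 0x7F.
function kutenToShiftJis(row, cell) {
  kutenToIndex(row, cell); // range check only
  const lead = Math.floor((row - 1) / 2) + (row <= 62 ? 0x81 : 0xC1);
  const trail = row % 2 === 1
    ? cell + (cell <= 63 ? 0x3F : 0x40)
    : cell + 0x9E;
  return [lead, trail];
}
```

For 1-33 this gives A1 C1 in EUC-JP and 81 60 in Shift_JIS, the byte sequences quoted earlier in this thread. Anything outside rows and cells 1..94 is rejected, which is the sense in which larger indices fall outside the 94 × 94 definition.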

On ambiguity, several implementations add one-way (Unicode to legacy) mappings for non-standard encodings, e.g. U+2015 HORIZONTAL BAR to \xA1\xBD EM DASH, so round-trip conversion between cp932-based and JIS-based mappings is more or less satisfied.
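
The one-way idea can be sketched with a toy table. This is an assumption-laden illustration, not how iconv-lite or libiconv store their tables: it covers two EUC-JP codes only, uses a JIS-based decode side, and all names are hypothetical.

```javascript
// Toy tables for illustration only; real tables are far larger.
// Decode side (JIS-based): exactly one Unicode character per legacy code.
const decodeTable = new Map([
  [0xA1BD, '\u2014'], // EM DASH
  [0xA1C2, '\u2016'], // DOUBLE VERTICAL LINE
]);

// Encode side: the reverse of decodeTable ...
const encodeTable = new Map();
for (const [code, ch] of decodeTable) {
  encodeTable.set(ch, code);
}
// ... plus one-way entries for the cp932-style characters.
encodeTable.set('\u2015', 0xA1BD); // HORIZONTAL BAR, encode only
encodeTable.set('\u2225', 0xA1C2); // PARALLEL TO, encode only

// Returns the legacy code, or null when there is no mapping.
function encodeChar(ch) {
  return encodeTable.has(ch) ? encodeTable.get(ch) : null;
}

// Returns the Unicode character, or null when there is no mapping.
function decodeCode(code) {
  return decodeTable.has(code) ? decodeTable.get(code) : null;
}
```

With these tables both U+2014 and U+2015 encode to A1BD, while A1BD always decodes to U+2014, so text produced under either convention can still be encoded.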


(Addition) As takahashim pointed out, the mapping defined by JIS X 0213 is rarely used in practice. It's an extension to JIS X 0208 but not compatible with it.

raccy commented 7 years ago

Thank you for your reply, ashtuchkin.

I found these files.

These are in the OBSOLETE directory, but libiconv probably used these maps.

I don't object to iconv-lite being based on the WHATWG Encoding Standard, and I think that is a good policy. But I think that there are two problems.

  1. iconv-lite behaves differently from node-iconv. This confuses us. (See the code below, and run it.)
  2. JIS X 0208 takes precedence over JIS X 0212 (beginning with 8F), and depending on the implementation, JIS X 0212 may not be supported by EUC-JP.

This code checks encode/decode with iconv-lite and node-iconv.

const Iconv = require('iconv').Iconv;
const lite = require('iconv-lite');

const unicodePoint = s => 'U+' + s.charCodeAt(0).toString(16).toUpperCase();
const bufferString = buf => {
  let s = '[ ';
  for (const b of buf) {
    s += b.toString(16).toUpperCase();
    s += ' ';
  }
  s += ']';
  return s;
};
const up = (m, s) => console.log(m + ' : ' + unicodePoint(s));
const bp = (m, b) => console.log(m + ' -> ' + bufferString(b));
const ubp = (m, b) => up(m, b.toString('utf8'));

const waveDash = '\u301C';
const fwTilde = '\uFF5E';

const eucjp_A1C1 = Buffer.from([0xA1, 0xC1]); // JISX0208 1-33 on EUC-JP
const eucjp_8FA2B7 = Buffer.from([0x8F, 0xA2, 0xB7]); // JISX0212 2-23 on EUC-JP
const sjis_8160 = Buffer.from([0x81, 0x60]); // JISX0208 1-33 on Shift_JIS

console.log('---- Unicode ----');
up('WAVE DASH', waveDash);
up('FULLWIDTH TILDE', fwTilde);

console.log();
console.log('---- iconv-lite ----');
up('EUC-JP A1 C1', lite.decode(eucjp_A1C1, 'eucjp'));
up('EUC-JP 8F A2 B7', lite.decode(eucjp_8FA2B7, 'eucjp'));
bp('WAVE DASH to EUC-JP', lite.encode(waveDash, 'eucjp'));
bp('FULLWIDTH TILDE to EUC-JP', lite.encode(fwTilde, 'eucjp'));
console.log();
up('Shift_JIS 81 60', lite.decode(sjis_8160, 'shift_jis'));
bp('WAVE DASH to Shift_JIS', lite.encode(waveDash, 'shift_jis'));
bp('FULLWIDTH TILDE to Shift_JIS', lite.encode(fwTilde, 'shift_jis'));
console.log();
up('CP932 81 60', lite.decode(sjis_8160, 'cp932'));
bp('WAVE DASH to CP932', lite.encode(waveDash, 'cp932'));
bp('FULLWIDTH TILDE to CP932', lite.encode(fwTilde, 'cp932'));

console.log();
console.log('---- node-iconv ----');
const utf8_waveDash = Buffer.from(waveDash, 'utf8');
const utf8_fwTilde = Buffer.from(fwTilde, 'utf8');

const e2u_iconv = new Iconv('EUC-JP', 'UTF-8');
const u2e_iconv = new Iconv('UTF-8', 'EUC-JP');
ubp('EUC-JP A1 C1', e2u_iconv.convert(eucjp_A1C1));
ubp('EUC-JP 8F A2 B7', e2u_iconv.convert(eucjp_8FA2B7));
bp('WAVE DASH to EUC-JP', u2e_iconv.convert(utf8_waveDash));
bp('FULLWIDTH TILDE to EUC-JP', u2e_iconv.convert(utf8_fwTilde));
console.log();
const s2u_iconv = new Iconv('Shift_JIS', 'UTF-8');
const u2s_iconv = new Iconv('UTF-8', 'Shift_JIS');
ubp('Shift_JIS 81 60', s2u_iconv.convert(sjis_8160));
bp('WAVE DASH to Shift_JIS', u2s_iconv.convert(utf8_waveDash));
try {
  // Error: Illegal character sequence
  bp('FULLWIDTH TILDE to Shift_JIS', u2s_iconv.convert(utf8_fwTilde));
} catch (e) {
  console.log('FULLWIDTH TILDE to Shift_JIS <ERROR> ' + e.message);
}
console.log();
const c2u_iconv = new Iconv('CP932', 'UTF-8');
const u2c_iconv = new Iconv('UTF-8', 'CP932');
ubp('CP932 81 60', c2u_iconv.convert(sjis_8160));
bp('WAVE DASH to CP932', u2c_iconv.convert(utf8_waveDash));
bp('FULLWIDTH TILDE to CP932', u2c_iconv.convert(utf8_fwTilde));

ikedas commented 7 years ago

Mappings on unicode.org may not be compatible with other implementations, e.g. 0x815C / 0x213D is mapped to U+2015 HORIZONTAL BAR. Personally I believe the mapping defined by JIS (it is the only mapping publicly authorized by ISO/IEC 10646) should be referred to; however, investigating existing implementations is useful.

I suggest that at least the 10 mappings mentioned above be checked (both forward and reverse mappings) to compare implementations. Additionally, duplicate mappings such as U+2116 NUMERO SIGN (both JIS X 0208 and JIS X 0212 have it) should be considered.

ikedas commented 7 years ago

I compiled tables to help compare implementations.

The following table shows vendor-dependent mappings. That is, across implementations, a single code point in a legacy character set can be mapped to multiple Unicode characters.

Canonic   Microsoft   JIS X 0208 Annex 5   Code Point
U+203E    U+FFE3      U+FFE3               A1B1
U+2014    U+2015      -                    A1BD
U+301C    U+FF5E      -                    A1C1
U+2016    U+2225      -                    A1C2
U+2212    U+FF0D      -                    A1DD
U+00A5    U+FFE5      U+FFE5               A1EF
U+00A2    U+FFE0      -                    A1F1
U+00A3    U+FFE1      -                    A1F2
U+00AC    U+FFE2      -                    A2CC
U+00A6    U+FFE4      -                    8FA2C3
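
One way to put this table to work when comparing implementations is sketched below. It is only a sketch: the rows are copied from the table above, the Code Point column is read here as EUC-JP byte sequences, and `classifyDecoder` is a hypothetical helper that accepts any function from a Buffer to a string.

```javascript
// Rows from the table above: [EUC-JP bytes (assumed), Canonic, Microsoft].
const VENDOR_ROWS = [
  [[0xA1, 0xB1], '\u203E', '\uFFE3'],
  [[0xA1, 0xBD], '\u2014', '\u2015'],
  [[0xA1, 0xC1], '\u301C', '\uFF5E'],
  [[0xA1, 0xC2], '\u2016', '\u2225'],
  [[0xA1, 0xDD], '\u2212', '\uFF0D'],
  [[0xA1, 0xEF], '\u00A5', '\uFFE5'],
  [[0xA1, 0xF1], '\u00A2', '\uFFE0'],
  [[0xA1, 0xF2], '\u00A3', '\uFFE1'],
  [[0xA2, 0xCC], '\u00AC', '\uFFE2'],
  [[0x8F, 0xA2, 0xC3], '\u00A6', '\uFFE4'],
];

// Format bytes as upper-case hex pairs, e.g. "A1 C1".
const hex = bytes =>
  Array.from(bytes, b => b.toString(16).toUpperCase().padStart(2, '0')).join(' ');

// Reports, per row, which column the decoder's output matches.
function classifyDecoder(decode) {
  return VENDOR_ROWS.map(([bytes, canonic, microsoft]) => {
    const got = decode(Buffer.from(bytes));
    let family = 'other';
    if (got === canonic) family = 'canonic';
    else if (got === microsoft) family = 'microsoft';
    return { bytes: hex(bytes), family };
  });
}
```

With iconv-lite installed, the decoder could be passed as `buf => require('iconv-lite').decode(buf, 'eucjp')`; the same rows can be reused for other decoders by wrapping them in a function of the same shape.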

The following table shows non-injective mappings. That is, across implementations, multiple code points in a legacy character set will be mapped to a single Unicode character.

Canonic   JIS X 0212   IBM/NEC ext.   Unicode
ADE2      8FA2F1       8FF4AC         U+2116
ADE4      -            8FF4AD         U+2121
ADB5      -            8FF3FD         U+2160
ADB6      -            8FF3FE         U+2161
ADB7      -            8FF4A1         U+2162
ADB8      -            8FF4A2         U+2163
ADB9      -            8FF4A3         U+2164
ADBA      -            8FF4A4         U+2165
ADBB      -            8FF4A5         U+2166
ADBC      -            8FF4A6         U+2167
ADBD      -            8FF4A7         U+2168
ADBE      -            8FF4A8         U+2169
A2E5      -            ADF5           U+221A
A2DC      -            ADF7           U+2220
A2C1      -            ADFB           U+2229
A2C0      -            ADFC           U+222A
A2E9      -            ADF2           U+222B
A2E8      -            ADFA           U+2235
A2E2      -            ADF0           U+2252
A2E1      -            ADF1           U+2261
A2DD      -            ADF6           U+22A5
ADEA      -            8FF4AB         U+3231
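
When building an encode table from such data, the duplicates have to be resolved somehow. The sketch below shows one possible approach using a few rows from the table above; treating the first listed code as the preferred one is an assumption made for illustration, and `buildEncodeTable` is a hypothetical name.

```javascript
// A few rows from the table above: [legacy code (hex), Unicode character].
// Listing the "Canonic" code first is an assumption of this sketch.
const ENTRIES = [
  ['ADE2', '\u2116'], ['8FA2F1', '\u2116'], ['8FF4AC', '\u2116'],
  ['ADE4', '\u2121'], ['8FF4AD', '\u2121'],
  ['A2E5', '\u221A'], ['ADF5', '\u221A'],
  ['A1C1', '\u301C'], // from the first table; no duplicate
];

// Keeps the first code seen for each character and records the rest.
function buildEncodeTable(entries) {
  const encode = new Map();     // character -> preferred legacy code
  const duplicates = new Map(); // character -> other codes that decode to it
  for (const [code, ch] of entries) {
    if (!encode.has(ch)) {
      encode.set(ch, code);
    } else {
      if (!duplicates.has(ch)) duplicates.set(ch, []);
      duplicates.get(ch).push(code);
    }
  }
  return { encode, duplicates };
}

const { encode, duplicates } = buildEncodeTable(ENTRIES);
```

Every code in `duplicates` can still be accepted when decoding; only the encode direction needs a single choice per character.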