U+301C and U+FF5E are not correctly mapped in EUC-JP/Shift_JIS/CP932

raccy commented 7 years ago

WAVE DASH U+301C 〜 and FULLWIDTH TILDE U+FF5E ～ have almost the same glyph but are different code points. WAVE DASH (1-33 in JIS X 0208) should be mapped to U+301C, but iconv-lite maps it to U+FF5E. The mappings are incorrect in EUC-JP, Shift_JIS, and CP932.
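
For readers unfamiliar with the "1-33" notation: it is a JIS X 0208 row-cell (ku-ten) position. A minimal sketch of how such a position turns into the byte sequences quoted in this issue (A1 C1 in EUC-JP, 81 60 in Shift_JIS) follows. The function names are made up for illustration and are not part of iconv-lite, and the sketch assumes plain JIS X 0208 rows 1 to 94 without vendor extensions.

```javascript
// Illustrative helpers (hypothetical names, not iconv-lite APIs).

// JIS X 0208 row-cell (ku-ten) position -> EUC-JP bytes.
function kutenToEucJp(ku, ten) {
  return [0xA0 + ku, 0xA0 + ten];
}

// JIS X 0208 row-cell (ku-ten) position -> Shift_JIS bytes.
function kutenToShiftJis(ku, ten) {
  // Two rows share one lead byte; lead bytes skip the range 0xA0-0xDF.
  const lead = Math.floor((ku - 1) / 2) + (ku <= 62 ? 0x81 : 0xC1);
  let trail;
  if (ku % 2 === 1) {
    // Odd row: trail bytes 0x40-0x9E, skipping 0x7F.
    trail = ten + 0x3F + (ten >= 64 ? 1 : 0);
  } else {
    // Even row: trail bytes 0x9F-0xFC.
    trail = ten + 0x9E;
  }
  return [lead, trail];
}

const hex = bytes =>
  bytes.map(b => b.toString(16).toUpperCase().padStart(2, '0')).join(' ');

console.log(hex(kutenToEucJp(1, 33)));    // A1 C1
console.log(hex(kutenToShiftJis(1, 33))); // 81 60
```

So the byte sequences are not in dispute; the disagreement is only about which Unicode code point those bytes correspond to.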

Convert with iconv-lite

Unicode	->	EUC-JP	->	Unicode
U+301C	->	3F(no map)
U+FF5E	->	8F A2 B7	->	U+FF5E
		A1 C1	->	U+FF5E

Unicode	->	Shift_JIS/CP932	->	Unicode
U+301C	->	3F(no map)
U+FF5E	->	81 60	->	U+FF5E

Convert with libiconv

Unicode	->	EUC-JP	->	Unicode
U+301C	->	A1 C1	->	U+301C
U+FF5E	->	8F A2 B7	->	U+FF5E

Unicode	->	Shift_JIS	->	Unicode
U+301C	->	81 60	->	U+301C
U+FF5E	->	(no map)

Unicode	->	CP932	->	Unicode
U+301C	->	81 60	->	U+301C
U+FF5E	->	81 60

ashtuchkin commented 7 years ago

Hey raccy, thanks for filing this issue.

In multibyte encodings, iconv-lite tries its best to mirror the WHATWG Encoding Standard. I just checked it out and it maps symbol 1-33 to U+FF5E, see this and this.

Do you have sources other than libiconv that map 1-33 to U+301C? You might want to file an issue on the Encoding Standard issue tracker. I see there's some minor discussion there about it.

I can probably add the encoding pair U+301C -> 81 60 for Shift_JIS and CP932 to be more flexible, but for the decoding part I currently aim to follow the Encoding Standard.
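
To make the proposal concrete, here is a rough sketch of what an encode-only pair could look like, using toy one-entry tables. The table layout and function names are assumptions for illustration only; they are not how iconv-lite stores its Shift_JIS data.

```javascript
// Toy tables for illustration; real Shift_JIS tables are far larger and
// these names are hypothetical, not iconv-lite internals.
const decodeTable = new Map([
  [0x8160, 0xFF5E], // decoding follows the WHATWG index: 81 60 -> FULLWIDTH TILDE
]);

// Start from the inverse of the decode table...
const encodeTable = new Map([...decodeTable].map(([bytes, cp]) => [cp, bytes]));
// ...then add an encode-only alias so WAVE DASH no longer becomes '?'.
encodeTable.set(0x301C, 0x8160);

function encodeChar(ch) {
  const bytes = encodeTable.get(ch.codePointAt(0));
  return bytes === undefined
    ? Buffer.from([0x3F]) // unmapped character: '?'
    : Buffer.from([bytes >> 8, bytes & 0xFF]);
}

function decodePair(buf) {
  const cp = decodeTable.get((buf[0] << 8) | buf[1]);
  return cp === undefined ? '\uFFFD' : String.fromCodePoint(cp);
}

console.log(encodeChar('\u301C')); // <Buffer 81 60>
console.log(encodeChar('\uFF5E')); // <Buffer 81 60>
console.log(decodePair(encodeChar('\u301C')) === '\uFF5E'); // true
```

The trade-off is visible in the last line: U+301C encodes without loss of data, but it comes back as U+FF5E after decoding, so the round trip is not identity.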

What do you think?

ikedas commented 7 years ago

Hi ashtuchkin,

raccy is right. U+FF5E is the mapping used by the Microsoft code page (cp932), which is not authorized by a public standards body. U+301C is the mapping according to the Japanese Industrial Standard (JIS X 0208).

Shift_JIS would be better off conforming to JIS X 0208: the detailed encoding scheme is defined in Annex 1 of this standard.
EUC-JP would be better off conforming to eucjp-ascii as defined by OSF/JVC. Though it is not a national standard, it is identical to x-eucjp-open-19970715-ascii listed in XML Japanese Profile.

More characters are also given incompatible mappings between the two mappings above. It is quite a mess for Japanese users. If you like, I can provide changes.

ashtuchkin commented 7 years ago

Thanks for chiming in, ikedas. What do you think of the discussion of the same issue at the encoding standard tracker: https://github.com/whatwg/encoding/issues/47 ?

Note to self: Ambiguities can be seen here: https://www.w3.org/TR/2000/NOTE-japanese-xml-20000414/#ambiguity_of_yen

ikedas commented 7 years ago

takahashim's suggestion looks reasonable to me. The current index-jis0208.txt would be renamed to index-windows31j.txt or similar. Appropriate names would be assigned to appropriate mappings.

(The problem of indices beyond 8836 (94 × 94) would be a separate matter. They are simply beyond the domain of definition for the CCS by ISO/IEC, i.e. in the domain of vendor extensions.)

On ambiguity, several implementations add one-way (Unicode to legacy) mappings for non-standard encodings, e.g. U+2015 HORIZONTAL BAR to \xA1\xBD EM DASH, so round-trip conversion between cp932-based and JIS-based mappings is more or less satisfied.

(Addition) As takahashim pointed out, the mapping defined by JIS X 0213 is rarely used in practice. It's an extension to JIS X 0208 but not compatible with it.

raccy commented 7 years ago

Thank you for your reply, ashtuchkin.

I found these files.

ftp://ftp.unicode.org/Public/MAPPINGS/OBSOLETE/EASTASIA/JIS/JIS0208.TXT
ftp://ftp.unicode.org/Public/MAPPINGS/OBSOLETE/EASTASIA/JIS/JIS0212.TXT

These are in the OBSOLETE directory, but libiconv probably used these maps.

I don't object to iconv-lite being based on the WHATWG Encoding Standard, and I think that is a good policy. But I think that there are two problems.

iconv-lite behaves differently from node-iconv, which is confusing. (See the code below, and run it.)
JIS X 0208 takes precedence over JIS X 0212 (beginning with 8F), and depending on the implementation, JIS X 0212 may not be supported by EUC-JP.

This code checks encode/decode with iconv-lite and node-iconv.

const Iconv = require('iconv').Iconv;
const lite = require('iconv-lite');

const unicodePoint = s => 'U+' + s.charCodeAt(0).toString(16).toUpperCase();
const bufferString = buf => {
  // Format each byte as two uppercase hex digits, e.g. [ A1 C1 ].
  let s = '[ ';
  for (const b of buf) {
    s += b.toString(16).toUpperCase().padStart(2, '0');
    s += ' ';
  }
  s += ']';
  return s;
};
const up = (m, s) => console.log(m + ' : ' + unicodePoint(s));
const bp = (m, b) => console.log(m + ' -> ' + bufferString(b));
const ubp = (m, b) => up(m, b.toString('utf8'));

const waveDash = '\u301C';
const fwTilde = '\uFF5E';

// Buffer.from replaces the deprecated `new Buffer(array)` constructor.
const eucjp_A1C1 = Buffer.from([0xA1, 0xC1]); // JISX0208 1-33 on EUC-JP
const eucjp_8FA2B7 = Buffer.from([0x8F, 0xA2, 0xB7]); // JISX0212 1-23 on EUC-JP
const sjis_8160 = Buffer.from([0x81, 0x60]); // JISX0208 1-33 on Shift_JIS

console.log('---- Unicode ----');
up('WAVE DASH', waveDash);
up('FULLWIDTH TILDE', fwTilde);

console.log();
console.log('---- iconv-lite ----');
up('EUC-JP A1 C1', lite.decode(eucjp_A1C1, 'eucjp'));
up('EUC-JP 8F A2 B7', lite.decode(eucjp_8FA2B7, 'eucjp'));
bp('WAVE DASH to EUC-JP', lite.encode(waveDash, 'eucjp'));
bp('FULLWIDTH TILDE to EUC-JP', lite.encode(fwTilde, 'eucjp'));
console.log();
up('Shift_JIS 81 60', lite.decode(sjis_8160, 'shift_jis'));
bp('WAVE DASH to Shift_JIS', lite.encode(waveDash, 'shift_jis'));
bp('FULLWIDTH TILDE to Shift_JIS', lite.encode(fwTilde, 'shift_jis'));
console.log();
up('CP932 81 60', lite.decode(sjis_8160, 'cp932'));
bp('WAVE DASH to CP932', lite.encode(waveDash, 'cp932'));
bp('FULLWIDTH TILDE to CP932', lite.encode(fwTilde, 'cp932'));

console.log();
console.log('---- node-iconv ----');
const utf8_waveDash = Buffer.from(waveDash, 'utf8');
const utf8_fwTilde = Buffer.from(fwTilde, 'utf8');

const e2u_iconv = new Iconv('EUC-JP', 'UTF-8');
const u2e_iconv = new Iconv('UTF-8', 'EUC-JP');
ubp('EUC-JP A1 C1', e2u_iconv.convert(eucjp_A1C1));
ubp('EUC-JP 8F A2 B7', e2u_iconv.convert(eucjp_8FA2B7));
bp('WAVE DASH to EUC-JP', u2e_iconv.convert(utf8_waveDash));
bp('FULLWIDTH TILDE to EUC-JP', u2e_iconv.convert(utf8_fwTilde));
console.log();
const s2u_iconv = new Iconv('Shift_JIS', 'UTF-8');
const u2s_iconv = new Iconv('UTF-8', 'Shift_JIS');
ubp('Shift_JIS 81 60', s2u_iconv.convert(sjis_8160));
bp('WAVE DASH to Shift_JIS', u2s_iconv.convert(utf8_waveDash));
try {
  // Error: Illegal character sequence
  bp('FULLWIDTH TILDE to Shift_JIS', u2s_iconv.convert(utf8_fwTilde));
} catch (e) {
  console.log('FULLWIDTH TILDE to Shift_JIS <ERROR> ' + e.message);
}
console.log();
const c2u_iconv = new Iconv('CP932', 'UTF-8');
const u2c_iconv = new Iconv('UTF-8', 'CP932');
ubp('CP932 81 60', c2u_iconv.convert(sjis_8160));
bp('WAVE DASH to CP932', u2c_iconv.convert(utf8_waveDash));
bp('FULLWIDTH TILDE to CP932', u2c_iconv.convert(utf8_fwTilde));

ikedas commented 7 years ago

Mappings on unicode.org may not be compatible with other implementations, e.g. 0x815C / 0x213D is mapped to U+2015 HORIZONTAL BAR. Personally I believe the mapping defined by JIS (it is the only mapping publicly authorized by ISO/IEC 10646) should be referred to; however, investigating existing implementations is useful.

I suggest that at least 10 mappings mentioned above should be checked (both forward and reverse mappings) to compare implementations. Additionally, duplicate mappings such as U+2116 NUMERO SIGN (both JIS X 0208 and JIS X 0212 have it) should be considered.

ikedas commented 7 years ago

I compiled tables to help compare implementations.

IMO, “Canonic” in the tables below would provide bi-directional conversion (from and to Unicode), while the others would provide only reverse (from Unicode) or forward (to Unicode) conversion.
Note that the tables below focus on EUC-JP implementations. They are not necessarily applicable to Shift_JIS / cp932.

The following table shows vendor-dependent mappings. That is, across implementations, a single code point in the legacy character set can be mapped to multiple Unicode characters.

Canonic	Microsoft	JIS X 0208 Annex 5	Code Point
U+203E	U+FFE3	U+FFE3	A1B1
U+2014	U+2015		A1BD
U+301C	U+FF5E		A1C1
U+2016	U+2225		A1C2
U+2212	U+FF0D		A1DD
U+00A5	U+FFE5	U+FFE5	A1EF
U+00A2	U+FFE0		A1F1
U+00A3	U+FFE1		A1F2
U+00AC	U+FFE2		A2CC
U+00A6	U+FFE4		8FA2C3
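
As an illustration of how the "Canonic" and "Microsoft" columns above relate, the pairs can be written down as a lookup and applied to already-decoded strings. This is only a sketch: the helper names are hypothetical, and whether a blanket substitution is appropriate (for example, turning every U+00A5 YEN SIGN into U+FFE5) depends on the application.

```javascript
// [canonic, Microsoft] code point pairs, copied from the table above.
const PAIRS = [
  [0x203E, 0xFFE3], [0x2014, 0x2015], [0x301C, 0xFF5E], [0x2016, 0x2225],
  [0x2212, 0xFF0D], [0x00A5, 0xFFE5], [0x00A2, 0xFFE0], [0x00A3, 0xFFE1],
  [0x00AC, 0xFFE2], [0x00A6, 0xFFE4],
];

const canonicToMs = new Map(PAIRS);
const msToCanonic = new Map(PAIRS.map(([canonic, ms]) => [ms, canonic]));

// Replace every code point that has an entry in `table`; leave the rest alone.
function remap(str, table) {
  return Array.from(str, ch => {
    const cp = ch.codePointAt(0);
    return String.fromCodePoint(table.has(cp) ? table.get(cp) : cp);
  }).join('');
}

// Hypothetical helper names for the two directions.
const toCanonic = str => remap(str, msToCanonic);
const toMicrosoft = str => remap(str, canonicToMs);

console.log(toCanonic('10\uFF5E20') === '10\u301C20'); // true
console.log(toMicrosoft('\u2212') === '\uFF0D');       // true
```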

The following table shows non-injective mappings. That is, across implementations, multiple code points in the legacy character set will be mapped to a single Unicode character.

Canonic	JIS X 0212	IBM/NEC ext.	Unicode
ADE2	8FA2F1	8FF4AC	U+2116
ADE4		8FF4AD	U+2121
ADB5		8FF3FD	U+2160
ADB6		8FF3FE	U+2161
ADB7		8FF4A1	U+2162
ADB8		8FF4A2	U+2163
ADB9		8FF4A3	U+2164
ADBA		8FF4A4	U+2165
ADBB		8FF4A5	U+2166
ADBC		8FF4A6	U+2167
ADBD		8FF4A7	U+2168
ADBE		8FF4A8	U+2169
A2E5		ADF5	U+221A
A2DC		ADF7	U+2220
A2C1		ADFB	U+2229
A2C0		ADFC	U+222A
A2E9		ADF2	U+222B
A2E8		ADFA	U+2235
A2E2		ADF0	U+2252
A2E1		ADF1	U+2261
A2DD		ADF6	U+22A5
ADEA		8FF4AB	U+3231

Note: ADxx, 8FF3xx and 8FF4xx are IBM/NEC extensions.
The current index-jis0208.txt by WHATWG lacks mappings for 8FF3xx and 8FF4xx defined by eucjp-open.
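
A quick way to spot non-injective entries like those in the table above is to group a decode table by its Unicode target. The sketch below uses a handful of rows from the tables in this comment; the row format and the function name are assumptions for illustration, not the layout of any real index file.

```javascript
// A few [EUC-JP bytes, Unicode code point] rows taken from the tables above.
const rows = [
  ['ADE2', 0x2116], ['8FA2F1', 0x2116], ['8FF4AC', 0x2116],
  ['ADE4', 0x2121], ['8FF4AD', 0x2121],
  ['A2E5', 0x221A], ['ADF5', 0x221A],
  ['A1C1', 0x301C], // single mapping, included as a control
];

// Group byte sequences by the Unicode code point they decode to and
// keep only the code points reachable from more than one sequence.
function findNonInjective(entries) {
  const byCodePoint = new Map();
  for (const [bytes, cp] of entries) {
    if (!byCodePoint.has(cp)) byCodePoint.set(cp, []);
    byCodePoint.get(cp).push(bytes);
  }
  return new Map([...byCodePoint].filter(([, list]) => list.length > 1));
}

for (const [cp, list] of findNonInjective(rows)) {
  const label = 'U+' + cp.toString(16).toUpperCase().padStart(4, '0');
  console.log(label + ' <- ' + list.join(', '));
}
```

For each code point reported this way, an encoder has to pick exactly one byte sequence; in the table above that choice is the "Canonic" column.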

ashtuchkin / iconv-lite

U+301C and U+FF5E are not correctly mapped in EUC-JP/Shift_JIS/CP932 #145