Possible Encoding Issues

nathanhammond commented 3 years ago

It's possible that this library is selecting the wrong encoding for some characters. In comparing the output from this library to the content of https://github.com/lshk-org/jyutping-table I've noticed the following discrepancies.

I believe that these issues should be resolved in this library, and that the other output is correct.

The below results are also included in a related issue filed at https://github.com/lshk-org/jyutping-table/issues/5

- From the original export, present in `list-20040907.tsv`
+ From a new export, https://github.com/nathanhammond/parse-jyutping-table-full/blob/master/totsv.js

- 十 U+5341  sap6
+ 〸 U+3038  sap6
- 卄 U+5344  jaa6
- 卄 U+5344  je6
- 卄 U+5344  lim6
- 卄 U+5344  nim6
+ 〹 U+3039  jaa6
+ 〹 U+3039  je6
+ 〹 U+3039  lim6
+ 〹 U+3039  nim6
- 卅 U+5345  saa1 aa6
+ 〺 U+303A  saa1 aa6
- 兀 U+5140  at6
- 兀 U+5140  ngat6
+ 兀 U+FA0C  at6
+ 兀 U+FA0C  ngat6
- 嗀 U+55C0  hok3
+ 嗀 U+FA0D  hok3

Further, there is a weird one:

+ 浧 U+6D67  wun3
- 𤧬 U+249EC wun3

In JPTableFull.pdf, that entry is defined as: { ucs2: "E6C5", jyutping: "wun3" }.

I do believe that U+249EC is the correct value here.

chaklim commented 3 years ago

The purpose of this repo is to convert characters from their old Unicode encoding to the new Unicode encoding, so that they display properly on devices that support the newer Unicode standard.

ref:
〸 https://www.compart.com/en/unicode/U+3038
〹 https://www.compart.com/en/unicode/U+3039
〺 https://www.compart.com/en/unicode/U+303A
兀 https://www.compart.com/en/unicode/U+FA0C
嗀 https://www.compart.com/en/unicode/U+FA0D
𤧬 https://www.compart.com/en/unicode/U+249EC

The output from "+ From a new export" should be the one with the newer Unicode encoding. The previous Unicode encoding for those characters can also be found in the "Decomposition" row of each ref page.

For the weird one, I think I need some time to investigate it.

nathanhammond commented 3 years ago

Okay, I'm throwing content here so that I can try to figure it out later:

var normalizations = ["NFC", "NFD", "NFKC", "NFKD"];
// First code point (hex) of the original string, then of each normalized form.
var toRow = (str) => [str, ...normalizations.map((normalization) => str.normalize(normalization))].map((s) => s.codePointAt(0).toString(16));
var u3038 = toRow("\u3038");
var u5341 = toRow("\u5341");

// Include the ORIG column so the header lines up with the rows.
console.log(["ORIG", ...normalizations].join('\t'));
console.log(u3038.join('\t'));
console.log(u5341.join('\t'));

ORIG    NFC NFD NFKC    NFKD
3038    3038    3038    5341    5341
5341    5341    5341    5341    5341

So, U+5341 is the "compatibility decomposition" of U+3038. You can get from U+3038 to U+5341, but not the other way around. I don't yet know what that means, especially since there is also something known as a "canonical decomposition." That difference is my future reading.
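
For reference, the two kinds of decomposition can be told apart from JavaScript: a canonical decomposition is applied by NFD (and therefore also by NFKD), while a compatibility decomposition is applied only by the "K" forms. A minimal sketch, where `decompositionKind` is a hypothetical helper name and not part of this library:

```javascript
// Classify a single character by which normalization forms change it.
//   "canonical":     changed by NFD
//   "compatibility": unchanged by NFD, but changed by NFKD
//   "none":          unchanged by both
function decompositionKind(ch) {
  if (ch.normalize("NFD") !== ch) return "canonical";
  if (ch.normalize("NFKD") !== ch) return "compatibility";
  return "none";
}

// Code points taken from the table at the top of this issue.
var samples = ["\u3038", "\u5341", "\uFA0C", "\u5140"];
samples.forEach(function (ch) {
  var hex = ch.codePointAt(0).toString(16).toUpperCase();
  console.log("U+" + hex + "\t" + decompositionKind(ch));
});
```

Assuming the runtime's Unicode data agrees with the reference pages linked earlier, U+3038 should report "compatibility" and U+FA0C should report "canonical", which would mean the two groups in the original list are not the same kind of problem.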

nathanhammond commented 3 years ago

Sorry, just realized that my previous comment could be read as passive aggressive. This is actually me trying to figure things out and you're seeing me pause and serialize my state so that I can spend the evening with my family. I'll continue tomorrow.

nathanhammond commented 3 years ago

After further review, I'm pretty sure that U+5341 (and all of the others) should be the selected encoding. Reasoning:

Every keyboard I've tested for entering 十 outputs the U+5341 code point.
For downstream consumers, if a pronunciation is attached to U+3038, NFKC normalization cannot take you from U+5341 to U+3038 to discover it. The inverse, however, does work: U+3038 normalizes to U+5341.

This also seems to match the intent of compatibility decompositions: the decomposition target (here U+5341) serves as the "base" value, without any additional formatting included. See below for concrete examples of the limitations.

Example for generically identifying pronunciation of a string if this were stored at U+5341:

var lookup = {
  "5341": "sap6"
};

function getPronunciation(character) {
  return lookup[character.codePointAt(0).toString(16)];
}

var inputString = "\u3038";
var pronunciations = [...inputString.normalize('NFKC')].map(getPronunciation);
// [ "sap6" ]

Example of failure if stored at U+3038:

var lookup = {
  "3038": "sap6"
};

function getPronunciation(character) {
  return lookup[character.codePointAt(0).toString(16)];
}

var inputString = "\u5341";
var pronunciations = [...inputString.normalize('NFKC')].map(getPronunciation);
// [ undefined ]

The latter solution would require an in-application lookup from U+5341 to U+3038:

var lookup = {
  "3038": "sap6"
};

var indirect = {
  "5341": "3038"
};

function getPronunciation(character) {
  var codePoint = character.codePointAt(0).toString(16);
  return lookup[codePoint] || lookup[indirect[codePoint]];
}

var inputString = "\u5341";
var pronunciations = [...inputString.normalize('NFKC')].map(getPronunciation);
// [ "sap6" ]
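
A possible alternative to the `indirect` table is to apply NFKC to both the stored keys and the input, so the data can be recorded at either code point and still be found. This is only a sketch: `buildLookup` and `getPronunciations` are hypothetical names, and the sample entries are a few rows from the list at the top of this issue.

```javascript
// Build a lookup keyed by the NFKC form of each character, so entries
// recorded under U+3038 and under U+5341 end up on the same key.
function buildLookup(entries) {
  var table = new Map();
  entries.forEach(function (entry) {
    var key = entry[0].normalize("NFKC");
    var list = table.get(key) || [];
    if (list.indexOf(entry[1]) === -1) list.push(entry[1]); // skip repeats
    table.set(key, list);
  });
  return table;
}

// Return one array of pronunciations per character of the input.
function getPronunciations(table, input) {
  return Array.from(input.normalize("NFKC"), function (ch) {
    return table.get(ch) || [];
  });
}

var pronunciationTable = buildLookup([
  ["\u3038", "sap6"],
  ["\u3039", "jaa6"],
  ["\u3039", "je6"],
]);

console.log(getPronunciations(pronunciationTable, "\u5341\u5344"));
console.log(getPronunciations(pronunciationTable, "\u3038"));
```

Note that this folds the two forms together for lookup purposes only; it does not rewrite the user's text, and it stores an array per character because several entries in the list have more than one reading.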

nathanhammond commented 11 months ago

The weird one:

+ 浧 U+6D67  wun3
- 𤧬 U+249EC wun3

I traced it backward and it is happening because of a duplicate key in hkscs1999.tsv: https://github.com/chaklim/hkscs_unicode_converter/blob/9c397624bbefbefd0efe14e8a3215c2c0cb9ad70/hkscs/hkscs1999.tsv#L1720

The table you're using comes from here: https://moztw.org/docs/big5/ https://moztw.org/docs/big5/table/hkscs1999.txt

If you compare it to the values from the original 1999 HKSCS, you can see that 0x9447 shouldn't be mapped; it should be omitted. https://www.ccli.gov.hk/doc/e_hkscs_1999.pdf

Alternatively, it should be mapped to the compat point: https://www.ccli.gov.hk/doc/big5cmp2001.txt

So, the presence of that duplicate value is an error in the source data file.
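
A check along these lines could flag this kind of problem in a mapping table before it is used. It is a sketch only: it assumes a tab-separated file with the key in a known column, which may not match the actual layout of `hkscs1999.tsv`, and the sample rows are placeholders rather than real data. `findDuplicateKeys` is a hypothetical helper.

```javascript
// Return a Map from each key that appears more than once to the
// 1-based line numbers on which it appears.
function findDuplicateKeys(tsvText, keyColumn) {
  var seen = new Map();
  tsvText.split(/\r?\n/).forEach(function (line, index) {
    if (line.trim() === "") return; // skip blank lines
    var key = line.split("\t")[keyColumn];
    if (!seen.has(key)) seen.set(key, []);
    seen.get(key).push(index + 1);
  });

  var duplicates = new Map();
  seen.forEach(function (lineNumbers, key) {
    if (lineNumbers.length > 1) duplicates.set(key, lineNumbers);
  });
  return duplicates;
}

// Placeholder rows (not taken from the real table): KEY_B appears twice.
var sample = [
  "KEY_A\tVALUE_1",
  "KEY_B\tVALUE_2",
  "KEY_C\tVALUE_3",
  "KEY_B\tVALUE_4",
].join("\n");

console.log(findDuplicateKeys(sample, 0));
```

Running the same check against whichever column the converter treats as its key would show whether other rows collide in the same way.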

nathanhammond commented 11 months ago

https://en.wikipedia.org/wiki/Suzhou_numerals

U+3038, U+3039, and U+303A are Suzhou numerals. For running text we should instead prefer U+5341, U+5EFF, and U+5345.

nathanhammond commented 11 months ago

Here are details confirming that this library's mapping of U+5341/U+5344/U+5345 is incorrect.

Big5 Graphical Block (0xA140 to 0xA3BF)

These three were eventually elided from Big5, but help to communicate intent.

A2CC (Suzhou 10)
A2CD (Suzhou 20, "H" presentation)
A2CE (Suzhou 30)

Big5 Frequently used characters (0xA440 to 0xC67E)

These are intended to be used in running text.

A451 (Ideograph 10)
A4CA (Ideograph 30)
A4DC (Ideograph 20, "U" presentation)

Unicode

The graphical variants (Suzhou).

U+3038 (Suzhou 10)
U+3039 (Suzhou 20, "H" presentation)
U+303A (Suzhou 30)

The ideograph variants (for running text).

U+5341 (Ideograph 10)
U+5344 (Ideograph 20, "H" presentation)
U+5345 (Ideograph 30)
U+5EFF (Ideograph 20, "U" presentation)

Unicode code points should not be remapped between these two sections; they're distinct. We should assume that the user has intentionally selected one or the other.

JPTableFull.pdf - Specifies Unicode code points in the ideograph range.

U+5341 (Ideograph 10)
U+5344 (Ideograph 20, "H" presentation)
U+5345 (Ideograph 30)
U+5EFF (Ideograph 20, "U" presentation)

`hkscs_unicode_converter`

Maps inputs of:

U+5341 (Ideograph 10)
U+5344 (Ideograph 20, "H" presentation)
U+5345 (Ideograph 30)

To Suzhou outputs:

U+3038 (Suzhou 10)
U+3039 (Suzhou 20, "H" presentation)
U+303A (Suzhou 30)

This set of facts makes it pretty clear to me that this library should avoid remapping these code points.
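
One way to pin this down would be a regression check that feeds each of these code points through the converter and reports any that come back changed. The sketch below uses a stand-in `convert` function that only mimics the behaviour described above; it is not this library's actual API, and `findRemapped` is a hypothetical helper.

```javascript
// Code points from both sections that should pass through unchanged.
var protectedCodePoints = [0x5341, 0x5344, 0x5345, 0x5eff, 0x3038, 0x3039, 0x303a];

// Stand-in for the behaviour reported in this issue (NOT the real library API).
var reportedRemap = { 0x5341: 0x3038, 0x5344: 0x3039, 0x5345: 0x303a };
function convert(text) {
  return Array.from(text, function (ch) {
    var target = reportedRemap[ch.codePointAt(0)];
    return target === undefined ? ch : String.fromCodePoint(target);
  }).join("");
}

function toHex(codePoint) {
  return "U+" + codePoint.toString(16).toUpperCase();
}

// Return "U+XXXX -> U+YYYY" for every listed code point the converter changes.
function findRemapped(convertFn, codePoints) {
  return codePoints
    .map(function (codePoint) {
      var input = String.fromCodePoint(codePoint);
      var output = convertFn(input);
      return output === input ? null : toHex(codePoint) + " -> " + toHex(output.codePointAt(0));
    })
    .filter(function (item) { return item !== null; });
}

console.log(findRemapped(convert, protectedCodePoints));
```

With the stand-in, the three ideograph code points are reported as remapped to their Suzhou counterparts; a converter that leaves both sections alone would produce an empty array.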

nathanhammond commented 10 months ago

Here is HKCS with explicit recommendations about how certain code points should be transformed: https://www.ccli.gov.hk/doc/HKCS_En_V10.pdf

These recommendations match what I tracked down.
