nathanhammond opened 3 years ago
The purpose of this repo is to convert characters from their old Unicode encoding to the new Unicode encoding, so that the converted characters display properly on devices that support the new Unicode standard.
ref:
〸 https://www.compart.com/en/unicode/U+3038
〹 https://www.compart.com/en/unicode/U+3039
〺 https://www.compart.com/en/unicode/U+303A
兀 https://www.compart.com/en/unicode/U+FA0C
嗀 https://www.compart.com/en/unicode/U+FA0D
𤧬 https://www.compart.com/en/unicode/U+249EC
The output from "+ From a new export" should be the one with the newer Unicode encoding. The previous Unicode encoding for those characters can also be found in the "Decomposition" row of each ref.
For the weird one, I think I need some time to investigate it.
Okay, I'm throwing content here so that I can try to figure it out later:
// Compare how each normalization form treats U+3038 (〸) vs. U+5341 (十).
var normalizations = ["NFC", "NFD", "NFKC", "NFKD"];
var u5341 = normalizations.map((normalization) => "\u5341".normalize(normalization).codePointAt(0).toString(16));
var u3038 = normalizations.map((normalization) => "\u3038".normalize(normalization).codePointAt(0).toString(16));
console.log(normalizations.join('\t'));
console.log(u3038.join('\t'));
console.log(u5341.join('\t'));
ORIG NFC NFD NFKC NFKD
3038 3038 3038 5341 5341
5341 5341 5341 5341 5341
So, U+5341 is a "compatibility decomposition" of U+3038. You can get from U+3038 to U+5341, but not the other way around. I don't yet know what that means. Especially since there is also something known as a "canonical decomposition." My future reading:
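From what I can tell so far, the key difference is that canonical decompositions round-trip while compatibility decompositions don't. A quick contrast using U+00E9 (é, not a character from this repo, just a convenient canonical example):

```javascript
// Canonical decomposition round-trips: NFD splits é (U+00E9) into
// e (U+0065) + combining acute accent (U+0301), and NFC recomposes it.
var nfd = "\u00E9".normalize("NFD"); // "e\u0301" (two code points)
var nfc = nfd.normalize("NFC");      // back to "\u00E9"
console.log([...nfd].length, nfc === "\u00E9"); // 2 true

// Compatibility decomposition is one-way: NFKC maps U+3038 to U+5341,
// but no normalization form maps U+5341 back to U+3038.
console.log("\u3038".normalize("NFKC") === "\u5341"); // true
console.log("\u5341".normalize("NFKD") === "\u5341"); // true
```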
Sorry, just realized that my previous comment could be read as passive aggressive. This is actually me trying to figure things out and you're seeing me pause and serialize my state so that I can spend the evening with my family. I'll continue tomorrow.
After further review, I'm pretty sure that U+5341 (and all of the others) should be the selected encoding. Reasoning:
This also seems to be the intent of compatibility characters: they serve as the "base" value, without any additional formatting included. See below for concrete examples of the limitations.
Example for generically identifying pronunciation of a string if this were stored at U+5341:
var lookup = {
"5341": "sap6"
};
function getPronunciation(character) {
return lookup[character.codePointAt(0).toString(16)];
}
var inputString = "\u3038";
var pronunciations = [...inputString.normalize('NFKC')].map(getPronunciation);
// [ "sap6" ]
Example of failure if stored at U+3038:
var lookup = {
"3038": "sap6"
};
function getPronunciation(character) {
return lookup[character.codePointAt(0).toString(16)];
}
var inputString = "\u5341";
var pronunciations = [...inputString.normalize('NFKC')].map(getPronunciation);
// [ undefined ]
The latter solution would require an in-application lookup from U+5341 to U+3038:
var lookup = {
"3038": "sap6"
};
var indirect = {
"5341": "3038"
};
function getPronunciation(character) {
var codePoint = character.codePointAt(0).toString(16);
return lookup[codePoint] || lookup[indirect[codePoint]];
}
var inputString = "\u5341";
var pronunciations = [...inputString.normalize('NFKC')].map(getPronunciation);
// [ "sap6" ]
The weird one:
+ 浧 U+6D67 wun3
- 𤧬 U+249EC wun3
I traced it backward and it is happening because of a duplicate key in hkscs1999.tsv:
https://github.com/chaklim/hkscs_unicode_converter/blob/9c397624bbefbefd0efe14e8a3215c2c0cb9ad70/hkscs/hkscs1999.tsv#L1720
The table you're using comes from here:
https://moztw.org/docs/big5/
https://moztw.org/docs/big5/table/hkscs1999.txt
If you compare it to the values from the original 1999 HKSCS, you can see that 0x9447 shouldn't be mapped; it should be omitted.
https://www.ccli.gov.hk/doc/e_hkscs_1999.pdf
Alternatively, it should be mapped to the compat point: https://www.ccli.gov.hk/doc/big5cmp2001.txt
So, the presence of that duplicate value is an error in the source data file.
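A small sketch could catch errors like this mechanically. It assumes (without having verified the real file's column layout) that each line of the TSV is `<big5 code>\t<unicode value>`, and flags any repeated first-column key:

```javascript
// Sketch: flag duplicate first-column keys in TSV text.
// Assumed line layout: "<big5 code>\t<unicode value>".
function findDuplicateKeys(tsvText) {
  var seen = new Set();
  var dupes = [];
  for (var line of tsvText.split("\n")) {
    if (!line.trim()) continue;        // skip blank lines
    var key = line.split("\t")[0];     // first column is the Big5 code
    if (seen.has(key)) dupes.push(key);
    else seen.add(key);
  }
  return dupes;
}

// Synthetic example mirroring the 0x9447 case described above:
console.log(findDuplicateKeys("9447\t6D67\n9447\t249EC\nA440\t4E00"));
// [ "9447" ]
```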
https://en.wikipedia.org/wiki/Suzhou_numerals
U+3038, U+3039, and U+303A are Suzhou numerals. For running text we should instead prefer U+5341, U+5EFF, and U+5345.
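If I'm reading the decomposition data right, NFKC already performs exactly this Suzhou-to-ideograph direction (note that U+3039 decomposes to U+5344, the "H" presentation, not to U+5EFF):

```javascript
// NFKC maps each Suzhou numeral to its ideograph compatibility
// decomposition; the reverse direction does not exist.
var suzhou = "\u3038\u3039\u303A"; // 〸〹〺
var running = suzhou.normalize("NFKC");
console.log(
  [...running].map((c) => "U+" + c.codePointAt(0).toString(16).toUpperCase())
);
// [ "U+5341", "U+5344", "U+5345" ]
```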
Here are details confirming that this library's mapping of U+5341/U+5344/U+5345 is incorrect.
These three were eventually elided from Big5, but help to communicate intent:
A2CC (Suzhou 10)
A2CD (Suzhou 20, "H" presentation)
A2CE (Suzhou 30)

These are intended to be used in running text:
A451 (Ideograph 10)
A4CA (Ideograph 30)
A4DC (Ideograph 20, "U" presentation)
The graphical variants (Suzhou):
U+3038 (Suzhou 10)
U+3039 (Suzhou 20, "H" presentation)
U+303A (Suzhou 30)
The ideograph variants (for running text):
U+5341 (Ideograph 10)
U+5344 (Ideograph 20, "H" presentation)
U+5345 (Ideograph 30)
U+5EFF (Ideograph 20, "U" presentation)
Unicode code points should not be remapped between these two sections; they're distinct. We should assume that the user has intentionally selected one or the other.
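A minimal sketch of that policy (the mapping-table shape here is hypothetical, not this library's actual API):

```javascript
// Code points the converter should pass through untouched, since the
// user's choice of Suzhou vs. ideograph form is intentional.
var PRESERVE = new Set([0x3038, 0x3039, 0x303A, 0x5341, 0x5344, 0x5345, 0x5EFF]);

// Remap a code point via a mapping table, except for preserved points.
function convertCodePoint(cp, mapping) {
  if (PRESERVE.has(cp)) return cp;
  return Object.prototype.hasOwnProperty.call(mapping, cp) ? mapping[cp] : cp;
}

// U+5341 stays U+5341 even if a table tries to send it to U+3038:
console.log(convertCodePoint(0x5341, { 0x5341: 0x3038 }).toString(16));
// "5341"
```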
hkscs_unicode_converter maps inputs of:
U+5341 (Ideograph 10)
U+5344 (Ideograph 20, "H" presentation)
U+5345 (Ideograph 30)
To Suzhou outputs:
U+3038 (Suzhou 10)
U+3039 (Suzhou 20, "H" presentation)
U+303A (Suzhou 30)
This set of facts makes it pretty clear to me that this library should avoid remapping between these two sets.
Here is an HKSCS document with explicit recommendations about how certain code points should be transformed: https://www.ccli.gov.hk/doc/HKCS_En_V10.pdf
These recommendations match what I tracked down.
It's possible that this library is selecting the wrong encoding for some characters. In comparing the output from this library to the content of https://github.com/lshk-org/jyutping-table I've noticed the following discrepancies.
I believe that these issues should be resolved in this library, and that the other output is correct.
The below results are also included in a related issue filed at https://github.com/lshk-org/jyutping-table/issues/5
Further, there is a weird one. In JPTableFull.pdf it is defined as:
{ ucs2: "E6C5", jyutping: "wun3" }
I do believe that U+249EC is the correct value here.