Radically / radically

A component-based CJK character search engine
https://radically.bryankok.com
GNU General Public License v2.0

Identify Japanese and Korean unsimplified, canonical characters #5

Open Transfusion opened 3 years ago

Transfusion commented 3 years ago

卫、衛、衞󠄀

Note that in Japan, 衞󠄀 is the 旧字 (kyūjitai, the old character form) of 衛 (!!): https://www.kanjipedia.jp/kanji/0000403800

One cannot simply look these up in the Unihan database, since both forms already exist as variants in the G sources too: https://www.unicode.org/cgi-bin/GetUnihanData.pl?codepoint=U%2B885E
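For checking other characters by hand, a lookup URL of the same shape as the one above can be built from a character's code point. This is only a small sketch: the function name `unihanLookupUrl` is made up for illustration, and the URL pattern is copied from the link above rather than from any documented API.

```typescript
// Build a Unihan lookup URL for the first code point of `char`.
// `unihanLookupUrl` is a hypothetical helper name, not part of the repo.
function unihanLookupUrl(char: string): string {
  const codePoint = char.codePointAt(0);
  if (codePoint === undefined) {
    throw new Error("expected a non-empty string");
  }
  // Unicode notation uses at least four uppercase hex digits, e.g. U+885E.
  let hex = codePoint.toString(16).toUpperCase();
  while (hex.length < 4) {
    hex = "0" + hex;
  }
  // encodeURIComponent turns the "+" in "U+885E" into "%2B".
  return (
    "https://www.unicode.org/cgi-bin/GetUnihanData.pl?codepoint=" +
    encodeURIComponent("U+" + hex)
  );
}

// "\u885E" is 衞; only the first code point is used, so a trailing
// variation selector would be ignored.
console.log(unihanLookupUrl("\u885E"));
```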

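The 「說文解字」 (Shuowen Jiezi) has 眞 (https://dict.variants.moe.edu.tw/variants/rbt/word_attribute.rbt?quote_code=QTAyNzY4), and furthermore goes on to say: 僊人變形而登天也。从𠤎从目从乚。 (roughly: "an immortal transforms and ascends to heaven", followed by the components the character is built from). Korea and Japan consider this variant to be canonically traditional.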

The case of 既 and 即 is strange in Japanese: the unsimplified forms are listed as 既 and 卽 respectively.

Transfusion commented 3 years ago

Unsimplified canonical Japanese variants are mostly available here: https://github.com/cjkvi/cjkvi-variants/blob/e4f1da248c9737a243f9930b5dc497cef5d5ae16/jp-old-style.txt#L64-L69
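A minimal sketch of reading such a variants file into pairs. The line layout is an assumption here (comma-separated `character,relation,character`, with `#` starting a comment line) and should be checked against the linked file before relying on it. The `VariantPair` shape, the `parseVariantLines` name, and the sample relation tag are all made up for illustration.

```typescript
// Hypothetical record shape; the field names are illustrative only.
interface VariantPair {
  from: string;
  relation: string;
  to: string;
}

// Parse text made of `char,relation,char` lines (assumed layout).
// Blank lines, `#` comment lines, and lines with fewer than three
// fields are skipped.
function parseVariantLines(text: string): VariantPair[] {
  const pairs: VariantPair[] = [];
  text.split(/\r?\n/).forEach((rawLine) => {
    const line = rawLine.trim();
    if (line === "" || line.startsWith("#")) {
      return;
    }
    const [from, relation, to] = line.split(",");
    if (from === undefined || relation === undefined || to === undefined) {
      return;
    }
    pairs.push({ from, relation, to });
  });
  return pairs;
}

// Made-up sample input (not copied from the file): 衛 -> 衞 and 即 -> 卽.
const sample = "# sample\n\u885B,jp/old-style,\u885E\n\n\u5373,jp/old-style,\u537D\n";
console.log(parseVariantLines(sample));
```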

Korean variants of the same nature are taken from the 1800 Hanja for Everyday Use.

I treat variants of this nature (along with simplified / traditional Chinese, the numerals, shinjitai in the jōyō kanji, radicals, etc.) as orthographic variants, so that they are grouped together: https://github.com/Transfusion/cjk-radical-search/blob/19d0d1b672d7a652bfcd6cc784dcd43ce7c669e1/etl/variants-fetcher.ts#L109

https://github.com/Transfusion/cjk-radical-search/blob/19d0d1b672d7a652bfcd6cc784dcd43ce7c669e1/etl/variants-fetcher.ts#L241-L260
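To illustrate what "grouped together" can mean in practice, here is a small union-find sketch that merges characters linked by a variant relation into one group. It is not taken from `variants-fetcher.ts` and may differ from what the repo actually does; the class name `VariantGroups` and its methods are hypothetical.

```typescript
// Union-find over characters: any two characters linked directly or
// transitively end up in the same group. Illustrative only.
class VariantGroups {
  private readonly parent = new Map<string, string>();

  // Return the representative of ch's group, registering ch if unseen.
  private find(ch: string): string {
    if (!this.parent.has(ch)) {
      this.parent.set(ch, ch);
    }
    let root = ch;
    let next = this.parent.get(root);
    while (next !== undefined && next !== root) {
      root = next;
      next = this.parent.get(root);
    }
    // Path compression: point every node on the walked path at the root.
    let cur = ch;
    while (cur !== root) {
      const up = this.parent.get(cur);
      this.parent.set(cur, root);
      if (up === undefined) {
        break;
      }
      cur = up;
    }
    return root;
  }

  // Record that a and b are variants of each other.
  link(a: string, b: string): void {
    const rootA = this.find(a);
    const rootB = this.find(b);
    if (rootA !== rootB) {
      this.parent.set(rootA, rootB);
    }
  }

  sameGroup(a: string, b: string): boolean {
    return this.find(a) === this.find(b);
  }

  // List all groups seen so far.
  groups(): string[][] {
    const byRoot = new Map<string, string[]>();
    Array.from(this.parent.keys()).forEach((ch) => {
      const root = this.find(ch);
      const members = byRoot.get(root);
      if (members === undefined) {
        byRoot.set(root, [ch]);
      } else {
        members.push(ch);
      }
    });
    return Array.from(byRoot.values());
  }
}

const variantGroups = new VariantGroups();
variantGroups.link("\u536B", "\u885B"); // 卫 and 衛
variantGroups.link("\u885B", "\u885E"); // 衛 and 衞
variantGroups.link("\u771F", "\u771E"); // 真 and 眞
console.log(variantGroups.groups());
```

Because links are transitive, 卫 and 衞 land in the same group even though they were never linked directly, which is the behaviour wanted when a search for one form should surface the others.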

TODO: investigate the 1800 Korean hanja list and check whether any of them fall outside the commonly used traditional Chinese set. I do not include them when computing orthographic variants, only in the expandVariantIslands function (TBD: discussion of what this does and the design issues faced).

https://github.com/Transfusion/cjk-radical-search/blob/19d0d1b672d7a652bfcd6cc784dcd43ce7c669e1/etl/genVariants.ts#L105-L116
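The check described in the TODO amounts to a set difference. A minimal sketch follows, assuming both lists have already been loaded as arrays of single characters. The function name `notInReferenceSet` and the tiny sample arrays are made up for illustration and are not the real lists.

```typescript
// Return the distinct candidates that do not appear in the reference
// list, preserving first-seen order. Hypothetical helper, not repo code.
function notInReferenceSet(
  candidates: Iterable<string>,
  reference: Iterable<string>
): string[] {
  const referenceSet = new Set<string>(Array.from(reference));
  const distinctCandidates = Array.from(new Set<string>(Array.from(candidates)));
  return distinctCandidates.filter((ch) => !referenceSet.has(ch));
}

// Made-up sample data standing in for the Korean list and the
// traditional Chinese set.
const koreanSample = ["\u771E", "\u885B", "\u771E"]; // 眞, 衛, 眞
const traditionalSample = ["\u885B", "\u771F"]; // 衛, 真
console.log(notInReferenceSet(koreanSample, traditionalSample));
```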