cjkvi / cjkvi-ids

IDS data for CJK Unified Ideographs
http://kanji-database.sourceforge.net/

IDS for 為 #14

Open glandium opened 7 years ago

glandium commented 7 years ago

The Kang Xi radical code for 為 is for 灬, so at the very least, it seems the IDS for it should be:

⿱⑤灬

More generally, there seem to be a few characters with no decomposition that should at least be decomposed into some IDC + some number + their 部首, where both the number and the 部首 can be derived from the kRSKangXi information in the Unihan DB (although in some cases kRSAdobe_Japan1_6 and/or kRSUnicode can have different values, and I'm not sure whether that's the case for the characters that currently have no decomposition).
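As a rough illustration of that derivation, here is a minimal Python sketch. The function name `fallback_ids` and its parameters are hypothetical, the `86.5` value for 為 is inferred from the ⑤ and 灬 above rather than read from Unihan, and the choice of IDC and of which radical form to use (灬 rather than 火) is left to the caller because a radical.strokes value does not determine either.

```python
def fallback_ids(rs_value, radical_form, idc="⿱", radical_first=False):
    """Build a placeholder IDS from a 'radical.strokes' value.

    rs_value      -- e.g. "86.5" (radical number, residual stroke count)
    radical_form  -- the radical glyph to emit; supplied by the caller,
                     since the radical number alone does not pick a variant
    idc           -- ideographic description character to use (assumption)
    radical_first -- whether the radical is the first operand
    """
    _radical_no, residual = rs_value.split(".")
    strokes = int(residual)
    # Circled numbers ① (U+2460) .. ⑳ (U+2473) stand in for the unknown part.
    if not 1 <= strokes <= 20:
        raise ValueError("residual stroke count outside ①..⑳")
    placeholder = chr(0x2460 + strokes - 1)
    if radical_first:
        return idc + radical_form + placeholder
    return idc + placeholder + radical_form


print(fallback_ids("86.5", "灬"))
```

A value with zero residual strokes (the radical itself) is rejected here, since there is nothing left to decompose.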

I guess I could run a script to cross-check the Unihan DB against the characters in the ids.txt file that have the same value in both columns 2 and 3 (i.e. have no decomposition).
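The first half of such a script could look like the sketch below. It assumes a tab-separated layout of code point, character, IDS, and that comment lines start with `#` or `;`; both are assumptions about the file format, and the sample rows are made up for illustration. Rows whose IDS column carries extra tags would need additional handling.

```python
def undecomposed(lines):
    """Yield (codepoint, char) for rows whose IDS column equals the character."""
    for line in lines:
        line = line.rstrip("\n")
        # Skip blank lines and (assumed) comment lines.
        if not line or line.startswith(("#", ";")):
            continue
        fields = line.split("\t")
        if len(fields) < 3:
            continue
        codepoint, char, ids = fields[0], fields[1], fields[2]
        if ids == char:
            yield codepoint, char


# Illustrative rows only, not copied from the repository.
sample = [
    "# comment line",
    "U+672A\t未\t未",
    "U+6CAB\t沫\t⿰氵末",
    "U+70BA\t為\t為",
]
result = list(undecomposed(sample))
print(result)
```

The resulting list could then be joined against the kRSKangXi field of Unihan to see which of these characters have a radical and residual stroke count available.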

glandium commented 7 years ago

A few examples of such characters with no decomposition that I found manually while reviewing the 常用漢字 list:

未末本来東

All of them have 木 as radical.

Another one that I found because it's used in the decomposition of some of the 常用漢字:

禺

Its radical is 禸.

hfhchan commented 7 years ago

The current decomposition strictly follows overlapping rules, so 未末本来東禺 cannot be decomposed.

The IDS provided by cjkvi-ids is mainly for Unicode/IRG's indexing and duplicate detection purposes, so it treats easily recognizable characters as a single unit. In theory, 為 is definitely decomposable into ⑤灬; however, that top part is not used by any other character and is unlikely ever to be, so there is no use in decomposing it that way.

Additional decomposed IDS may be useful for some other purposes, but I guess not as far as this repository is concerned.

glandium commented 7 years ago

Where can I read about those overlapping rules?

kawabata commented 7 years ago

Dear hfhchan, thanks for the comments, which are what I wanted to say. The general rules on the usage of IDS are described in Annex I of ISO/IEC 10646, but it just says "The IDS introduced by this character describes the abstract form of the ideograph with D1 and D2 overlaying each other."

glandium commented 7 years ago

Indeed, there is nothing in ISO-10646 Annex I that says that e.g. ⿻木一 can't be used to describe 未, 末 or 本.

glandium commented 7 years ago

Anyway, I guess hfhchan's comment means I should create a separate IDS database for my own purposes, although I guess my use case kind of overlaps with KanjiVG.

hfhchan commented 7 years ago

@glandium I personally think it's not effective for IRG's purposes, since using ⿻木一 to describe 未, 末 or 本 means that 沫 and 泍 will now be reported as a match (which is most likely a false positive). Anyhow, this repository is currently maintained by Kawabata-san, so he is in the best position to decide which degree of accuracy is preferred.
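To make the false-positive concern concrete, here is a small Python sketch. The tables are illustrative and hand-written, not taken from the repository, and `expand` and `collisions` are hypothetical helper names. It expands each IDS recursively and groups characters whose fully expanded sequences are identical.

```python
from collections import defaultdict


def expand(ids, table):
    """Recursively replace each character by its IDS until nothing changes."""
    parts = []
    for ch in ids:
        sub = table.get(ch, ch)
        # A character mapped to itself (or missing) is treated as atomic.
        parts.append(ch if sub == ch else expand(sub, table))
    return "".join(parts)


def collisions(chars, table):
    """Return groups of characters that share one fully expanded IDS."""
    groups = defaultdict(list)
    for c in chars:
        groups[expand(c, table)].append(c)
    return [g for g in groups.values() if len(g) > 1]


# 末 and 本 kept atomic: 沫 and 泍 stay distinct.
strict = {"沫": "⿰氵末", "泍": "⿰氵本"}
# 末 and 本 both described as ⿻木一: the two expansions coincide.
loose = {**strict, "末": "⿻木一", "本": "⿻木一"}

print(collisions(["沫", "泍"], strict))
print(collisions(["沫", "泍"], loose))
```

With the loose table both characters expand to the same sequence, which is the kind of match a duplicate-detection pass would then have to rule out by hand.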

I do maintain my own set of mappings where I differ from Kawabata-san's judgement, so I have no problem at all with Kawabata-san changing any rules to fit different use cases :)