Open glandium opened 7 years ago
A few examples of such characters with no decomposition that I found manually while reviewing the 常用漢字 list:
未末本来東
All of them have 木 as radical.
Another one that I found because it's used in the decomposition of some of the 常用漢字:
禺
Its radical is 禸
The current decomposition strictly follows overlapping rules, so 未末本来東禺 cannot be decomposed.
The IDS provided by cjkvi-ids is mainly for Unicode/IRG's indexing and duplicate detection purposes, so it treats easily recognizable characters as a single unit. Theoretically, 為 is definitely decomposable to ⑤灬; however, that top part is not used by any other character and is unlikely ever to be, so there is no use in decomposing it that way.
Additional decomposed IDS may be useful for some other purposes, but I guess not as far as this repository is concerned.
Where can I read about those overlapping rules?
Dear hfchan, thanks for the comments, which say what I wanted to say. The general rules on the usage of IDS are described in Appendix I of 10646, but for the overlaid operator it just says "The IDS introduced by this character describes the abstract form of the ideograph with D1 and D2 overlaying each other.".
Indeed, there is nothing in ISO-10646 Annex I that says that e.g. ⿻木一 can't be used to describe 未, 末 or 本.
Anyways, I guess hfhchan's comment means I should create a separate IDS database for my own purpose, although I guess my use case kind of overlaps with KanjiVG.
@glandium I personally think it's not effective for IRG's purposes, since using ⿻木一 to describe 未, 末 or 本 means that 沫 and 泍 will now be reported as a match (which is most likely a false positive). Anyhow, this repository is currently maintained by Kawabata-san, so he is in the best position to decide what degree of accuracy is preferred.
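To make the false-positive concern concrete, here is a minimal sketch of how a duplicate check based on fully expanded IDS strings would behave. The two tables are toy data written for this illustration (they are not copied from cjkvi-ids), and `expand` is a hypothetical helper, not part of any existing tool.

```python
# Toy tables for illustration only; not taken from the cjkvi-ids data.
CURRENT_IDS = {"沫": "⿰氵末", "泍": "⿰氵本"}
# Hypothetical extra decompositions of the kind discussed above.
EXTRA_IDS = {"末": "⿻木一", "本": "⿻木一"}


def expand(text, table):
    """Recursively replace every character that has an IDS in `table`."""
    out = []
    for ch in text:
        ids = table.get(ch)
        if ids is None or ids == ch:
            out.append(ch)  # atomic: no (further) decomposition known
        else:
            out.append(expand(ids, table))
    return "".join(out)


# With 末 and 本 kept atomic, the two characters stay distinct.
atomic_a = expand("沫", CURRENT_IDS)
atomic_b = expand("泍", CURRENT_IDS)

# With ⿻木一 added for both, the expanded strings become identical.
merged = {**CURRENT_IDS, **EXTRA_IDS}
merged_a = expand("沫", merged)
merged_b = expand("泍", merged)

print(atomic_a == atomic_b, merged_a == merged_b)
```

Under these assumptions the first comparison is false and the second is true, which is the spurious match described above: ⿻木一 does not record where the extra stroke sits.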
I do maintain my own set of mappings for cases where I differ from Kawabata-san's judgement, so I have no problem at all with Kawabata-san changing any rules to fit different use cases :)
The Kang Xi radical code for 為 is for 灬, so at the very least, it seems the IDS for it should be:
More generally, there seem to be a few characters with no decomposition that should at least be decomposed into some IDC + some number + their 部首, where both the number and the 部首 can be derived from the kRSKangXi information in the UniHan DB... (although in some cases kRSAdobe_Japan1_6 and/or kRSUnicode can have different values, but I'm not sure whether that's the case for the characters that currently have no decomposition)
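As a rough sketch of that derivation: assuming a radical-stroke value of the form `radical.residual` (for example `75.1`), the circled number and the 部首 could be computed as below. The function names are made up for this example. The radical is taken from the Kangxi Radicals block (U+2F00 onwards) and NFKC-normalized to the unified ideograph, which yields only the main form of each radical: variant forms such as 灬 would need a hand-written table, and the choice of IDC cannot be derived from this data at all.

```python
import unicodedata


def radical_char(number):
    """Main-form character for Kangxi radical `number` (1..214)."""
    if not 1 <= number <= 214:
        raise ValueError(f"not a Kangxi radical number: {number}")
    # U+2F00..U+2FD5 are the 214 Kangxi radicals; NFKC maps each one
    # to the corresponding unified ideograph.
    return unicodedata.normalize("NFKC", chr(0x2F00 + number - 1))


def circled(n):
    """Circled digit used as a stroke-count placeholder (1..20 only)."""
    if not 1 <= n <= 20:
        raise ValueError(f"no circled number for {n}")
    return chr(0x2460 + n - 1)


def placeholder_parts(rs_value):
    """Split e.g. '75.1' into (circled residual strokes, radical)."""
    # kRSUnicode may mark simplified radicals with an apostrophe.
    radical, _, residual = rs_value.replace("'", "").partition(".")
    return circled(int(residual)), radical_char(int(radical))


print(placeholder_parts("75.1"))
print(placeholder_parts("86.5"))
```

For `86.5` this gives 火 rather than 灬, so the result is only a starting point that still has to be reviewed by hand.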
I guess I could run a script to cross-check the UniHan DB vs. the characters in the idx.txt file that have the same thing in both columns 2 and 3 (i.e. have no decomposition).
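Such a cross-check could look roughly like the sketch below. It assumes the IDS file is tab-separated with the code point in column 1, the character in column 2 and the IDS in column 3, and that the Unihan data uses its usual `U+XXXX<TAB>field<TAB>value` line format. The function names and the sample lines are invented for this illustration; real input would be read from the actual files.

```python
def undecomposed(ids_lines):
    """Yield (codepoint, char) where the IDS column equals the character."""
    for line in ids_lines:
        if not line.startswith("U+"):
            continue  # skip headers, comments and blank lines
        cols = line.rstrip("\n").split("\t")
        if len(cols) >= 3 and cols[1] == cols[2]:
            yield cols[0], cols[1]


def unihan_field(unihan_lines, field):
    """Collect {codepoint: value} for one Unihan field."""
    values = {}
    for line in unihan_lines:
        if not line.startswith("U+"):
            continue
        cols = line.rstrip("\n").split("\t")
        if len(cols) >= 3 and cols[1] == field:
            values[cols[0]] = cols[2]
    return values


def cross_check(ids_lines, unihan_lines, field="kRSKangXi"):
    """List undecomposed characters with their radical-stroke value."""
    rs = unihan_field(unihan_lines, field)
    return [(cp, ch, rs.get(cp)) for cp, ch in undecomposed(ids_lines)]


# Made-up sample lines in the assumed formats.
SAMPLE_IDS = ["# comment\n", "U+672A\t未\t未\n", "U+6CAB\t沫\t⿰氵末\n"]
SAMPLE_UNIHAN = ["U+672A\tkRSKangXi\t75.1\n", "U+672A\tkTotalStrokes\t5\n"]

report = cross_check(SAMPLE_IDS, SAMPLE_UNIHAN)
print(report)
```

Passing a different `field` (for example `kRSUnicode`) would show where the radical-stroke sources disagree for the same character.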