KanjiVG / kanjivg

Kanji vector graphics
http://kanjivg.tagaini.net

Possible fixes to the dataset #433

Closed Fulguritude closed 6 months ago

Fulguritude commented 6 months ago

I've been doing some mining on this dataset (thanks for your awesome work btw, this is one of the cleanest datasets I've ever had the pleasure of working with) to try and map kanji / radical dependencies into a DAG. To do this, I only took each kanji and its list of "first order dependencies" (the most composite radicals, i.e. lines with a single tab and "element" in the SVG).

I think I found a couple of things that might interest you ("needle in a haystack" kind of things). If they don't, I'm sorry for wasting your time!
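For readers who want to reproduce this kind of extraction, here is a minimal sketch. It assumes the nested `<g>` groups carry a `kvg:element` attribute in the `http://kanjivg.tagaini.net` namespace, and it works on a hand-written, simplified fragment rather than a real file from the dataset. The names `first_order_components` and `parse_kanji` are hypothetical, and this walks the XML tree instead of counting tabs, so it is not necessarily the method used above.

```python
import xml.etree.ElementTree as ET

KVG_NS = "http://kanjivg.tagaini.net"
SVG_NS = "http://www.w3.org/2000/svg"
ELEMENT_ATTR = f"{{{KVG_NS}}}element"

# Hand-written, simplified fragment for illustration only (not copied from the dataset).
SAMPLE = """<svg xmlns="http://www.w3.org/2000/svg" xmlns:kvg="http://kanjivg.tagaini.net">
  <g id="kvg:0660e" kvg:element="明">
    <g id="kvg:0660e-g1" kvg:element="日"><path d="M0,0"/></g>
    <g id="kvg:0660e-g2" kvg:element="月">
      <g id="kvg:0660e-g3" kvg:element="冂"><path d="M0,0"/></g>
    </g>
  </g>
</svg>"""


def first_order_components(group):
    """Return the labels of the nearest labelled descendants of `group`."""
    found = []
    for child in group:
        label = child.get(ELEMENT_ATTR)
        if label is not None:
            found.append(label)  # stop here: deeper groups are higher-order
        else:
            found.extend(first_order_components(child))  # look through unlabelled wrappers
    return found


def parse_kanji(svg_text):
    """Return (kanji, first-order components) for the first labelled group."""
    root = ET.fromstring(svg_text)
    for group in root.iter(f"{{{SVG_NS}}}g"):
        label = group.get(ELEMENT_ATTR)
        if label is not None:
            return label, first_order_components(group)
    raise ValueError("no group with a kvg:element attribute found")


kanji, components = parse_kanji(SAMPLE)
print(kanji, components)
```

Only the outermost labelled children are collected, so the nested group inside the second component is deliberately ignored.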

1) Suggestions for kanji with non-standard radical forms (not a single Unicode character)

Here are the kanji I've found with some non-standard first-order radicals.

0539f 原 : CDP-8BC4
05c08 專 : CDP-8BD0
05f9e 從 : CDP-8BB0
066b9 暹 : ⿱日隹
09083 邃 : ⿱穴㒸

I suppose this choice might be because there's no exact Unicode representation for these first-order radical combinations (looking at GlyphWiki, this seems to be the case).
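As a rough illustration of how such labels can be picked out, here is a small sketch. It assumes that a "standard" label is exactly one Unicode character, that CDP references always start with `CDP-`, and that composed descriptions start with an ideographic description character (U+2FF0 to U+2FFF). The function name `classify_label` is made up for this example.

```python
def classify_label(label):
    """Classify a component label as 'cdp', 'ids', 'single' or 'other'."""
    if label.startswith("CDP-"):
        return "cdp"  # reference to a CDP code such as CDP-8BC4
    if label and 0x2FF0 <= ord(label[0]) <= 0x2FFF:
        return "ids"  # starts with an ideographic description character such as ⿱
    if len(label) == 1:
        return "single"  # one plain Unicode character
    return "other"


# Labels taken from the list above.
for label in ["CDP-8BC4", "CDP-8BD0", "CDP-8BB0", "⿱日隹", "⿱穴㒸"]:
    print(label, classify_label(label))
```

Anything that does not come back as `single` is a candidate for the kind of manual review described here.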

However, I think the choice of CDP-8BC4 over, say, ⿱白小 lacks consistency. Since the "top-bottom specifier" is more visual, I'd opt for it rather than the CDP code. Same remark for CDP-8BB0 vs. ⿱从龰.

CDP-8BD0 seems a bit more ambiguous. Would the top part be considered a variation on 虫? Maybe 叀 (\u53c0)?

2) Cyclical dependencies

053b6 厶, 0738b 王, and 07adc 竜 each have themselves as a dependency (cycle of length 1).

0620c 戌 and 0620d 戍 together have a cyclical dependency of length 2.

These are the only cycles I've found (using a cycle search of arbitrary length). I'm not sure whether they are intentional.
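A cycle search of this kind can be sketched with Tarjan's strongly connected components algorithm, which is one possible approach and not necessarily the one used for the results above. The graph below contains only the cyclic edges reported in this section plus one made-up acyclic entry; it is not the real dependency graph, and `cyclic_components` is a hypothetical name.

```python
def cyclic_components(graph):
    """Return components that contain a cycle: SCCs of size > 1, or self-loops."""
    index, low = {}, {}
    stack, on_stack = [], set()
    result = []
    counter = 0

    def visit(node):
        # Recursive for brevity; very deep graphs may need an iterative version.
        nonlocal counter
        index[node] = low[node] = counter
        counter += 1
        stack.append(node)
        on_stack.add(node)
        for dep in graph.get(node, ()):
            if dep not in index:
                visit(dep)
                low[node] = min(low[node], low[dep])
            elif dep in on_stack:
                low[node] = min(low[node], index[dep])
        if low[node] == index[node]:  # node is the root of a component
            component = []
            while True:
                member = stack.pop()
                on_stack.discard(member)
                component.append(member)
                if member == node:
                    break
            if len(component) > 1 or node in graph.get(node, ()):
                result.append(sorted(component))

    for node in graph:
        if node not in index:
            visit(node)
    return result


# Cyclic edges as reported above, plus one illustrative acyclic entry.
graph = {
    "厶": ["厶"],
    "王": ["王"],
    "竜": ["竜"],
    "戌": ["戍"],
    "戍": ["戌"],
    "明": ["日", "月"],
}
for component in cyclic_components(graph):
    print(component)
```

Self-loops are checked separately because a single node always forms its own component, whether or not it depends on itself.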

3) List of non-representable characters in standard fonts

This one's just me looking for your advice, not a fix suggestion. I found the following and was wondering if you know of any font capable of handling these?

('05694', ('嚔', '𤴡')), ('05e78', ('幸', '𢆉')), ('05f5c', ('彜', '𪪷')), ('07228', ('爨', '𤍾')), ('08f12', ('輒', '𦔮')), ('091d0', ('釐', '𠩺')), ('09441', ('鑁', '𡕰')), ('095dc', ('關', '𢆶')), ('09cf3', ('鳳', '𩾏')), ('09ece', ('黎', '𥝢')),

('07be6', ('篦', '𣬉')), ('08c94', ('貔', '𣬉')),

('050b7', ('傷', '𬀷')), ('05872', ('塲', '𬀷')), ('0616f', ('慯', '𬀷')), ('06ba4', ('殤', '𬀷')), ('08193', ('膓', '𬀷')), ('089f4', ('觴', '𬀷')),

('27491', ('𧒑', '虫')), ('27491', ('𧒑', '鼠')),
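One plausible reason these fail to render, which is an assumption on my part and not something confirmed here, is that the affected characters sit outside the Basic Multilingual Plane, where font coverage tends to be thinner. The sketch below only reports each character's code point and whether it lies above U+FFFF; the helper names are hypothetical and the sample is a small subset of the entries above.

```python
def is_supplementary(char):
    """True if the character lies outside the Basic Multilingual Plane."""
    return ord(char) > 0xFFFF


def describe(char):
    """Return a (character, code point, plane label) tuple for display."""
    plane = "supplementary" if is_supplementary(char) else "BMP"
    return char, f"U+{ord(char):04X}", plane


# A few (kanji, component) pairs taken from the lists above.
pairs = [("幸", "𢆉"), ("篦", "𣬉"), ("傷", "𬀷"), ("𧒑", "虫")]
for kanji, component in pairs:
    print(describe(kanji), describe(component))
```

A font check would still be needed on top of this, since the plane alone says nothing about what a given font actually contains.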

Thanks for your time, let me know if I can do anything else to help!

benkasminbullock commented 6 months ago

Please split your queries and ask them as separate issues. As it stands, it is extremely difficult to respond to you because you've asked three rather open-ended questions in one issue, and it means that this issue can never be successfully closed and regarded as dealt with.