cjkvi / cjkvi-ids

IDS data for CJK Unified Ideographs
http://kanji-database.sourceforge.net/

Incorporation of Other DB to identify Phono-semantic compound characters #84

Open DonaldTsang opened 5 years ago

DonaldTsang commented 5 years ago

It would be good to use this data to identify the phonetic element of each character: https://github.com/BYVoid/ytenx/blob/master/ytenx/sync/dciangx/DrienghTriang.txt

hfhchan commented 5 years ago

The data in this dataset is primarily shape-based, because the IRG (Ideographic Rapporteur Group) relies on shape-based fuzzy matching. Unfortunately, phono-semantic information is not needed for the mechanical process of identifying possible duplicates.

DonaldTsang commented 5 years ago

@hfhchan It might be useful for cjkvi research if phono-semantic and other relations were drawn out more clearly. Sometimes I want to look up which phonetic derivatives a character has, without also matching characters of unrelated structure.

garfieldnate commented 3 years ago

@DonaldTsang Could you explain the structure of that file a bit? Which field provides the phonetic element?

DonaldTsang commented 3 years ago

The 10th element, the "聲符" (phonetic component), is the phonetic element. The other elements are mostly phonological comparisons of Cantonese and Mandarin. The semantic elements are not in the table; however, they can be inferred by comparing the phonetic element with the other elements of the character in question.
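A minimal sketch of pulling the phonetic element out of such a file. It assumes tab-separated fields with the character in the first field and the 聲符 in the 10th; the delimiter, the character column, and the name `phonetic_map` are assumptions and have not been checked against the actual layout of DrienghTriang.txt. The sample row is a made-up placeholder, apart from the 扒/八 pair.

```python
from typing import Dict, Iterable


def phonetic_map(lines: Iterable[str], sep: str = "\t",
                 char_col: int = 0, phonetic_col: int = 9) -> Dict[str, str]:
    """Map each character to its phonetic element.

    phonetic_col=9 is the 10th field (0-based index). sep and char_col
    are assumptions about the file layout, not confirmed facts.
    """
    result: Dict[str, str] = {}
    for line in lines:
        line = line.rstrip("\n")
        if not line.strip():
            continue  # skip blank lines
        # Splitting on an explicit separator keeps empty fields in place,
        # so blank columns do not shift the column index.
        fields = line.split(sep)
        if len(fields) <= max(char_col, phonetic_col):
            continue  # row too short to contain both fields
        char = fields[char_col].strip()
        phonetic = fields[phonetic_col].strip()
        if char and phonetic:
            result[char] = phonetic
    return result


# Placeholder row: only the first and 10th fields carry real values.
sample_row = "\t".join(["扒", "x", "x", "x", "x", "", "x", "x", "x", "八"])
print(phonetic_map([sample_row]))
```

If the real file uses a different delimiter or column order, only the `sep`, `char_col`, and `phonetic_col` arguments should need to change.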

DonaldTsang commented 3 years ago

Also, regarding alternate character forms (the same character in different forms on the same row):
https://github.com/BYVoid/ytenx/blob/master/ytenx/sync/jihthex/JihThex.csv
https://github.com/BYVoid/ytenx/blob/master/ytenx/sync/jihthex/ThaJihThex.csv
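A rough sketch of how rows of alternate forms could be turned into a lookup table. It assumes each row is a comma-separated list with one form per cell; that layout, the name `variant_groups`, and the two sample rows are illustrative assumptions and are not taken from JihThex.csv or ThaJihThex.csv.

```python
import csv
import io
from typing import Dict, FrozenSet


def variant_groups(csv_text: str) -> Dict[str, FrozenSet[str]]:
    """Map every form to the set of forms sharing its row."""
    lookup: Dict[str, FrozenSet[str]] = {}
    for row in csv.reader(io.StringIO(csv_text)):
        forms = frozenset(cell.strip() for cell in row if cell.strip())
        for form in forms:
            # If a form appears in more than one row, the last row wins.
            lookup[form] = forms
    return lookup


# Illustrative rows only (well-known variant pairs, not file contents).
sample = "為,爲\n群,羣\n"
lookup = variant_groups(sample)
print(sorted(lookup["爲"]))
```

A form that appears in several rows would need the groups merged rather than overwritten; that is left out here to keep the sketch short.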

garfieldnate commented 3 years ago

The first line of the file is for the character 愛, and the phonetic character for it is... 隊? 🤔 I don't see this character inside of 愛. Is it in reduced form somehow? What I was hoping for was a DB that would always tell me the phonetic component of a character, in a way that is recognizable for learning the pronunciation.

DonaldTsang commented 3 years ago

Some of the items from the 6th column are blank, so 曖 and 僾 both map to 愛. 扒 maps to 八.
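The mapping above (曖 and 僾 to 愛, 扒 to 八) can be inverted to list the phonetic derivatives of a given character, which is the lookup asked for earlier in this thread. A small sketch, using only the pairs mentioned here; the name `derivatives_by_phonetic` is hypothetical.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple

# (character, phonetic element) pairs quoted in this thread.
pairs = [("曖", "愛"), ("僾", "愛"), ("扒", "八")]


def derivatives_by_phonetic(
        pairs: Iterable[Tuple[str, str]]) -> Dict[str, List[str]]:
    """Group characters under their phonetic element."""
    index: Dict[str, List[str]] = defaultdict(list)
    for char, phonetic in pairs:
        index[phonetic].append(char)
    return dict(index)


index = derivatives_by_phonetic(pairs)
print(index["愛"])  # ['曖', '僾']
print(index["八"])  # ['扒']
```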

garfieldnate commented 3 years ago

Thanks a bunch! This was super useful.

DonaldTsang commented 3 years ago

@garfieldnate It's all Lasagna.

garfieldnate commented 3 years ago

I haven't heard the term "lasagna" before. What does this mean?

DonaldTsang commented 3 years ago

Well, Garfield, you like it don't you? Consider that you sleep in a Lasagna tin.

BradKML commented 3 years ago

Regarding reconstruction of characters not on the list, there are ways to go about it.

Hunan or Xiang can be cross-checked through Nushu:

Cantonese can be extracted through https://github.com/jyutnet/cantonese-books-data and https://github.com/wordshk/yue_references.

Other Chinese dialects are pooled through https://github.com/laubonghaudoi/Chinese_Rime and https://github.com/edenau/phonological-mapping.

Conversions of text forms are necessary.

BradKML commented 3 years ago

@garfieldnate thanks for doing this, hope this gets passed through in the repo