cjkvi / cjkvi-ids

IDS data for CJK Unified Ideographs
http://kanji-database.sourceforge.net/

missing documentation for several files #99

Open garfieldnate opened 3 years ago

garfieldnate commented 3 years ago

Hi, thanks for maintaining this project! It's quite useful.

The readme does not document what the following files contain:

benkasminbullock commented 3 years ago

The readme does not document what the following files contain:

  • hanyo-ids.txt

I don't know what this is but it seems to be related to this:

http://en.glyphwiki.org/wiki/Group:%E6%B1%8E%E7%94%A8%E9%9B%BB%E5%AD%90-40

  • ids-analysis.txt

It gives the origin of the characters or something, I'm not sure why it is in that order.

  • ids-ext-cde.txt

I think this should be ids-ext-cdef.txt. It seems to be used in the construction of ids.txt:

$ grep 2A708 *
ids-cdp.txt:U+2A708 𪜈   ⿰丨⿱一七   ⿰丨乇
ids-ext-cdef.txt:U+2A708    𪜈   ⿰丨⿱一七   ⿰丨乇
ids.txt:U+2A708 𪜈   ⿰丨⿱一七   ⿰丨乇
ucs-strokes.txt:U+2A708 𪜈   4

  • ws2015-*
  • ucs-strokes.txt (though this one is easy to understand)
  • waseikanji-ids.txt (also easy to understand but only if you speak Japanese)

I guess with enough puzzling you could work out what they all do. It looks more like some of the computer programs which reformat these files are missing, rather than the documentation.
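The grep output above suggests a simple line layout: a code point, the character itself, then one or more alternative descriptions. Here is a minimal sketch of reading such a line. It assumes fields are separated by whitespace and that comment lines start with "#"; neither is confirmed by the readme. `IdsEntry` and `parse_ids_line` are hypothetical names for illustration, not part of this project.

```python
from __future__ import annotations

from typing import NamedTuple


class IdsEntry(NamedTuple):
    codepoint: int        # numeric value of the "U+XXXX" field
    char: str             # the character itself
    sequences: list[str]  # one or more alternative descriptions


def parse_ids_line(line: str) -> IdsEntry | None:
    """Parse one line; return None for blank, comment, or malformed lines."""
    line = line.strip()
    # Assumption: comment lines start with "#".
    if not line or line.startswith("#"):
        return None
    # Assumption: fields are separated by tabs or spaces.
    fields = line.split()
    if len(fields) < 3 or not fields[0].startswith("U+"):
        return None
    return IdsEntry(int(fields[0][2:], 16), fields[1], fields[2:])


entry = parse_ids_line("U+2A708\t𪜈\t⿰丨⿱一七\t⿰丨乇")
print(entry)
```

Any annotations attached to a description without intervening whitespace would stay part of that field, so real files may need extra handling.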

lxs602 commented 2 years ago

Hanyo = Hanyo Denshi

From the auto-translation below (from http://kanji-database.sourceforge.net/ids/ids.html), these appear to be characters and variants of non-Chinese (i.e. Japanese) origin:

Japanese non-Chinese character IDS data Japanese national character external character IDS data is the national character described in "General-purpose electronic information exchange environment maintenance program, character correspondence working committee material" dictionary non-published character "material (March 2008 National Institute for Japanese Language and Linguistics)". It is made into UCS / IDS, examples are quoted, and it is made into XML and organized.

The first column contains Hanyo codes, which correspond to the Unicode code points in the first column of the second link below.

Looking the characters up by their Unicode code points shows that they do appear to be less common or Japanese variants.

See also:

https://www.unicode.org/ivd/hanyo-denshi/
https://www.unicode.org/ivd/hanyo-denshi/Hanyo-Denshi_20120302_13045.txt

lxs602 commented 2 years ago

CDP = Chinese Document Processing (CDP) database developed by C.C. Hsieh and his team at Academia Sinica in Taipei, Taiwan.

CHISE-IDS and CDP appear to be similar projects, though CDP has Chinese characters only, while CHISE also has other scripts, e.g. Japanese.

In the breakdown of a character, the CDP file appears to show where partial characters (components) from the CDP project can be used instead.

See: https://www.freedesktop.org/wiki/Software/CJKUnifonts/Resources/Tutorial/

You can look up CDP characters at https://glyphwiki.org , e.g. cdp-8b6c - https://glyphwiki.org/wiki/cdp-8b6c

See also:
CDP Homepage: https://cdp.sinica.edu.tw/
Mirror of CDP: https://github.com/caasi/cdphanzi

lxs602 commented 2 years ago

ids-ext-cdef

As the Unicode character database has expanded, less common characters have been added in stages, as extension blocks.

The ids file has breakdowns of all (?) the characters in the database. ids-ext-cdef has these only for the additional characters found in Extension blocks C, D, E and F.

See also:
https://en.wikipedia.org/wiki/CJK_Unified_Ideographs_Extension_C
https://en.wikipedia.org/wiki/CJK_Unified_Ideographs_Extension_D
https://en.wikipedia.org/wiki/CJK_Unified_Ideographs_Extension_E
https://en.wikipedia.org/wiki/CJK_Unified_Ideographs_Extension_F
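To check which extension block a code point falls into, a small lookup like the following may help. The ranges are the Unicode block boundaries for Extensions C to F. `EXTENSION_BLOCKS` and `extension_of` are hypothetical names for this sketch and not part of this project.

```python
from __future__ import annotations

# Unicode block ranges (inclusive) for CJK Unified Ideographs Extensions C-F.
EXTENSION_BLOCKS = {
    "C": (0x2A700, 0x2B73F),
    "D": (0x2B740, 0x2B81F),
    "E": (0x2B820, 0x2CEAF),
    "F": (0x2CEB0, 0x2EBEF),
}


def extension_of(codepoint: int) -> str | None:
    """Return "C", "D", "E" or "F", or None if outside those four blocks."""
    for name, (first, last) in EXTENSION_BLOCKS.items():
        if first <= codepoint <= last:
            return name
    return None


# U+2A708 is the code point from the grep example earlier in this thread.
print(extension_of(0x2A708))
```

Not every code point inside a block range is assigned, so this only tells you which block a value belongs to, not whether a character exists there.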

lxs602 commented 2 years ago

ws2015-ids.txt

The Ideographic Description Sequences (IDS) for Wasei kango, which are "Japanese-made Chinese words".

GHZR codes correspond to dictionary codes from the GHZR dictionary (Google translation):

GHZR Chinese Dictionary Editorial Committee: "Chinese Dictionary (Second Edition)", Wuhan: Hubei Changjiang Publishing Group Chongwen Bureau & Chengdu: Sichuan Publishing Group Sichuan Lexicographic Publishing House, 2010, ISBN 978-7-5403-1744-7

GHZR 汉语大字典编辑委员会:《汉语大字典(第二版)》, 武汉: 湖北长江出版集团崇文书局 & 成都 : 四川出版集团四川辞书出版社 , 2010, ISBN 978-7-5403-1744-7

From: https://www.unicode.org/reports/tr38/tr38-31.html

lxs602 commented 2 years ago

ids.txt = Ideographic Description Sequences

...virtually all CJK ideographs can be broken down into smaller pieces that are themselves ideographs.

...Ideographic descriptions are... akin to the English phrase “an ‘e’ with an acute accent on it”

They were originally intended, as part of the Unicode project, to describe the very many characters that had not yet been encoded. As the official documentation notes, they are also useful for learning.

"There is no canonical description of unencoded ideographs"

They are subjective.

The 12 characters below are called Ideographic Description Characters:
⿰ ⿱ ⿲ ⿳ ⿴ ⿵ ⿶ ⿷ ⿸ ⿹ ⿺ ⿻

From: https://www.unicode.org/versions/Unicode9.0.0/ch18.pdf#page=23

See also: https://en.wikipedia.org/wiki/Ideographic_Description_Characters_(Unicode_block)
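Because each Ideographic Description Character acts as a prefix operator, a description can be read into a tree with a short recursive function. The sketch below covers only the 12 characters listed above (U+2FF0 to U+2FFB), of which ⿲ and ⿳ take three operands and the rest take two. It assumes every component is a single character, so entity-style references, if the files use them, would need extra handling. `parse_ids` is a hypothetical helper, not part of this project.

```python
# Number of operands for each of the 12 description characters:
# U+2FF2 (⿲) and U+2FF3 (⿳) take three, all others take two.
IDC_ARITY = {
    chr(cp): (3 if cp in (0x2FF2, 0x2FF3) else 2)
    for cp in range(0x2FF0, 0x2FFC)
}


def parse_ids(ids: str):
    """Return a nested tuple (operator, operand, ...) or a bare component."""

    def parse_at(pos: int):
        if pos >= len(ids):
            raise ValueError("description ended before all operands were read")
        ch = ids[pos]
        pos += 1
        if ch not in IDC_ARITY:
            return ch, pos  # a plain component character
        operands = []
        for _ in range(IDC_ARITY[ch]):
            node, pos = parse_at(pos)
            operands.append(node)
        return (ch, *operands), pos

    tree, end = parse_at(0)
    if end != len(ids):
        raise ValueError(f"unexpected trailing text: {ids[end:]!r}")
    return tree


# One of the descriptions of U+2A708 from the grep example in this thread.
print(parse_ids("⿰丨⿱一七"))
```

The result for that example is a left-right split whose right half is itself a top-bottom split, which mirrors how the description is written.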

lxs602 commented 2 years ago

ids-analysis.txt

The first column is the character's Unicode code point, and the second column is the character itself.

The third column gives the ideographic description, same as for ids.txt, or else, a semantic variant of the character. Some characters have both, on separate lines, e.g.:

U+35C9    ⿱㓞各    各聲    1370031
U+35C9    ←㓵    籒文    1370031

The fourth column gives the 六書 ('six writings') etymology (phonosemantic, ideograph, etc.) or other information such as parts of speech (e.g. noun, adverb), written in Japanese.

The numbers in the fifth column correspond to the Shuowen Jiezi, an ancient dictionary, as indexed in the "Daxu edition by Zhonghua Book Company".

These numbers correspond to characters in this file: https://github.com/cjkvi/cjkvi-data/blob/master/swfont.txt

See: http://kanji-database.sourceforge.net/fonts/swfont/index.html (using Google Translate or a similar service is recommended)

lxs602 commented 2 years ago

waseikanji-ids.txt = "Japanese-made kanji dictionary / IDS data"

The numbers in this file correspond to Japanese kanji characters not yet encoded. You can look them up by changing the number at the end of the link below, e.g. 2616: https://glyphwiki.org/wiki/waseikanji-no-jiten-2616
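As a small convenience, the lookup link can be built from an entry number. This sketch assumes the number is appended exactly as written, with no zero padding; only the 2616 example above is confirmed. `waseikanji_glyphwiki_url` is a hypothetical helper name.

```python
def waseikanji_glyphwiki_url(number: int) -> str:
    """Build the GlyphWiki page URL for an entry number from waseikanji-ids.txt."""
    # Assumption: the number goes at the end of the URL without zero padding.
    return f"https://glyphwiki.org/wiki/waseikanji-no-jiten-{number}"


print(waseikanji_glyphwiki_url(2616))
```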

See: http://kanji-database.sourceforge.net/ids/waseikanji.html

lxs602 commented 2 years ago

General documentation for this project: http://kanji-database.sourceforge.net/

The Japanese version of the page (linked in the top right-hand corner) is more complete than the English version, so using an auto-translation website is recommended.

lxs602 commented 2 years ago

Note for the maintainer of this project: You should not assume that people unfamiliar with the project will understand it without a more comprehensive explanation.

Documentation should describe all the parts of the project in simple language.