UniversalDependencies / UD_Classical_Chinese-Kyoto

a small suggestion on data split #8

Closed YuhuYang closed 1 year ago

YuhuYang commented 1 year ago

Ancient Chinese is not internally homogeneous, because we label every non-modern variety of Chinese as ancient Chinese. Although the long separation between spoken and written language in Chinese history has allowed the written language to preserve its early state to some extent, the changes over time cannot be ignored. So it seems more appropriate to divide the data set into several subcategories by era, such as pre-Qin, Qin-Han, Wei-Jin, Tang-Song, Yuan, Ming, and Qing.

KoichiYasuoka commented 1 year ago

Thank you @YuhuYang for the suggestion, but in fact we could not determine when the sentences were written. We started UD_Classical_Chinese-Kyoto with 四書 (孟子, 論語, 大學, and 中庸), whose contents should have been written in the 秦 dynasty, but when we scrutinized their sentences we found that they were partially rewritten in the 唐 dynasty. Furthermore, they often include older (pre-秦?) quotes introduced by "詩曰" and so on. So, @YuhuYang, how would we split the sentences into the subcategories?

YuhuYang commented 1 year ago

Thanks for your reply! What you describe is really a big problem, and maybe era is not a good criterion. Could the data be split by genre instead? For example: poems (楚辞, 诗经, 唐诗三百首), sutras (佛经), classics (孟子, 論語, 大學, 中庸), and history (十八史略). This strategy might help reduce the heterogeneity. Alternatively, you could provide an additional branch organized by book, so that people can conveniently choose the part they want. Just a suggestion. Thanks again!

KoichiYasuoka commented 1 year ago

Well @YuhuYang, I'm not sure whether your criteria would work well, but newdoc id and sent_id would help you. For example, the first lines of lzh_kyoto-ud-test.conllu:

# newdoc id = KR1h0004_001
# sent_id = KR1h0004_001_title
# text = 學而篇第一
1   學   學   VERB    v,動詞,行為,動作  _   3   acl _   Gloss=study|SpaceAfter=No
2   而   而   CCONJ   p,助詞,接続,並列  _   1   orphan  _   Gloss=and|SpaceAfter=No
3   篇   篇   NOUN    n,名詞,可搬,伝達  _   0   root    _   Gloss=section-of-a-book|SpaceAfter=No
4   第   第   NOUN    n,名詞,数量,*   _   3   list    _   Gloss=order-in-a-sequence|SpaceAfter=No
5   一   一   NUM n,数詞,数字,*   _   4   nummod  _   Gloss=one|SpacesAfter=\n

indicate that they are taken from https://www.kanripo.org/text/KR1h0004/001 and you can find that KR1h0004 is 論語 in 經部-四書類 in Kanripo.
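As a rough illustration of the point above, here is a minimal Python sketch that groups sentences by source text using only the `# newdoc id` and `# sent_id` comment lines. It assumes that every newdoc id has the form `<Kanripo text ID>_<section>` (as in `KR1h0004_001`) and that the Kanripo URL pattern seen in the example holds for other texts too. The function names `group_sentences_by_text` and `kanripo_url` are hypothetical helpers, not part of the treebank or of any library.

```python
import re
from collections import defaultdict

# Assumption: comment lines look exactly like the ones quoted in this thread.
NEWDOC = re.compile(r"^# newdoc id = (\S+)$")
SENT_ID = re.compile(r"^# sent_id = (\S+)$")


def group_sentences_by_text(lines):
    """Map each text ID (e.g. 'KR1h0004') to the sent_ids that appear under it."""
    groups = defaultdict(list)
    text_id = None
    for raw in lines:
        line = raw.rstrip("\n")
        doc = NEWDOC.match(line)
        if doc:
            # Assumption: the part before the first underscore is the Kanripo text ID.
            text_id = doc.group(1).split("_")[0]
            continue
        sent = SENT_ID.match(line)
        if sent and text_id is not None:
            groups[text_id].append(sent.group(1))
    return dict(groups)


def kanripo_url(newdoc_id):
    """Build a Kanripo URL from a newdoc id, following the pattern in the example."""
    text_id, section = newdoc_id.split("_", 1)
    return f"https://www.kanripo.org/text/{text_id}/{section}"


# The comment lines quoted above, plus one token line that the grouping ignores.
sample = [
    "# newdoc id = KR1h0004_001",
    "# sent_id = KR1h0004_001_title",
    "# text = 學而篇第一",
    "1\t學\t學\tVERB\tv,動詞,行為,動作\t_\t3\tacl\t_\tGloss=study|SpaceAfter=No",
]

print(group_sentences_by_text(sample))
print(kanripo_url("KR1h0004_001"))
```

To run it on a whole file, pass an open file handle instead of `sample`, for example `open("lzh_kyoto-ud-test.conllu", encoding="utf-8")`. Grouping by text ID leaves the choice of era or genre labels to the user, who can look up each ID in Kanripo.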

YuhuYang commented 1 year ago

Thanks!

In reply to Koichi Yasuoka's comment of 2023-09-11 09:29:26.