Closed by YuhuYang 1 year ago
Thank you @YuhuYang for the suggestion, but in fact we cannot determine when the sentences were written. We started UD_Classical_Chinese-Kyoto with 四書 (孟子, 論語, 大學, and 中庸), whose contents should have been written in the 秦 dynasty, but when we scrutinized their sentences we found that they had been partially rewritten in the 唐 dynasty. Furthermore, they often include older (pre-秦?) quotes introduced with "詩曰" and so on. So, @YuhuYang, how would we split the sentences into the subcategories?
Thanks for your reply! What you describe is really a big problem, and maybe the era is not a good criterion. Could the data instead be split by genre? For example: poems (楚辞, 诗经, 唐诗三百首), sutras (佛经), classics (孟子, 論語, 大學, 中庸), and history (十八史略). This strategy might help reduce the heterogeneity. Alternatively, an additional branch could be provided that is organized by book, so that people can conveniently choose the parts they want. These are just suggestions. Thanks again!
Well @YuhuYang, I'm not sure whether your criteria would work well, but newdoc id and sent_id would help you. For example, the first lines of lzh_kyoto-ud-test.conllu:
# newdoc id = KR1h0004_001
# sent_id = KR1h0004_001_title
# text = 學而篇第一
1 學 學 VERB v,動詞,行為,動作 _ 3 acl _ Gloss=study|SpaceAfter=No
2 而 而 CCONJ p,助詞,接続,並列 _ 1 orphan _ Gloss=and|SpaceAfter=No
3 篇 篇 NOUN n,名詞,可搬,伝達 _ 0 root _ Gloss=section-of-a-book|SpaceAfter=No
4 第 第 NOUN n,名詞,数量,* _ 3 list _ Gloss=order-in-a-sequence|SpaceAfter=No
5 一 一 NUM n,数詞,数字,* _ 4 nummod _ Gloss=one|SpacesAfter=\n
indicate that they are taken from https://www.kanripo.org/text/KR1h0004/001, and you can find that KR1h0004 is 論語 in 經部-四書類 in Kanripo.
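As a rough illustration of how newdoc id and sent_id could be used to split the treebank by source text, here is a minimal Python sketch. It assumes that every newdoc id has the form `<Kanripo text ID>_<section>`, as in the example above; the function names `group_sent_ids_by_text` and `kanripo_url` are hypothetical helpers, not part of any UD tool.

```python
import re
from collections import defaultdict

# Match CoNLL-U comment lines such as "# newdoc id = KR1h0004_001".
NEWDOC_RE = re.compile(r"^#\s*newdoc id\s*=\s*(\S+)")
SENT_ID_RE = re.compile(r"^#\s*sent_id\s*=\s*(\S+)")


def group_sent_ids_by_text(lines):
    """Group sent_id values by the Kanripo text ID of their document."""
    groups = defaultdict(list)
    text_id = None
    for line in lines:
        line = line.strip()
        match = NEWDOC_RE.match(line)
        if match:
            # Assumed naming scheme: "KR1h0004_001" -> text ID "KR1h0004".
            text_id = match.group(1).split("_", 1)[0]
            continue
        match = SENT_ID_RE.match(line)
        if match and text_id is not None:
            groups[text_id].append(match.group(1))
    return dict(groups)


def kanripo_url(newdoc_id):
    """Build the Kanripo URL for a newdoc id, following the pattern above."""
    text_id, _, section = newdoc_id.partition("_")
    return f"https://www.kanripo.org/text/{text_id}/{section}"


sample = [
    "# newdoc id = KR1h0004_001",
    "# sent_id = KR1h0004_001_title",
    "# text = 學而篇第一",
]
groups = group_sent_ids_by_text(sample)
url = kanripo_url("KR1h0004_001")
```

To process the real file, the lines of lzh_kyoto-ud-test.conllu can be passed in place of `sample`. A mapping from text IDs to books or genres (for example, KR1h0004 to 論語) would still have to be compiled by hand from Kanripo.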
Thanks!
Ancient Chinese is not internally homogeneous. This is because everything that is not modern Chinese gets called ancient Chinese. Although spoken and written language were long separated in Chinese history, which allowed the written language to preserve its original state to a certain extent, the changes cannot be ignored. So it seems more appropriate to divide the dataset into several subcategories by era, such as pre-Qin, Qin-Han, Wei-Jin, Tang-Song, Yuan, Ming, and Qing dynasties.