Kyubyong / g2pC

g2pC: A Context-aware Grapheme-to-Phoneme Conversion module for Chinese
Apache License 2.0

Wrong pinyin tone #1

Open begeekmyfriend opened 5 years ago

begeekmyfriend commented 5 years ago
>>> g2p('一心一意')
[('一心一意', 'i', 'yi1 xin1 yi1 yi4', "/concentrating one's thoughts and efforts/single-minded/bent on/intently/", '一心一意')]

Should be 'yi4 xin1 yi2 yi4'. See

Kyubyong commented 5 years ago

Thanks for this, but I'm not sure this is WRONG. Regular tone changes in Chinese are not written, according to the linked reference. Instead, I think it's better to distinguish the two: original vs. rule-applied.

Jackiexiao commented 5 years ago

cedict.txt only has yi1 for 一 and only bu4 for 不. If you want to distinguish the readings, you need data.

Kyubyong commented 5 years ago

I think some simple rules can help. I'm working on them. I'll be back in hours.

Kyubyong commented 5 years ago

@begeekmyfriend I've added a second pronunciation with the tone change rules applied. Please upgrade the library, check it, and let me know if it is okay. Thanks for pointing this out.

>>> g2p("一心一意")
'yi1 xin1 yi1 yi4', # this is the original pronunciation
'yi4 xin1 yi2 yi4',   # this is the descriptive pronunciation
"/concentrating one's thoughts and efforts/single-minded/bent on/intently/",

Jackiexiao commented 5 years ago

一, 不, and the third-tone change (3 3 to 2 3) actually need to be predicted.

For example: 有一次 -> you2 yi2 ci4, but 第一次 -> di4 yi1 ci4; the pronunciation of 一 depends on the semantics.

begeekmyfriend commented 5 years ago

有一次 should be segmented as 有 + 一次 (not 有一, because that is not a word in Chinese), and 第一次 should be segmented as 第一 + 次.

Jackiexiao commented 5 years ago

More examples:

Kyubyong commented 5 years ago

@Jackiexiao Can you clarify what you mean? It's confusing. The current results for the strings above are:

有一次
original: you3 yi1 ci4
descriptive (tone changed): you3 yi2 ci4

第一次。
original: di4 yi1 ci4 。
descriptive (tone changed): di4 yi2 ci4 。

十一二岁来到戏校
original: shi2 yi1 er4 sui4 lai2 dao4 xi4 xiao4
descriptive (tone changed): shi2 yi2 er4 sui4 lai2 dao4 xi4 xiao4

同年十一月
original: tong2 nian2 shi2 yi1 yue4
descriptive (tone changed): tong2 nian2 shi2 yi2 yue4

一九八二年英文版
original: yi1 jiu3 ba1 er4 nian2 ying1 wen2 ban3
descriptive (tone changed): yi4 jiu3 ba1 er4 nian2 ying1 wen2 ban3

欧洲统一步伐
original: ou1 zhou1 tong3 yi1 bu4 fa2
descriptive (tone changed): ou1 zhou1 tong3 yi2 bu4 fa2

吉林省一号工程
original: ji2 lin2 sheng3 yi1 hao4 gong1 cheng2
descriptive (tone changed): ji2 lin2 sheng3 yi2 hao4 gong1 cheng2

一是选拔优秀干部
original: yi1 shi4 xuan3 ba2 you1 xiu4 gan4 bu4
descriptive (tone changed): yi2 shi4 xuan3 ba2 you1 xiu4 gan4 bu4

Which parts are incorrect?
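
To make it easier to point at the exact syllables in question, here is a minimal sketch that diffs an original reading against a descriptive one. The helper name `changed_syllables` is made up for illustration and is not part of g2pC.

```python
def changed_syllables(original, descriptive):
    """Return (index, original, descriptive) for every syllable that differs."""
    orig = original.split()
    desc = descriptive.split()
    if len(orig) != len(desc):
        raise ValueError("both readings must have the same number of syllables")
    return [(i, o, d) for i, (o, d) in enumerate(zip(orig, desc)) if o != d]


# 第一次, using the two readings quoted above
print(changed_syllables("di4 yi1 ci4", "di4 yi2 ci4"))  # [(1, 'yi1', 'yi2')]
```

Running it over each pair shows that every difference in the list is on a 一 syllable.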

begeekmyfriend commented 5 years ago

Well, it is really confusing when you first learn the Chinese rules for 一, 不, and the double 3rd tone. Let me show you the rough rule for 一 based on the sentences above.

有一次: 一次 means "one time", which is a regular word in Chinese, while 有一 is not a segmented word. Therefore the tone of 一 depends on the following character 次, and we read it as yi2 ci4.

第一次 means "the first time", where 第一 is a segmented word in Chinese. So we ignore the 次 behind it and read it as di4 yi1 ci4.

十一二岁: here 十一 can be segmented as a word. So we ignore the following 二 and read it as shi2 yi1 er4.

同年十一月: here 十一 can be segmented as a word, so we read it as shi2 yi1 yue4.

一九八二年英文版: here 一 can be regarded as a single digit, parallel with 九, 八, and 二. So we read it as yi1 jiu3 ba1 er4.

欧洲统一步伐: 统一 is a separate word from 欧洲 and 步伐, and no character follows 一 inside that word, so we read it as tong3 yi1.

吉林省一号工程: 一号 is separate from 吉林省 and 工程, and 一号 is not a regular word like "one day" or "one time". It only means "number one", so we read it as yi1 hao4.

一是选拔优秀干部: here 一 is a single number word and 一是 is not a segmented word. So we read it as yi1.
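
The "ignore what follows when 一 ends a segmented word" idea can be sketched roughly as below. This is only an illustration: the segmentations are written by hand (a real segmenter may cut differently), and `yi_ends_word` is a made-up helper name.

```python
def yi_ends_word(words):
    """For each 一 in a segmented sentence, report whether it is the last
    character of its word. True means: keep yi1 and ignore the next word."""
    flags = []
    for word in words:
        for i, ch in enumerate(word):
            if ch == "一":
                flags.append(i == len(word) - 1)
    return flags


print(yi_ends_word(["有", "一次"]))          # [False] -> tone depends on 次
print(yi_ends_word(["第一", "次"]))          # [True]  -> stays yi1
print(yi_ends_word(["欧洲", "统一", "步伐"]))  # [True]  -> stays yi1
```

This check alone does not cover 一号 or 一九八二, where 一 is word-initial but still read yi1 because it is an ordinal or a digit in a number string.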

Kyubyong commented 5 years ago

According to

For 一 yī:

   1. 一 is pronounced with second tone when followed by a fourth tone syllable.

        Example: 一定 (yī+dìng, "must") becomes yídìng [i˧˥tiŋ˥˩]

   2. Before a first, second or third tone syllable, 一 is pronounced with fourth tone.

        Examples: 一天 (yī+tiān, "one day") becomes yìtiān [i˥˩tʰjɛn˥], 一年 (yī+nián, "one year") becomes yìnián [i˥˩njɛn˧˥], 一起 (yī+qǐ, "together") becomes yìqǐ [i˥˩t͡ɕʰi˨˩˦].

   3. When final, or when it comes at the end of a multi-syllable word (regardless of the first tone of the next word), 一 is pronounced with first tone. It also has first tone when used as an ordinal number (or part of one), and when it is immediately followed by any digit (including another 一; hence both syllables of the word 一一 yīyī and its compounds have first tone).

   4. When 一 is used between two reduplicated words, it may become neutral in tone (e.g. 看一看 kànyikàn ("to take a look at")).

So are rules 1 and 2 applied only word-internally? In other words, when 一 is followed by a fourth-tone character that belongs to a separate word, is 一 read as first tone, not second tone?
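
If the answer is yes, rules 1 to 3 could be sketched for a single word as follows. This is a rough sketch under that assumption, not the library's actual code: `apply_yi_rules` and `DIGITS` are illustrative names, the input is assumed to have one numbered-pinyin syllable per character, and rule 4 (neutral tone in reduplication) is left out.

```python
DIGITS = set("零〇一二三四五六七八九十")


def apply_yi_rules(word, syllables):
    """Apply rules 1-3 for 一 inside one word (one syllable per character)."""
    out = list(syllables)
    for i, ch in enumerate(word):
        if ch != "一" or syllables[i] != "yi1":
            continue
        # Rule 3: word-final, part of an ordinal (after 第), or followed by a digit.
        if i == len(word) - 1 or (i > 0 and word[i - 1] == "第") or word[i + 1] in DIGITS:
            continue
        next_tone = syllables[i + 1][-1]
        if next_tone == "4":                # Rule 1
            out[i] = "yi2"
        elif next_tone in ("1", "2", "3"):  # Rule 2
            out[i] = "yi4"
    return out


print(apply_yi_rules("一心一意", "yi1 xin1 yi1 yi4".split()))
# ['yi4', 'xin1', 'yi2', 'yi4']
```

Because the function only ever looks at the next syllable inside the same word, a word-final 一 such as in 第一 or 十一 is never changed by whatever word follows.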

begeekmyfriend commented 5 years ago

That is right, as far as the rules you have learned go.

Jackiexiao commented 5 years ago

Here is another interesting example:

begeekmyfriend commented 5 years ago

一线希望 can be regarded as a regular word in this case, while 一线城市 should be segmented as 一, 线 and 城市. That is why Chinese always drives you mad.

Kyubyong commented 5 years ago

I'm looking at the literature about the tone change rules. Unfortunately, most of them are not clear about the boundaries. But some say the tone change rules MAY work across word boundaries. If my understanding is correct, things are more complicated. If we just think all the tone change rules including third tone, 一, and 不 occur word-internally, things are simple, but I'm not sure if that's true.

begeekmyfriend commented 5 years ago

I do not think Chinese pinyin conversion can ever be made totally correct. There are no rules, only conventions. An enormous pinyin dictionary is indispensable for this problem. That is all we can do about it.

Kyubyong commented 5 years ago

Okay. I've updated it and tried to refine the rules. Feel free to check it.

Weil2017 commented 5 years ago

Hi Kyubyong, the tone change for "一" also depends on context. Some more complicated examples:

一(yi1)层 means the first floor; 一(yi4)层 means one floor or one layer.
一(yi1)级 means the first level (class); 一(yi4)级 means one (more or less) level.

Have you considered using machine learning, such as a CRF, to predict the tone change of 一?
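
As a rough idea of what such a model would look at, here is a sketch of per-position context features for 一. The feature set and the name `yi_features` are assumptions for illustration only; no particular CRF library is implied.

```python
def yi_features(chars, syllables, i):
    """Context features for the 一 at position i (hypothetical feature set)."""
    prev_ch = chars[i - 1] if i > 0 else "<s>"
    next_ch = chars[i + 1] if i + 1 < len(chars) else "</s>"
    next_tone = syllables[i + 1][-1] if i + 1 < len(syllables) else "0"
    return {
        "prev_char": prev_ch,
        "next_char": next_ch,
        "next_tone": next_tone,
        "after_di": prev_ch == "第",  # ordinal marker
    }


print(yi_features("第一层", ["di4", "yi1", "ceng2"], 1))
```

For a bare 一层 these local features are identical for both readings, so a model would also need wider sentence context to tell "the first floor" from "one floor".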


begeekmyfriend commented 5 years ago

I have found a well-designed Chinese pinyin dictionary from espeak with 21567 single characters plus 36098 compound exceptions (including 332 added 'yi' and 10720 added 'bu' exceptions, and 9713 extra 2-syllable words for 3rd-tone sandhi blocking). Would you like to replace the original one with it, @Kyubyong?
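
A dictionary of this shape could be used roughly as follows: try the compound exceptions first (longest match), then fall back to single characters. The entries below are a tiny hand-written toy set, not taken from the espeak data, and `to_pinyin` is an illustrative name.

```python
# Hypothetical toy data; a real dictionary would be far larger.
CHAR_PINYIN = {"孩": "hai2", "子": "zi3", "孔": "kong3", "不": "bu4", "要": "yao4"}
WORD_EXCEPTIONS = {"孩子": "hai2 zi5", "不要": "bu2 yao4"}
MAX_WORD_LEN = max(len(w) for w in WORD_EXCEPTIONS)


def to_pinyin(text):
    """Greedy longest-match lookup: compound exceptions first, then characters."""
    result = []
    i = 0
    while i < len(text):
        for n in range(min(MAX_WORD_LEN, len(text) - i), 1, -1):
            chunk = text[i:i + n]
            if chunk in WORD_EXCEPTIONS:
                result.append(WORD_EXCEPTIONS[chunk])
                i += n
                break
        else:
            # No compound matched here: fall back to the single character.
            result.append(CHAR_PINYIN.get(text[i], text[i]))
            i += 1
    return " ".join(result)


print(to_pinyin("孩子不要"))  # hai2 zi5 bu2 yao4
print(to_pinyin("孔子"))      # kong3 zi3
```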

JohnHerry commented 2 years ago

It is hard to get the correct tone every time for some characters.

As for "一":
一心一意 yi4 xin1 yi2 yi4 (yi1 xin1 yi2 yi4 is fine in spoken Chinese, too)
赵一心 zhao4 yi1 xin1
一起 yi4 qi3
一起案件 yi1 qi3 an4 jian4
三百零一 san1 bai3 ling2 yi1
看一看 kan4 yi5 kan4
一看究竟 yi2 kan4 jiu1 jing4
独一无二 du2 yi1 wu2 er4
一无所有 yi4 wu2 suo3 you3

As for "不":
来不来 lai2 bu5 lai2
不来算了 bu4 lai2 suan4 le5
不得不说 bu4 de2 bu1 shuo1
你不说谁知道 ni3 bu1 shuo1 shui2 zhi1 dao5
不要 bu2 yao4
不三不四 bu4 san1 bu2 si4
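
The two most regular patterns in this list (bu2 before a fourth tone, bu5 in A不A) can be sketched as below. This is a simplified illustration with a made-up name, `apply_bu_rules`; it does not attempt the bu1 readings listed above.

```python
def apply_bu_rules(word, syllables):
    """Apply two common 不 patterns inside one word (one syllable per character)."""
    out = list(syllables)
    for i, ch in enumerate(word):
        if ch != "不" or syllables[i] != "bu4" or i == len(word) - 1:
            continue
        if i > 0 and word[i - 1] == word[i + 1]:
            out[i] = "bu5"  # A不A pattern, e.g. 来不来
        elif syllables[i + 1].endswith("4"):
            out[i] = "bu2"  # before a fourth tone, e.g. 不要
    return out


print(apply_bu_rules("不三不四", "bu4 san1 bu4 si4".split()))
# ['bu4', 'san1', 'bu2', 'si4']
```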

As for consecutive third tones:
蒙古 meng2 gu3
奄奄一息 yan6 yan3 yi4 xi1
取水组 qu6 shui6 zu3
李组长 Li3 zu6 zhang3
懒懒散散 lan6 lan3 san6 san3 OR lan6 lan6 san6 san3
懵懵懂懂 meng6 meng6 dong6 dong3
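
For comparison, the naive third-tone rule (a third tone directly before another third tone is raised) looks like this. It is only a sketch: it writes the changed syllable with tone 2, whereas the list above appears to mark it with 6, and `third_tone_sandhi` is an illustrative name.

```python
def third_tone_sandhi(syllables):
    """Naive rule: a 3rd tone directly before another 3rd tone becomes 2nd tone."""
    out = list(syllables)
    for i in range(len(syllables) - 1):
        # Compare against the original tones so that runs like 3-3-3 become 2-2-3.
        if syllables[i].endswith("3") and syllables[i + 1].endswith("3"):
            out[i] = syllables[i][:-1] + "2"
    return out


print(third_tone_sandhi("qu3 shui3 zu3".split()))    # ['qu2', 'shui2', 'zu3']
print(third_tone_sandhi("li3 zu3 zhang3".split()))   # ['li2', 'zu2', 'zhang3']
```

The second result disagrees with the 李组长 reading listed above, where 李 keeps its third tone, which shows that the naive rule ignores how the phrase is grouped.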

As for “子”:
燕子 yan4 zi5
孩子 hai2 zi5
虫子 chong2 zi5
孔子 kong3 zi3
韩非子 han2 fei1 zi3
五味子 wu3 wei4 zi3
妹子 mei4 zi5
小野妹子 xiao3 ye3 mei4 zi3

As for "个":
个性 ge4 xing4
个体 ge4 ti3
三个和尚 san1 ge5 he2 shang4
打个的 da3 ge5 di1
买个袜子 mai3 ge5 wa4 zi5

As for “头”:
头发 tou2 fa5
头头是道 tou2 tou2 shi4 dao4
尽头 jin4 tou2
个头 ge4 tou2
甜头 tian2 tou5
木头 mu4 tou5
锄头 chu2 tou5
彩头 cai3 tou2

Even the same character in the same word is pronounced differently depending on the speaker's emotion:
大家都要好好的(hao2 hao3 de5)。
你好好的(hao3 hao1 de5)学着点,别人怎么做的!
这就是个好好(hao3 hao3)先生。