Simple context-dependent corrections to pinyin?

ssb22 commented 1 year ago

Just wondering if you can manage these which hopefully shouldn't be too difficult to code for:

If the word is 地: if we're at the very start of the sentence or the previous word was one of 的, 之, 块, 塊, 这, 這, 了, 在, 两 or 兩 then prefer dì (di4), otherwise prefer de (de5)
If the word is 得: if we're at the very start of the sentence or the previous word was one of 的, 之, 而, 同 or 想, then prefer dé (de2), otherwise if the previous word was one of 你, 他, 她, 您, 我, 还, 還 or anything ending with 们 or 們 then prefer děi (dei3), otherwise prefer de (de5)
If the word is 个 or 個 and the previous character was a digit, full-width digit or any of 一, 二, 两, 三, 四, 五, 六, 七, 八, 九, 十, 万, 打, 几, 幾, 那, 哪 then prefer ge5 to ge4
If the word is 只 and the previous character was a digit, full-width digit or any of 一, 二, 两, 三, 四, 五, 六, 七, 八, 九, 十, 万, 几, 幾, 那, 哪 (i.e. same as above list but not 打 this time) then prefer zhī (zhi1) to zhǐ (zhi3)
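
The four rules above could be sketched roughly as follows. This is only an illustration in Python, not CWS's actual code: the function name `preferred_reading`, the set names, and the convention that `prev_word` is `None` at the start of a sentence are all assumptions, and it presumes the text has already been split into words.

```python
from typing import Optional

# Context sets copied from the rules above (names are hypothetical).
DI4_PREV = set("的之块塊这這了在两兩")
DE2_PREV = set("的之而同想")
DEI3_PREV = {"你", "他", "她", "您", "我", "还", "還"}
COUNT_CHARS = set("一二两三四五六七八九十万几幾那哪")


def preferred_reading(word: str, prev_word: Optional[str]) -> Optional[str]:
    """Return the preferred numbered-pinyin reading, or None if no rule applies.

    prev_word is None at the very start of a sentence (an assumed convention).
    """
    if word == "地":
        if prev_word is None or prev_word in DI4_PREV:
            return "di4"
        return "de5"
    if word == "得":
        if prev_word is None or prev_word in DE2_PREV:
            return "de2"
        if prev_word in DEI3_PREV or prev_word.endswith(("们", "們")):
            return "dei3"
        return "de5"
    # The 个/個 and 只 rules look at the previous character, not the whole word.
    last = prev_word[-1] if prev_word else ""
    # str.isdigit() is also true for full-width digits such as "３".
    is_count = last.isdigit() or last in COUNT_CHARS
    if word in ("个", "個") and (is_count or last == "打"):
        return "ge5"
    if word == "只" and is_count:
        return "zhi1"
    return None
```

For example, `preferred_reading("得", "你们")` gives `"dei3"`, while `preferred_reading("只", "打")` gives `None` because 打 only counts for 个/個.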

In each case all that's needed is to make sure the entry with the "preferred" reading is moved to the top of the alternatives. (There may of course be very rare cases where the above rules get it wrong, but in nearly all cases having them will be better than not having them, so I'd suggest putting preferred entries first but still keeping the others available just in case.)
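
To make the "move to the top" step concrete, here is a small hedged sketch in Python. The helper name `promote` and the representation of the alternatives as a plain list of numbered-pinyin strings are assumptions for illustration; the real data structure may well differ.

```python
from typing import List


def promote(alternatives: List[str], preferred: str) -> List[str]:
    """Return a new list with `preferred` first and the rest in their original order.

    If `preferred` is not among the alternatives, nothing is added or removed,
    so the other readings always stay available.
    """
    if preferred not in alternatives:
        return list(alternatives)
    return [preferred] + [r for r in alternatives if r != preferred]
```

With this, `promote(["de5", "di4"], "di4")` yields `["di4", "de5"]`, and a reading that is missing from the list leaves the order untouched.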

My Annogen program found 6,000+ "context" rules (some of which look as far as 5 characters either side of the word), but many of these are likely to be quirks of the corpus I gave it and less accurate in other types of text. I've been trying to work out how to extract something more general. For now, I'm pretty sure the above few rules will work well on any text (along with some of the "long phrase" entries in CedPane that show specific ways of splitting or pronouncing words when they occur in that particular phrase, but we probably shouldn't turn rules as general as these into thousands of phrase entries when they could just as easily be hard-coded).

Thanks.

chinese-words-separator commented 1 year ago

I will implement the semantic corrections for 地 and 得

For 个/個, I think it will conflict with the tone sandhi of 一. I don't plan to implement tone-rule corrections, so that learners can still see the original tones and apply the tone rules they have memorized themselves. Correcting tone changes is tricky to implement, especially runs of third tones that span many words.

To spell out the conflict: if 个 ge4 is changed to ge5 when preceded by 一 yi1, then 一个 will be annotated yi1 ge5. This will confuse learners who have already learned the rule that 一 yi1 becomes yi2 when followed by a 4th tone. If they see 个 annotated as ge5 instead of ge4, they won't be able to tell that they also need to change 一 yi1 to yi2.

https://blog.skritter.com/2013/02/tone-sandhi-tone-changes-for-the-character-%E4%B8%80/

一個/一个 (yí ge), while 個/个 is originally a fourth tone.

The learner needs to know that when 一 yi1 is followed by a 4th tone, e.g., 个 ge4, the 一 yi1 becomes yi2.
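
A tiny sketch of why the annotation matters, assuming a learner (or a program) applies only the single rule just stated and reads the tone off the numbered pinyin of the following syllable. The function name `yi_sandhi` is hypothetical, and the other tone changes of 一 are deliberately left out.

```python
def yi_sandhi(next_syllable: str) -> str:
    """Reading of 一 before `next_syllable`, using only the yi1 -> yi2 rule.

    `next_syllable` is numbered pinyin such as "ge4"; the trailing digit is the tone.
    """
    return "yi2" if next_syllable.endswith("4") else "yi1"
```

`yi_sandhi("ge4")` gives `"yi2"`, but `yi_sandhi("ge5")` gives `"yi1"`: once 个 is shown as ge5, the cue that triggers the change to yi2 is no longer visible.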

Hmm... apparently the learner also needs to know that, besides changing yi1 to yi2, ge4 needs to change to the neutral tone ge5. Where can I find the rule that changes the 4th tone to neutral? I only know three tone rules; I didn't know there were this many nuances to Chinese tone rules 😆 I will add this to CWS's Notes.

I only know that when a syllable is repeated, the second one becomes neutral, e.g., 妈妈 is ma1 ma5 rather than ma1 ma1. I don't have this tone rule in CWS's Notes yet. I think the neutral transformation is there to help the speaker speak more comfortably. I will add the neutral tone rule to CWS's Notes when I see a Chinese learning site explaining the neutral tone change in detail.

chinese-words-separator commented 1 year ago

Before:

After:

di4: 地块地塊地这地

de5: 孩子们快乐地唱歌

de2: 得想得

dei3: 你得你们得你們得

de5: 她唱得很好

Thanks for this improvement suggestion. It can help learners learn more efficiently.

This will be in the 8.24.84.580 release.

chinese-words-separator / chinese-words-separator.github.io

Simple context-dependent corrections to pinyin? #3