facebookresearch / fairseq

Facebook AI Research Sequence-to-Sequence Toolkit written in Python.
MIT License

[Wav2Vec2] Zero-shot Transfer model on Common Voice - Espeak questions for Chinese #4653

Open happylittlecat2333 opened 1 year ago

happylittlecat2333 commented 1 year ago

❓ Questions

Hey Qiantong, @alexeib, @michaelauli,

Thanks a lot for open-sourcing the model weights of your recent paper Simple and Effective Zero-shot Cross-lingual Phoneme Recognition!

Espeak-ng has split its Chinese support from cmn into cmn and cmn-latn-pinyin, the latter to support pinyin input. However, in the latest version, 1.51, cmn has a bug in how it handles tones: it treats the tone as a number. Here is the case for "你好" ("ni3 hao3" in pinyin) in version 1.51:

espeak-ng -v cmn "你好" -x --ipa=1
espeak-ng -v cmn-latn-pinyin "你好" -x --ipa=1
espeak-ng -v cmn-latn-pinyin "ni3 hao3" -x --ipa=1

and the result is below:

(en)_n_ɪ5_θ_ɹ_ˈiː5_ h_ˌeɪ5_ə5_θ_ɹ_ˈiː5_(cmn)_
n_ˈiɜ_ χ_ˈɑu2_
n_ˈiɜ_ χ_ˈɑu2_

Furthermore, I tested the XLSR-53 model on predicting IPA for Chinese speech. The test sentence in Chinese is 宝马配跛骡鞍,貂蝉怨枕董翁榻。 and the predicted result is:

pɑu5mɑ5piɛ5kwɑ5plu5a5ntjɑu5tɕha5njiɛ5ntɕiɛ5ntonɡ5vəɜŋpɑ5

Converting the same text to IPA with Espeak-ng v1.51 (after cleaning out _ and ˈ) gives:

pɑuɜmɑ2phei5po2luoɜa5ntjɑu5ts.haɜnyæ5nts.əɜntonɡ2wuə5ŋthɑ5

which is not consistent with the prediction. I believe the reason is that the Espeak version I used is not the same as the one in your implementation. Could you tell me which Espeak version you used, and the command or script you used for Chinese?

Thanks a lot!
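
The comparison described above (clean the espeak-ng output, then check it against the model prediction) can be sketched as follows. This is a minimal sketch, not the script used in the thread: the helper names `clean_ipa` and `similarity` are hypothetical, and removing spaces and the secondary-stress mark ˌ in addition to _ and ˈ is an assumption. The two long strings are copied from this post.

```python
import difflib


def clean_ipa(raw: str) -> str:
    """Strip espeak-ng separators and stress marks from an IPA string.

    The post only mentions removing '_' and 'ˈ'; also removing 'ˌ' and
    spaces is an assumption made for this sketch.
    """
    for mark in ("_", "ˈ", "ˌ", " "):
        raw = raw.replace(mark, "")
    return raw


def similarity(a: str, b: str) -> float:
    """Rough character-level similarity in [0, 1] (not a phoneme error rate)."""
    return difflib.SequenceMatcher(None, a, b).ratio()


# Strings copied from this thread.
predicted = "pɑu5mɑ5piɛ5kwɑ5plu5a5ntjɑu5tɕha5njiɛ5ntɕiɛ5ntonɡ5vəɜŋpɑ5"
reference = "pɑuɜmɑ2phei5po2luoɜa5ntjɑu5ts.haɜnyæ5nts.əɜntonɡ2wuə5ŋthɑ5"

print(clean_ipa("n_ˈiɜ_ χ_ˈɑu2_"))  # niɜχɑu2
print(round(similarity(predicted, reference), 2))
```

A character-level ratio is only a quick sanity check; a proper evaluation would align phoneme tokens rather than raw characters.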

xuqiantong commented 1 year ago

Hi @happylittlecat2333,

I believe I used v1.50 of espeak. I used cmn for Mandarin and yue for Cantonese. Some example outputs:

from espeakng import ESpeakNG

# 'cmn' is the Mandarin voice; 'yue' would be used for Cantonese
esng = ESpeakNG(voice='cmn')
# ipa=1 requests IPA output
ipa = esng.g2p('妳現在,好漂亮', ipa=1)
print(ipa)
ipa = esng.g2p('宋朝末年年间定居。粉岭围', ipa=1)
print(ipa)
ipa = esng.g2p('宋朝。末年年间定居粉岭围', ipa=1)
print(ipa)

n_ˈi2_ ɕ_ˈiɛ5_n_ ts_ˈai5_x_ˈɑu2_ ph_j_ˈɑu5_ l_ˈiɑ5_ŋ_
s_ˈonɡ5_ ts.h_ˈɑuɜ_ m_ˈo5_ n_ˈiɛɜ_n_ n_ˈiɛɜ_n_ tɕ_ˈiɛ5_n_ t_ˈi5_ŋ_ tɕ_ˈy5_f_ˈəɜ_n_ l_ˈi2_ŋ_ w_ˈeiɜ_
s_ˈonɡ5_ ts.h_ˈɑuɜ_m_ˈo5_ n_ˈiɛɜ_n_ n_ˈiɛɜ_n_ tɕ_ˈiɛ5_n_ t_ˈi5_ŋ_ tɕ_ˈy5_ f_ˈəɜ_n_ l_ˈi2_ŋ_ w_ˈeiɜ_

happylittlecat2333 commented 1 year ago

Thanks a lot @xuqiantong - that's super useful!

happylittlecat2333 commented 1 year ago

Hey Qiantong, @alexeib, @michaelauli,

Thanks for the information. I changed the espeak version to v1.50, but I found a significant bug in how Chinese tones are transcribed. The problem is also described in the espeak-ng issues. The IPA transcription seems wrong for tonal languages (e.g., Mandarin Chinese).

Some example outputs:

from espeakng import ESpeakNG

esng = ESpeakNG(voice='cmn')
ipa = esng.g2p('镜子', ipa=1)    # "jing4 zi5" for pinyin
print(ipa)
ipa = esng.g2p('经过', ipa=1)    # "jing1 guo4" for pinyin
print(ipa)
ipa = esng.g2p('妈 麻 马 骂', ipa=1) # "ma1 ma2 ma3 ma4" for pinyin
print(ipa)
ipa = esng.g2p('jing1 jing2 jing3 jing4', ipa=1)
print(ipa)

tɕ_ˈi5_ŋ_ ts_i̪1_
tɕ_ˈi5_ŋ_ k_ˈuo5_
m_ˈɑ5_ m_ˈɑɜ_ m_ˈɑ2_ m_ˈɑ5_
tɕ_ˈi5_ŋ_ tɕ_ˈiɜ_ŋ_ tɕ_ˈi2_ŋ_ tɕ_ˈi5_ŋ_

You can see that the result for the second character, '麻', introduces a new vowel, ɜ (Unicode U+025C), and that the tones are represented as 5, none, 2, 5 for the four characters, respectively. Espeak also converts "jing1" and "jing4" to the same pronunciation, "tɕ_ˈi5_ŋ_". According to Wikipedia, this doesn't seem right, so I checked other representations.
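
The collision described above can be checked mechanically. The sketch below pairs each pinyin syllable with the IPA syllable espeak produced for it and reports any IPA string shared by more than one input. The helper name `find_collisions` is hypothetical, and the strings are copied from the outputs quoted in this comment; it does not call espeak itself.

```python
from collections import defaultdict


def find_collisions(pinyin: str, ipa: str) -> dict:
    """Group pinyin syllables by their IPA output; keep only shared outputs.

    Assumes both strings hold the same number of space-separated syllables.
    """
    groups = defaultdict(list)
    for syllable, output in zip(pinyin.split(), ipa.split()):
        groups[output].append(syllable)
    return {out: syls for out, syls in groups.items() if len(syls) > 1}


print(find_collisions("ma1 ma2 ma3 ma4", "m_ˈɑ5_ m_ˈɑɜ_ m_ˈɑ2_ m_ˈɑ5_"))
# {'m_ˈɑ5_': ['ma1', 'ma4']}
print(find_collisions("jing1 jing2 jing3 jing4",
                      "tɕ_ˈi5_ŋ_ tɕ_ˈiɜ_ŋ_ tɕ_ˈi2_ŋ_ tɕ_ˈi5_ŋ_"))
# {'tɕ_ˈi5_ŋ_': ['jing1', 'jing4']}
```

In both cases tone 1 and tone 4 end up with identical IPA strings, which is the ambiguity reported here.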

If we do not use IPA, the tones are correct:

ipa = esng.g2p('妈 麻 马 骂')
print(ipa)
ipa = esng.g2p('ma1 ma2 ma3 ma4')
print(ipa)

m'A55_| m'A35_| m'A21_| m'A51_|
m'A55_| m'A35_| m'A21_| m'A51_|

Here, the system correctly identifies the same vowel [A] for all four characters and accurately distinguishes the tones. So I think the problem lies in the conversion to IPA. I also tested espeak v1.51 and found that the bug remains. Do you have any suggestions or advice for this bug?

Thanks a lot!
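
One possible workaround, given that the non-IPA output above keeps the four tones distinct, is to read the tone from that output instead of from the IPA string. The sketch below is only an illustration under stated assumptions: the helper name `tones_from_phonemes` is hypothetical, and the contour-to-tone table is inferred solely from the four examples in this comment (55, 35, 21, 51 for ma1 to ma4), so neutral tone, tone sandhi, and any other contours espeak may emit are not covered.

```python
import re

# Assumption: mapping inferred only from the 'ma1 ma2 ma3 ma4' example above.
CONTOUR_TO_TONE = {"55": 1, "35": 2, "21": 3, "51": 4}


def tones_from_phonemes(phonemes: str) -> list:
    """Extract a pinyin tone number per syllable from espeak's non-IPA output.

    Returns None for a syllable whose contour is missing or not in the table.
    """
    tones = []
    for syllable in phonemes.split():
        match = re.search(r"\d+", syllable)
        contour = match.group(0) if match else None
        tones.append(CONTOUR_TO_TONE.get(contour))
    return tones


print(tones_from_phonemes("m'A55_| m'A35_| m'A21_| m'A51_|"))  # [1, 2, 3, 4]
```

Returning None for unknown contours, rather than guessing, makes it easy to spot syllables that the table does not yet handle.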