Cantonese output appears when going from Japanese to Chinese in cross-lingual mode

liujiaqi7998 commented 6 days ago

Describe the bug

As in the title: in cross-lingual mode, generating Chinese speech from a Japanese reference voice sometimes produces Cantonese output.

To reproduce

Collect some clean Japanese voice recordings to use as reference samples for cross-lingual cloning.
Build a mapping table of the Chinese text to generate for each sample.
Run the conversion with the following code:

    # 目标输出文字 holds the target Chinese text for the current sample
    tts_text = "<|zh|>" + 目标输出文字
    prompt_speech_16k = load_wav(person_voice_file, prompt_sr)
    for i, j in enumerate(cosyvoice.inference_cross_lingual(tts_text, prompt_speech_16k, stream=False)):
        torchaudio.save(chinese_person_voice_file, j['tts_speech'], 22050)

The output is a mix of Mandarin and Cantonese.

Expected behavior

Data set: 433 source audio clips with the corresponding pre-written Chinese text. The clips average under 3 seconds, and each text is about 5 characters long. Conclusion: even with the "<|zh|>" tag added, more than 50% of the outputs still come out in Cantonese.

aluminumbox commented 6 days ago

Well, this is a drawback of BPE tokenization. Zero-shot/cross-lingual mode is not very stable because Chinese (Mandarin) and Cantonese share the same characters.

liujiaqi7998 commented 6 days ago

Thanks a lot. Yes, that is exactly what I expected. My guess is that the model was trained with the same strings for Chinese and Cantonese. My workaround is to add a check on the output and regenerate with a new random seed if the result is not what I expected.
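A minimal sketch of that workaround, not the code actually used here. The names `generate_with_retry`, `synthesize`, `is_expected`, and `set_seed` are hypothetical. How to detect Cantonese is not specified in this thread, so the check is left as a callable supplied by the caller, and the seed setter defaults to the standard library's `random.seed` only so the sketch runs without a model.

```python
import random
from typing import Any, Callable, Optional


def generate_with_retry(
    synthesize: Callable[[], Any],
    is_expected: Callable[[Any], bool],
    set_seed: Callable[[int], Any] = random.seed,
    max_attempts: int = 5,
    base_seed: int = 0,
) -> Optional[Any]:
    """Re-run synthesis with a new seed until the output passes the check.

    synthesize:  hypothetical callable that runs one generation.
    is_expected: hypothetical callable that returns True for acceptable output.
    set_seed:    callable that seeds the random generator the model samples from.
    Returns the first accepted output, or None if every attempt is rejected.
    """
    for attempt in range(max_attempts):
        # Use a different seed on each attempt so sampling can take another path.
        set_seed(base_seed + attempt)
        output = synthesize()
        if is_expected(output):
            return output
    return None
```

With the real model, `synthesize` would wrap the `inference_cross_lingual` call from the report, and `set_seed` would presumably be `torch.manual_seed`, assuming the model's sampling follows PyTorch's global seed. Returning `None` after `max_attempts` leaves it to the caller to decide what to do with samples that never pass.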

aluminumbox commented 6 days ago

nice trick

Anmidy commented 4 days ago

@liujiaqi7998 Hi, is the target text in your tts_text parameter Japanese? Is person_voice_file.wav a Japanese recording? Is this code meant to generate Chinese audio from Japanese text?

Mine is the opposite of yours: I want to generate Japanese audio from Chinese text, using the following code:

cosyvoice = CosyVoice('../../pretrained_models/CosyVoice-300M')
tts_text = "<|jp|>你好"
prompt_speech_22k = load_wav('../../cross_lingual_jp.wav', 22050)
for i, j in enumerate(cosyvoice.inference_cross_lingual(tts_text, prompt_speech_22k, stream=False)):
    torchaudio.save('cross_lingual_zh2jp.wav', j['tts_speech'], 22050)

cross_lingual_jp.wav is a Japanese audio file, but the generated cross_lingual_zh2jp.wav is still spoken in Chinese rather than the expected Japanese. How should I change this?

liujiaqi7998 commented 4 days ago

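@Anmidy First, the model's output follows the input string, so you need to translate "你好" into "こんにちは" yourself. In theory, load_wav should load audio in the source language (I am not sure about this).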
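To illustrate the point that the tag does not translate the text, here is a small sketch that builds the tagged string and flags text that does not look Japanese. `build_tts_text` and `looks_japanese` are hypothetical helpers, not part of CosyVoice. The tag list contains only the tags that appear in this thread, and the kana check is a rough heuristic: Japanese written purely in kanji would be flagged as well.

```python
import re

# Only the language tags that appear in this thread; the model may accept more.
KNOWN_TAGS = ("zh", "en", "jp")

# Hiragana (U+3040-U+309F) and katakana (U+30A0-U+30FF).
_KANA = re.compile(r"[\u3040-\u30ff]")


def looks_japanese(text: str) -> bool:
    """Heuristic: treat text as Japanese if it contains at least one kana."""
    return _KANA.search(text) is not None


def build_tts_text(lang: str, text: str) -> str:
    """Prefix text with a language tag; the text itself is left unchanged."""
    if lang not in KNOWN_TAGS:
        raise ValueError(f"unknown language tag: {lang!r}")
    if lang == "jp" and not looks_japanese(text):
        raise ValueError("text has no kana; translate it to Japanese first")
    return f"<|{lang}|>{text}"
```

The returned string would be passed as the first argument to `inference_cross_lingual`. With this check, `build_tts_text("jp", "你好")` raises instead of silently producing Chinese speech.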

Anmidy commented 4 days ago

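@liujiaqi7998 Do you mean that none of the three methods, inference_sft, inference_zero_shot, and inference_cross_lingual, can directly turn Chinese text into Japanese audio? But the example below from the README looks as if it turns English text into Chinese audio. Am I misunderstanding it?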

# cross_lingual usage
prompt_speech_16k = load_wav('cross_lingual_prompt.wav', 16000)
for i, j in enumerate(cosyvoice.inference_cross_lingual('<|en|>And then later on, fully acquiring that company. So keeping management in line, interest in line with the asset that\'s coming into the family is a reason why sometimes we don\'t buy the whole thing.', prompt_speech_16k, stream=False)):
    torchaudio.save('cross_lingual_{}.wav'.format(i), j['tts_speech'], 22050)

FunAudioLLM / CosyVoice, issue #385