FunAudioLLM / CosyVoice

Multi-lingual large voice generation model, providing inference, training and deployment full-stack ability.
https://funaudiollm.github.io/
Apache License 2.0

Cantonese output appears when converting from Japanese to Chinese in cross-lingual mode #385

Open liujiaqi7998 opened 6 days ago

liujiaqi7998 commented 6 days ago

Describe the bug

In cross-lingual mode, generating Chinese speech from a Japanese reference voice produces Cantonese output.

To reproduce

  1. Collect a set of clean Japanese speech recordings to use as reference samples for cross-lingual voice cloning.
  2. Build a map from each recording to the Chinese text that should be synthesized.
  3. Run the conversion with the following code:

         # 目标输出文字 is the target Chinese text for the current sample
         tts_text = "<|zh|>" + 目标输出文字
         prompt_speech_16k = load_wav(person_voice_file, prompt_sr)
         for i, j in enumerate(cosyvoice.inference_cross_lingual(tts_text, prompt_speech_16k, stream=False)):
             torchaudio.save(chinese_person_voice_file, j['tts_speech'], 22050)

  4. The output is a mix of Mandarin and Cantonese.

Expected behavior

Dataset: 433 source audio clips, each with corresponding pre-generated Chinese text. The clips average under 3 seconds, and each text is about 5 characters long.

Result: even with the "<|zh|>" tag added, more than 50% of the outputs still come out in Cantonese.


aluminumbox commented 6 days ago

Well, this is a drawback of BPE tokenization. Zero-shot/cross-lingual mode is not very stable because Mandarin Chinese and Cantonese are written with the same characters.

liujiaqi7998 commented 6 days ago

Thanks a lot. Yes, that is exactly what I expected. My guess is that the model was trained on the same strings for both Mandarin and Cantonese. My workaround is to check each output and, if the result is not what I expected, regenerate it with a new random seed.

aluminumbox commented 6 days ago

nice trick
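
The retry workaround described above could be sketched roughly as follows. This is only an illustration, not code from the thread: `generate_with_retry`, `synthesize`, `is_expected`, and `set_seed` are hypothetical names. In real use, `synthesize` would wrap the `cosyvoice.inference_cross_lingual` call, `set_seed` would be something like `torch.manual_seed`, and `is_expected` would be whatever check you use to tell Mandarin from Cantonese (the thread does not say how the output was judged).

```python
import random
from typing import Callable, Optional, Tuple, TypeVar

T = TypeVar("T")


def generate_with_retry(
    synthesize: Callable[[], T],
    is_expected: Callable[[T], bool],
    set_seed: Callable[[int], object] = random.seed,
    max_attempts: int = 5,
    base_seed: int = 0,
) -> Tuple[Optional[T], int]:
    """Re-run synthesize() under a fresh seed until is_expected() accepts the output.

    Returns (output, attempts_used). The output is None if every attempt
    was rejected, so the caller can decide how to handle a hard failure.
    """
    for attempt in range(max_attempts):
        # A different seed per attempt, so each run samples differently.
        set_seed(base_seed + attempt)
        output = synthesize()
        if is_expected(output):
            return output, attempt + 1
    return None, max_attempts
```

Capping the number of attempts matters here: with more than half of the outputs reported as Cantonese, some samples may need several tries, and a few may never pass.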

Anmidy commented 4 days ago

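@liujiaqi7998 Hi, is the target text in your tts_text parameter Japanese? Is person_voice_file.wav a Japanese recording? Is this code meant to generate Chinese audio from Japanese text?

I want the opposite: to generate Japanese audio from Chinese text. My code is as follows: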

import torchaudio
from cosyvoice.cli.cosyvoice import CosyVoice
from cosyvoice.utils.file_utils import load_wav

cosyvoice = CosyVoice('../../pretrained_models/CosyVoice-300M')
tts_text = "<|jp|>你好"
prompt_speech_22k = load_wav('../../cross_lingual_jp.wav', 22050)
for i, j in enumerate(cosyvoice.inference_cross_lingual(tts_text, prompt_speech_22k, stream=False)):
    torchaudio.save('cross_lingual_zh2jp.wav', j['tts_speech'], 22050)

cross_lingual_jp.wav is a Japanese recording, but the generated cross_lingual_zh2jp.wav still contains Chinese speech rather than the expected Japanese. What do I need to change?

liujiaqi7998 commented 4 days ago

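@Anmidy First, the model's output follows the input string, so you need to translate "你好" into "こんにちは" yourself. In theory, load_wav should load audio in the source language (I am not sure about this).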
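
A minimal sketch of that advice, assuming the language tag only labels the text and does not translate it: the text has to be written in the target language before the tag is prepended. `build_tts_text` and `KNOWN_TAGS` are hypothetical helpers, and the tag set is limited to the tags that appear in this thread, not the model's full list.

```python
# Language tags seen in this thread; an assumption, not an exhaustive list.
KNOWN_TAGS = {"zh": "<|zh|>", "jp": "<|jp|>", "en": "<|en|>"}


def build_tts_text(lang: str, text: str) -> str:
    """Prefix text that is already written in the target language with its tag."""
    if lang not in KNOWN_TAGS:
        raise ValueError(f"unknown language tag: {lang!r}")
    stripped = text.strip()
    if not stripped:
        raise ValueError("text must not be empty")
    return KNOWN_TAGS[lang] + stripped


# The text was translated to Japanese beforehand ("你好" -> "こんにちは").
tts_text = build_tts_text("jp", "こんにちは")
```

The resulting `tts_text` would then be passed to `inference_cross_lingual` in place of `"<|jp|>你好"`.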

Anmidy commented 4 days ago

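@liujiaqi7998 So you mean that none of the three methods, inference_sft, inference_zero_shot, and inference_cross_lingual, can directly turn Chinese text into Japanese audio? But this example from the README looks as if it turns English text into Chinese audio. Am I misunderstanding it?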

# cross_lingual usage
prompt_speech_16k = load_wav('cross_lingual_prompt.wav', 16000)
for i, j in enumerate(cosyvoice.inference_cross_lingual('<|en|>And then later on, fully acquiring that company. So keeping management in line, interest in line with the asset that\'s coming into the family is a reason why sometimes we don\'t buy the whole thing.', prompt_speech_16k, stream=False)):
    torchaudio.save('cross_lingual_{}.wav'.format(i), j['tts_speech'], 22050)