Open liujiaqi7998 opened 6 days ago
Well, this is a drawback of BPE tokenization. Zero-shot/cross-lingual mode is not very stable because Chinese and Cantonese share the same characters.
Thanks a lot. Yes, that's exactly what I expected. My guess is that the model was trained with the same strings for Chinese and Cantonese. My workaround is to add a check on the output and regenerate with a new random seed if the result is unexpected.
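The check-and-retry workaround described above could look roughly like the sketch below. It is not code from the issue author: `synthesize_with_retry`, `synthesize`, and `is_expected` are hypothetical names, and how the output is judged (for example with some spoken-language identification step) is left entirely to the caller.

```python
import random
from typing import Callable, Optional, TypeVar

T = TypeVar("T")


def synthesize_with_retry(
    synthesize: Callable[[int], T],
    is_expected: Callable[[T], bool],
    max_attempts: int = 5,
    rng: Optional[random.Random] = None,
) -> Optional[T]:
    """Call synthesize(seed) with fresh random seeds until is_expected accepts.

    Both callables are hypothetical and supplied by the caller:
    synthesize runs one TTS attempt for the given seed, and is_expected
    is the "judgment" on the output. Returns None if every attempt fails.
    """
    rng = rng or random.Random()
    for _ in range(max_attempts):
        seed = rng.randrange(2**31)  # new random seed for each attempt
        result = synthesize(seed)
        if is_expected(result):
            return result
    return None
```

In practice, `synthesize` would presumably seed the framework (for example with `torch.manual_seed(seed)`) before calling `cosyvoice.inference_cross_lingual(...)`; whether that seed governs all of the model's sampling is an assumption here, not something confirmed in this thread.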
nice trick
@liujiaqi7998 Hi, is the target text in your tts_text parameter Japanese? Is person_voice_file.wav a Japanese audio file? Is this code meant to generate Chinese audio from Japanese text?
I want the opposite: to generate Japanese audio from Chinese text. My code is as follows:
cosyvoice = CosyVoice('../../pretrained_models/CosyVoice-300M')
tts_text = "<|jp|>你好"
prompt_speech_22k = load_wav('../../cross_lingual_jp.wav', 22050)
for i, j in enumerate(cosyvoice.inference_cross_lingual(tts_text, prompt_speech_22k, stream=False)):
    torchaudio.save('cross_lingual_zh2jp.wav', j['tts_speech'], 22050)
cross_lingual_jp.wav is a Japanese audio file, but the generated cross_lingual_zh2jp.wav still contains Chinese speech rather than the expected Japanese. How should I change this?
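@Anmidy First, the model's output follows the input string, so you need to translate “你好” into “こんにちは” yourself. In theory, load_wav should load audio in the source language (I'm not sure about this).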
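To illustrate the point that the language tag does not translate the text, here is a minimal sketch of a guard that refuses to prepend `<|jp|>` to text that does not look Japanese. `contains_kana` and `build_jp_tts_text` are hypothetical helpers, not part of CosyVoice, and the kana test is only a heuristic: valid Japanese written entirely in kanji would be rejected too.

```python
def contains_kana(text: str) -> bool:
    """Return True if text has at least one hiragana or katakana character."""
    # Hiragana is U+3040..U+309F and katakana is U+30A0..U+30FF.
    return any("\u3040" <= ch <= "\u30ff" for ch in text)


def build_jp_tts_text(text: str) -> str:
    """Prefix already-translated Japanese text with the <|jp|> tag.

    Heuristic assumption: text without any kana is probably untranslated
    (for example, still Chinese), so it is rejected instead of tagged.
    """
    if not contains_kana(text):
        raise ValueError("text has no kana; translate it to Japanese first")
    return "<|jp|>" + text
```

With this guard, `build_jp_tts_text("こんにちは")` yields the tagged string to pass as `tts_text`, while `build_jp_tts_text("你好")` raises instead of silently producing Chinese speech.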
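@liujiaqi7998 So you mean that none of the three methods (inference_sft, inference_zero_shot, and inference_cross_lingual) can directly turn Chinese text into Japanese audio? But this example in the README looks as if it converts English text into Chinese audio. Am I misunderstanding it?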
# cross_lingual usage
prompt_speech_16k = load_wav('cross_lingual_prompt.wav', 16000)
for i, j in enumerate(cosyvoice.inference_cross_lingual('<|en|>And then later on, fully acquiring that company. So keeping management in line, interest in line with the asset that\'s coming into the family is a reason why sometimes we don\'t buy the whole thing.', prompt_speech_16k, stream=False)):
    torchaudio.save('cross_lingual_{}.wav'.format(i), j['tts_speech'], 22050)
Describe the bug
As the title says: in cross-lingual voice-cloning mode, going from Japanese to Chinese produces Cantonese output.
To Reproduce
tts_text = "<|zh|>" + 目标输出文字  # 目标输出文字 holds the target output text
prompt_speech_16k = load_wav(person_voice_file, prompt_sr)
for i, j in enumerate(cosyvoice.inference_cross_lingual(tts_text, prompt_speech_16k, stream=False)):
    torchaudio.save(chinese_person_voice_file, j['tts_speech'], 22050)
Expected behavior
Dataset: 433 original audio clips with corresponding pre-generated Chinese text. The clips average under 3 seconds, and each pre-generated text is about 5 characters long. Conclusion: even with the "<|zh|>" tag added, more than 50% of the outputs still come out in Cantonese.