FunAudioLLM / CosyVoice

Multi-lingual large voice generation model, providing full-stack inference, training, and deployment capabilities.
https://funaudiollm.github.io/
Apache License 2.0
5.29k stars, 543 forks

Problems when sentence is mix of Chinese and English #461

Open jacksonjack001 opened 3 days ago

jacksonjack001 commented 3 days ago

When CosyVoice synthesizes text that mixes Chinese and English, uppercase English text is handled badly: a single word is read as multiple words, and the result has a strong non-English accent. This issue needs to be addressed.

jacksonjack001 commented 3 days ago

For example, the two calls below use the same sentence and differ only in letter case, yet they produce very different audio:

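    import torchaudio

    # `cosyvoice` is an already-loaded CosyVoice model instance.
    text = "大家好,这里给大家介绍一篇名为AN IMAGE IS WORTH 16X16 WORDS: TRANSFORMERS FOR IMAGE RECOGNITION AT SCALE的论文,"

    # Uppercase English: single words are read as multiple words.
    output = cosyvoice.inference_sft(text.replace("-", ""), "中文女")
    for i, j in enumerate(output):
        # Index the file name so later chunks do not overwrite earlier ones.
        torchaudio.save("sft_uppercase_{}.wav".format(i), j["tts_speech"], 22050)

    # Same sentence with the English lowercased, for comparison.
    output = cosyvoice.inference_sft(text.replace("-", "").lower(), "中文女")
    for i, j in enumerate(output):
        torchaudio.save("sft_lowercase_{}.wav".format(i), j["tts_speech"], 22050)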
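
If the lowercase output is the one you want, a possible stopgap is to lowercase the all-caps English words before passing the text to `inference_sft`. The sketch below is only an illustration of that idea: `lowercase_english_words` and its `keep` parameter are hypothetical names, not part of CosyVoice, and whether this fully removes the accent problem is an assumption that has not been verified against the model.

```python
import re

# Hypothetical helper, not part of CosyVoice.
# Matches runs of two or more uppercase ASCII letters; Chinese characters,
# digits, and single capital letters are left untouched.
_UPPER_WORD = re.compile(r"[A-Z]{2,}")


def lowercase_english_words(text, keep=frozenset()):
    """Lowercase all-caps ASCII words, except those listed in `keep`."""
    def repl(match):
        word = match.group(0)
        # Acronyms that should still be spelled out can be listed in `keep`.
        return word if word in keep else word.lower()

    return _UPPER_WORD.sub(repl, text)


if __name__ == "__main__":
    sample = "名为AN IMAGE IS WORTH 16X16 WORDS的论文, 在GPU上训练"
    print(lowercase_english_words(sample, keep={"GPU"}))
```

Unlike calling `.lower()` on the whole string, this leaves listed acronyms such as `GPU` in uppercase, which may matter if they are meant to be spelled out letter by letter.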