Open jacksonjack001 opened 3 days ago
For example, the two calls in the code below synthesize the same sentence, but the resulting voices are very different!
text = "大家好,这里给大家介绍一篇名为AN IMAGE IS WORTH 16X16 WORDS: TRANSFORMERS FOR IMAGE RECOGNITION AT SCALE的论文,"

output = cosyvoice.inference_sft(
    text.replace("-", ""),
    "中文女",
)
for i, j in enumerate(output):
    torchaudio.save("sft_uppercase.wav".format(i), j["tts_speech"], 22050)

output = cosyvoice.inference_sft(
    text.replace("-", "").lower(),
    "中文女",
)
for i, j in enumerate(output):
    torchaudio.save("sft_lowercase.wav".format(i), j["tts_speech"], 22050)
When CosyVoice is given mixed Chinese and English text and the English is in uppercase, it reads a single word as multiple words, which results in a strong non-English accent. This issue needs to be addressed.
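As a stopgap, the `.lower()` call from the second snippet could be made a bit more selective, so that all-caps words are lowercased while chosen acronyms stay uppercase. The sketch below is only an illustration: `lowercase_caps_words` and its `keep` parameter are hypothetical names, not part of CosyVoice, and whether acronyms should stay uppercase is my assumption.

```python
import re

# Hypothetical helper, not part of CosyVoice.
# Matches runs of ASCII letters; Chinese characters, digits and punctuation are left alone.
_ASCII_WORD = re.compile(r"[A-Za-z]+")


def lowercase_caps_words(text, keep=frozenset()):
    """Lowercase all-caps ASCII words of two or more letters, except those in `keep`."""

    def _fix(match):
        word = match.group(0)
        # Only touch words that are entirely uppercase, longer than one letter,
        # and not listed as an acronym to preserve (assumption: acronyms should
        # still be spelled out letter by letter).
        if len(word) > 1 and word.isupper() and word not in keep:
            return word.lower()
        return word

    return _ASCII_WORD.sub(_fix, text)
```

The result would then be passed to `cosyvoice.inference_sft(...)` in place of `text.replace("-", "").lower()`. Single letters such as the `X` in `16X16` are left unchanged by this sketch.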