2noise / ChatTTS

A generative speech model for daily dialogue.
https://2noise.com

Accelerated inference and streaming speech output on the existing ChatTTS model #226

Closed hwang824 closed 2 weeks ago

hwang824 commented 1 month ago

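ChatTTS is indeed the current ceiling for open-source TTS. For practical use, however, the following problems still need to be solved:

  1. Training your own voice (cloning)
  2. Faster inference (inference is currently too slow, which makes real-time conversation with a bot very difficult)
  3. Streaming output of inference results (real-time conversation with a bot requires streaming output)

I asked the author about inference acceleration and streaming output, and the author hopes the community will develop them independently. Has anyone interested already started working on this?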

shirubei commented 1 month ago

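I can really relate to point 2. In my local environment with vits-fine-tuning, once the model is loaded, generating a 3-5 second clip takes less than 1 s, while ChatTTS takes more than 20 s. Also, adding [laugh] does not reliably produce laughter. It feels like a gacha draw: sometimes it works, sometimes it doesn't.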

Pydataman commented 1 month ago

Without data, everything is wasted effort.

matbee-eth commented 1 month ago

ChatTTS is indeed the ceiling of the current open-source TTS. However, for practical application, the following problems should be solved:

  1. Train your own voice (clone)
  2. Accelerated inference (the current inference speed is too slow, and it is difficult to achieve real-time dialogue with a bot)
  3. Streaming output of inference results (streaming output is necessary for real-time dialogue with a bot)

I consulted with the author about inference acceleration, and streaming output, and the author hopes that the community will develop it on its own. I don't know if there are any interested friends who are already doing it?

It is hard to develop this on your own without the training scripts and dataset formatting for their LLaMA model or their VQ encoder.

gatusokaka commented 1 month ago

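There are also the laughter and pause bugs, which I hope the author fixes as soon as possible. Being able to insert laughter and pauses accurately is important.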

ManBali commented 4 weeks ago

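There are also the laughter and pause bugs, which I hope the author fixes as soon as possible. Being able to insert laughter and pauses accurately is important.

Easy to say. If you think you can do it, go ahead and do it yourself.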

Strive-for-excellence commented 3 weeks ago

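ChatTTS is indeed the current ceiling for open-source TTS. For practical use, however, the following problems still need to be solved:

  1. Training your own voice (cloning)
  2. Faster inference (inference is currently too slow, which makes real-time conversation with a bot very difficult)
  3. Streaming output of inference results (real-time conversation with a bot requires streaming output)

I asked the author about inference acceleration and streaming output, and the author hopes the community will develop them independently. Has anyone interested already started working on this?

Does the author have any plans to open-source voice cloning?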

fumiama commented 2 weeks ago

Streaming output has been added. The other requests duplicate other issues, so I am closing this issue and keeping only one copy of each.

statsmind commented 3 days ago

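Streaming output is one thing, but streaming input matters too. You can't wait for the LLM to return its entire result before converting it to speech, or it will be far too slow.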

fumiama commented 2 days ago

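You can't wait for the LLM to return its entire result before converting it to speech, or it will be far too slow.

The way ChatTTS works means it needs textual context. In other words, it needs at least a passage of text to work with, unlike traditional TTS, which can be stitched together one syllable at a time. If the LLM returns a long passage, I recommend splitting it by sentence in your own code and then calling ChatTTS inference on each sentence in turn.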
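The sentence-splitting approach suggested above could look roughly like the sketch below. It buffers text chunks as they arrive from an LLM stream, emits each sentence as soon as its closing punctuation shows up, and hands it to a synthesis function. Everything here is an assumption for illustration: `split_sentences`, `stream_sentences`, and `speak_stream` are hypothetical helper names, the punctuation set is a guess that will need tuning, and `synthesize` stands in for whatever wrapper you write around your own ChatTTS inference call (it is not a ChatTTS API).

```python
import re
from typing import Callable, Iterable, Iterator, List, TypeVar

T = TypeVar("T")

# Assumed sentence terminators (Chinese and English). Note that "." will also
# split decimals such as "3.5"; adjust this set for your own text.
_TERMINATORS = "。!?;!?;.\n"
_CLASS = re.escape(_TERMINATORS)
# One sentence = a run of non-terminators followed by any trailing terminators.
_SENTENCE = re.compile("[^%s]+[%s]*" % (_CLASS, _CLASS))


def split_sentences(text: str) -> List[str]:
    """Split a complete passage into sentences, dropping empty pieces."""
    pieces = (m.strip() for m in _SENTENCE.findall(text))
    return [p for p in pieces if p]


def stream_sentences(chunks: Iterable[str]) -> Iterator[str]:
    """Yield complete sentences as soon as they appear in a chunked stream."""
    buffer = ""
    for chunk in chunks:
        buffer += chunk
        # Position of the last terminator seen so far (-1 if none).
        last = max(buffer.rfind(t) for t in _TERMINATORS)
        if last == -1:
            continue  # no full sentence yet, keep buffering
        complete, buffer = buffer[: last + 1], buffer[last + 1 :]
        yield from split_sentences(complete)
    tail = buffer.strip()
    if tail:
        yield tail  # flush whatever is left when the stream ends


def speak_stream(chunks: Iterable[str], synthesize: Callable[[str], T]) -> List[T]:
    """Call the user-supplied `synthesize` once per sentence, in order."""
    return [synthesize(sentence) for sentence in stream_sentences(chunks)]
```

In practice `chunks` would be the token stream from the LLM, and `synthesize` would play or queue the audio for each sentence, so speech for the first sentence can start while later sentences are still being generated. Very short sentences give the model little context, so merging neighbouring short sentences before synthesis may be worth trying.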