Accelerating inference and adding streaming speech output on the existing ChatTTS model

2noise / ChatTTS

A generative speech model for daily dialogue.

https://2noise.com


Accelerating inference and adding streaming speech output on the existing ChatTTS model #226

Closed hwang824 closed 2 weeks ago

hwang824 commented 1 month ago

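ChatTTS really is the ceiling for open-source TTS right now. For practical use, though, the following problems still need to be solved:

1. Training your own voice (cloning)
2. Faster inference (inference is currently too slow, which makes real-time conversation with a bot very hard)
3. Streaming output of inference results (real-time conversation with a bot requires streaming output)

I asked the author about inference acceleration and streaming output, and the author hopes the community will develop them independently. Has anyone interested already started working on this?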

shirubei commented 1 month ago

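I feel point 2 very strongly. In my local environment, with vits-fine-tuning, once the model is loaded, generating a 3-5 second clip takes under 1 s, whereas ChatTTS takes more than 20 seconds.

Another thing: adding [laugh] does not reliably produce laughter. It feels like a gacha draw; sometimes it works and sometimes it doesn't.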
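The timings above can be compared as a real-time factor (synthesis time divided by audio duration; values below 1 mean faster than real time). This is only a rough sketch: the 4-second clip length is an assumed midpoint of the 3-5 second range, and 1 s and 20 s are taken as the bounds quoted above rather than measured values.

```python
def real_time_factor(synthesis_seconds: float, audio_seconds: float) -> float:
    """Return synthesis time divided by the duration of the generated audio."""
    if audio_seconds <= 0:
        raise ValueError("audio_seconds must be positive")
    return synthesis_seconds / audio_seconds


# Assumption: a 4 s clip, the midpoint of the 3-5 s range mentioned above.
AUDIO_SECONDS = 4.0

vits_rtf = real_time_factor(1.0, AUDIO_SECONDS)      # "under 1 s" taken as an upper bound
chattts_rtf = real_time_factor(20.0, AUDIO_SECONDS)  # "more than 20 s" taken as a lower bound

print(f"vits-fine-tuning RTF <= {vits_rtf}")
print(f"ChatTTS RTF >= {chattts_rtf}")
```

Under these assumptions, real-time conversation needs the factor to be well below 1, since the bot also has to wait for the text to be produced.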

Pydataman commented 1 month ago

Without data, it's all wasted effort.

matbee-eth commented 1 month ago

ChatTTS is indeed the ceiling of current open-source TTS. However, for practical application, the following problems should be solved:

1. Train your own voice (clone)

2. Accelerated inference (the current inference speed is too slow, and it is difficult to achieve real-time dialogue with a bot)

3. Streaming output of inference results (streaming output is necessary for real-time dialogue with a bot)

I consulted the author about inference acceleration and streaming output, and the author hopes that the community will develop them on its own. Are there any interested people who are already working on this?

It's hard to develop this on your own without the training scripts or dataset formatting for their LLaMA model or their VQ encoder.

gatusokaka commented 1 month ago

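There are also the laughter and pause bugs; I'd ask the author to fix them as soon as possible. Being able to insert laughter and pauses accurately is important.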

ManBali commented 4 weeks ago

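There are also the laughter and pause bugs; I'd ask the author to fix them as soon as possible. Being able to insert laughter and pauses accurately is important.

Listen to yourself. If you think you can do it, go ahead and do it yourself.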

Strive-for-excellence commented 3 weeks ago

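ChatTTS really is the ceiling for open-source TTS right now. For practical use, though, the following problems still need to be solved:

1. Training your own voice (cloning)

2. Faster inference (inference is currently too slow, which makes real-time conversation with a bot very hard)

3. Streaming output of inference results (real-time conversation with a bot requires streaming output)

I asked the author about inference acceleration and streaming output, and the author hopes the community will develop them independently. Has anyone interested already started working on this?

On voice cloning: does the author have any plans to open-source it?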

fumiama commented 2 weeks ago

Streaming output has been added. The other requests duplicate other issues, so I'm closing this issue and keeping only one copy.

statsmind commented 3 days ago

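Streaming output is one part, but streaming input matters just as much. You can't wait for the LLM to return its entire result before converting it to speech, or it would be far too slow.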

fumiama commented 2 days ago

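You can't wait for the LLM to return its entire result before converting it to speech, or it would be far too slow.

The way ChatTTS works means it needs textual context. In other words, it needs at least a passage of text to work with, unlike traditional TTS, which can splice individual character sounds together one by one. If the LLM returns a long passage, I suggest splitting it by sentence in your own code and then calling ChatTTS inference on each sentence in turn.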
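The sentence-by-sentence approach suggested above could look roughly like the sketch below. It buffers text chunks as they arrive from an LLM, emits each sentence as soon as it is complete, and passes it to a synthesis function. The names `iter_sentences`, `speak_stream`, `HARD_STOPS`, and the `synthesize` callback are hypothetical and not part of ChatTTS; in practice `synthesize` would wrap whatever ChatTTS inference call you already use. The punctuation set is also an assumption and may need tuning for your text.

```python
from typing import Callable, Iterable, Iterator

# Characters that always end a sentence (assumed set; extend as needed).
HARD_STOPS = "。！？!?"


def _has_content(text: str) -> bool:
    """True if the text contains at least one letter, digit, or CJK character."""
    return any(c.isalnum() for c in text)


def iter_sentences(chunks: Iterable[str]) -> Iterator[str]:
    """Yield complete sentences as soon as they appear in a stream of text chunks."""
    buffer = ""
    for chunk in chunks:
        for ch in chunk:
            if buffer.endswith(".") and ch.isspace():
                # An ASCII period counts as a sentence end only when whitespace
                # follows, so decimals such as "3.5" are not split.
                if _has_content(buffer):
                    yield buffer.strip()
                buffer = ""
                continue
            buffer += ch
            if ch in HARD_STOPS:
                if _has_content(buffer):
                    yield buffer.strip()
                buffer = ""
    # Flush whatever is left when the stream ends.
    if _has_content(buffer):
        yield buffer.strip()


def speak_stream(
    chunks: Iterable[str],
    synthesize: Callable[[str], bytes],  # hypothetical wrapper around TTS inference
) -> Iterator[bytes]:
    """Synthesize each sentence in order, yielding audio as soon as it is ready."""
    for sentence in iter_sentences(chunks):
        yield synthesize(sentence)


if __name__ == "__main__":
    # Stand-in for LLM output arriving in arbitrary pieces.
    llm_chunks = ["今天天气", "不错。我们去", "公园吧！Version 3.5 is out. Bye"]
    for sentence in iter_sentences(llm_chunks):
        print(sentence)
```

Because each sentence is handed over as a whole, the model still gets a passage of context per call, while playback of the first sentence can begin before the LLM has finished producing the rest.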