2noise / ChatTTS

A generative speech model for daily dialogue.
https://2noise.com
GNU Affero General Public License v3.0

Filler words such as “什么” and “就” are randomly inserted mid-sentence and the ending is truncated; inference is too slow with compile=True #689

Open Ziyi6 opened 3 months ago

Ziyi6 commented 3 months ago

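  1. With a text of 12 Chinese characters, the generated audio contains filler words such as “什么” ("what") and “就” ("just") at random positions mid-sentence, which makes the speech sound disfluent.
  2. The end of the sentence is truncated: the last character is dropped, or most of its sound is cut off so that only the first small part is pronounced. The text is again 12 Chinese characters.
  3. With compile set to True, inference is too slow: a 3-second audio clip takes more than 5 minutes to generate.

Could the maintainers please look into these issues? The GPU is an A100.

Code: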

import torch
import torchaudio
import numpy as np
import soundfile as sf 
torch._dynamo.config.cache_size_limit = 64
torch._dynamo.config.suppress_errors = True
torch.set_float32_matmul_precision('high')

import ChatTTS
from IPython.display import Audio

# Initialize and load the model: 
chat = ChatTTS.Chat()
chat.load(compile=False) # Set to True for better performance

# Define the text input for inference (Support Batching)
texts = [
    "So we found being competitive and collaborative was a huge way of staying motivated towards our goals, so one person to call when you fall off, one person who gets you back on then one person to actually do the activity with.",
    "海信小聚啊海信小聚啊海信小聚", "共青团爸爸海信小聚啊哈哈"]

# Perform inference; returns one waveform per input text
wavs = chat.infer(texts)

# Save each generated waveform as 24 kHz, 16-bit PCM
sf.write('/data_hdd/test_syth.wav', np.squeeze(wavs[0]), 24000, 'PCM_16')
sf.write('/data_hdd/test_syth_2.wav', np.squeeze(wavs[1]), 24000, 'PCM_16')
sf.write('/data_hdd/test_syth_3.wav', np.squeeze(wavs[2]), 24000, 'PCM_16')
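
To put the third point in numbers, the sketch below computes a real-time factor (synthesis time divided by audio duration) from the figures in the report. The helper names `audio_duration`, `real_time_factor`, and `timed` are hypothetical and not part of ChatTTS; the 24000 Hz sample rate is taken from the `sf.write` calls above.

```python
import time


def audio_duration(num_samples: int, sample_rate: int = 24000) -> float:
    """Duration in seconds of a mono waveform with num_samples samples."""
    return num_samples / sample_rate


def real_time_factor(synthesis_seconds: float, audio_seconds: float) -> float:
    """Seconds of compute spent per second of generated audio."""
    if audio_seconds <= 0:
        raise ValueError("audio_seconds must be positive")
    return synthesis_seconds / audio_seconds


def timed(fn, *args, **kwargs):
    """Call fn and return (result, elapsed wall-clock seconds)."""
    start = time.perf_counter()
    result = fn(*args, **kwargs)
    return result, time.perf_counter() - start


# Figures from the report: about 5 minutes to generate about 3 seconds of audio.
reported_rtf = real_time_factor(5 * 60, audio_duration(3 * 24000))
print(f"reported real-time factor: {reported_rtf:.0f}x")
```

Wrapping `chat.infer(texts)` in `timed` for two consecutive calls may help separate one-time compilation cost from steady-state speed, since `torch.compile` compiles lazily on first use.
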

fumiama commented 3 months ago

  1. This is by design. If you do not want it, you can turn off refine_text.
  2. This may be caused by the input text being too unusual.
  3. See #476.

If you only need inference, you can try vLLM on the dev branch, which gives a large speedup.
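
Turning off text refinement, as suggested in the first point of the reply, might look like the sketch below. It assumes the installed ChatTTS version accepts a `skip_refine_text` keyword on `Chat.infer`; that parameter name is an assumption to check against your version, and `infer_kwargs` and `synthesize` are hypothetical helper names used only for illustration.

```python
def infer_kwargs(skip_refine: bool = True) -> dict:
    """Keyword arguments for Chat.infer (parameter name is an assumption)."""
    return {"skip_refine_text": skip_refine}


def synthesize(texts, skip_refine: bool = True):
    """Load ChatTTS and synthesize texts, optionally skipping refinement."""
    import ChatTTS  # imported lazily so the helpers above work without it

    chat = ChatTTS.Chat()
    chat.load(compile=False)
    # With refinement skipped, the input text is passed to synthesis as written.
    return chat.infer(texts, **infer_kwargs(skip_refine))
```

Calling `synthesize(["海信小聚啊海信小聚啊海信小聚"])` and comparing the result against the default behaviour would show whether the inserted filler words come from the refinement step.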