PaddlePaddle / PaddleSpeech

Easy-to-use Speech Toolkit including Self-Supervised Learning model, SOTA/Streaming ASR with punctuation, Streaming TTS with text frontend, Speaker Verification System, End-to-End Speech Translation and Keyword Spotting. Won NAACL2022 Best Demo Award.
https://paddlespeech.readthedocs.io
Apache License 2.0
10.55k stars, 1.81k forks

[TTS] Severe stuttering during inference with mixed Chinese-English streaming speech synthesis #3642

Open hexianbin1994 opened 6 months ago

hexianbin1994 commented 6 months ago

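Following the speech synthesis example in examples/zh_en_tts/tts3, I downloaded the model files from the example and changed the relevant settings to the streaming TTS configuration. Streaming synthesis then works for some letters and words, but there are two problems:
1. Some letters are pronounced inaccurately, e.g. A, M, N, I, Z.
2. There is very noticeable stuttering.
How can this be solved?

conf file: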

# This is the parameter configuration file for streaming tts server.

#################################################################################
#                             SERVER SETTING                                    #
#################################################################################
host: 0.0.0.0
port: 8190

# The task format in the engine_list is: <speech task>_<engine type>
# engine_list choices = ['tts_online', 'tts_online-onnx'], the inference speed of tts_online-onnx is faster than tts_online.
# protocol choices = ['websocket', 'http'] 
protocol: 'websocket'
engine_list: ['tts_online']

#################################################################################
#                                ENGINE CONFIG                                  #
#################################################################################

################################### TTS #########################################
################### speech task: tts; engine_type: online #######################
tts_online: 
    # am (acoustic model) choices=['fastspeech2_csmsc', 'fastspeech2_cnndecoder_csmsc']   
    # fastspeech2_cnndecoder_csmsc support streaming am infer.     
    am: 'fastspeech2_mix'
    am_config: 'pretrain/fastspeech2_mix_ckpt_1.2.0/default.yaml'
    am_ckpt:  'pretrain/fastspeech2_mix_ckpt_1.2.0/snapshot_iter_99200.pdz'
    am_stat: 'pretrain/fastspeech2_mix_ckpt_1.2.0/speech_stats.npy'
    phones_dict: 'pretrain/fastspeech2_mix_ckpt_1.2.0/phone_id_map.txt'
    tones_dict: 
    speaker_dict: 'pretrain/fastspeech2_mix_ckpt_1.2.0/speaker_id_map.txt'

    # voc (vocoder) choices=['mb_melgan_csmsc', 'hifigan_csmsc']
    # Both mb_melgan_csmsc and hifigan_csmsc support streaming voc inference
    voc: 'hifigan_csmsc'
    voc_config: 'pretrain/hifigan_csmsc_ckpt_0.1.1/default.yaml'
    voc_ckpt: 'pretrain/hifigan_csmsc_ckpt_0.1.1/snapshot_iter_2500000.pdz'
    voc_stat: 'pretrain/hifigan_csmsc_ckpt_0.1.1/feats_stats.npy'

    # others
    lang: 'mix'
    device: 'cpu' # set 'gpu:id' or 'cpu'
    # am_block and am_pad apply only to the fastspeech2_cnndecoder_onnx model, for streaming AM inference;
    # with am_pad set to 12, streaming synthesis produces the same audio as non-streaming synthesis
    am_block: 72
    am_pad: 12
    # voc_block and voc_pad control streaming vocoder inference.
    # For mb_melgan_csmsc, voc_pad 14 makes streaming audio identical to non-streaming audio; the minimum usable pad is 7, at which streaming audio still sounds normal.
    # For hifigan_csmsc, voc_pad 19 makes streaming audio identical to non-streaming audio; with voc_pad 14, streaming audio still sounds normal.
    voc_block: 36
    voc_pad: 19

#################################################################################
#                                ENGINE CONFIG                                  #
#################################################################################

################################### TTS #########################################
################### speech task: tts; engine_type: online-onnx #######################
tts_online-onnx: 
    # am (acoustic model) choices=['fastspeech2_csmsc_onnx', 'fastspeech2_cnndecoder_csmsc_onnx']
    # fastspeech2_cnndecoder_csmsc_onnx support streaming am infer.        
    am: 'fastspeech2_cnndecoder_csmsc_onnx' 
    # am_ckpt is a list, if am is fastspeech2_cnndecoder_csmsc_onnx, am_ckpt = [encoder model, decoder model, postnet model];
    # if am is fastspeech2_csmsc_onnx, am_ckpt = [ckpt model];
    am_ckpt:   # list
    am_stat: 
    phones_dict: 
    tones_dict: 
    speaker_dict: 
    am_sample_rate: 24000
    am_sess_conf:
        device: "cpu" # set 'gpu:id' or 'cpu'
        use_trt: False
        cpu_threads: 4

    # voc (vocoder) choices=['mb_melgan_csmsc_onnx', 'hifigan_csmsc_onnx']
    # Both mb_melgan_csmsc_onnx and hifigan_csmsc_onnx support streaming voc inference
    voc: 'mb_melgan_csmsc_onnx'
    voc_ckpt: 
    voc_sample_rate: 24000
    voc_sess_conf:
        device: "cpu" # set 'gpu:id' or 'cpu'
        use_trt: False
        cpu_threads: 4

    # others
    lang: 'zh'
    # am_block and am_pad apply only to the fastspeech2_cnndecoder_onnx model, for streaming AM inference;
    # with am_pad set to 12, streaming synthesis produces the same audio as non-streaming synthesis
    am_block: 72
    am_pad: 12
    # voc_block and voc_pad control streaming vocoder inference.
    # For mb_melgan_csmsc_onnx, voc_pad 14 makes streaming audio identical to non-streaming audio; the minimum usable pad is 7, at which streaming audio still sounds normal.
    # For hifigan_csmsc_onnx, voc_pad 19 makes streaming audio identical to non-streaming audio; with voc_pad 14, streaming audio still sounds normal.
    voc_block: 36
    voc_pad: 14
    # voc_upsample should be same as n_shift on voc config.
    voc_upsample: 300

jobsjiang commented 3 months ago

[Screenshot: Snipaste_2024-03-06_10-05-55] With the same configuration I get the error shown. What is the cause?

Ankh-L commented 3 months ago

[Screenshot: Snipaste_2024-03-06_10-05-55] With the same configuration I get the error shown. What is the cause?

Has this been solved?

hexianbin1994 commented 3 months ago

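[Screenshot: Snipaste_2024-03-06_10-05-55] With the same configuration I get the error shown. What is the cause?

You need to modify the source code: check the part that loads the AM model and add the model named in your config file there.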

Ankh-L commented 3 months ago

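[Screenshot: Snipaste_2024-03-06_10-05-55] With the same configuration I get the error shown. What is the cause?

You need to modify the source code: check the part that loads the AM model and add the model named in your config file there.

Thanks for the reply. From the source code, only fastspeech2_csmsc and fastspeech2_cnndecoder are supported. When you say "check the part that loads the AM model and add the model named in your config file", do you mean forcibly adding the required model to the if/else?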

hexianbin1994 commented 3 months ago

For example, if the AM model is fastspeech2_mix, you have to add support for that model, like this: image
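
The image from this comment is not preserved here. As a purely hypothetical sketch of the kind of change being discussed, extending a list of accepted AM names could look like the following. `SUPPORTED_AM`, `PATCHED_AM` and `check_am` are invented for illustration and are not PaddleSpeech's actual source; only the model names come from this thread and the config comments.

```python
# Hypothetical illustration only: these names are NOT PaddleSpeech source code.
# The two entries mirror the am choices listed in the config comments.
SUPPORTED_AM = ["fastspeech2_csmsc", "fastspeech2_cnndecoder_csmsc"]


def check_am(am, supported=SUPPORTED_AM):
    """Reject acoustic model names that are not in the accepted list."""
    if am not in supported:
        raise ValueError(f"am {am!r} is not supported; choices: {supported}")
    return am


# "Adding support" in the sense discussed here: accept the mixed model as well.
PATCHED_AM = SUPPORTED_AM + ["fastspeech2_mix"]

check_am("fastspeech2_mix", PATCHED_AM)   # passes with the extended list
# check_am("fastspeech2_mix")             # would raise ValueError with the original list
```

Passing a name check of this kind only lets the model through the check; whether the rest of the engine can actually load and run the mixed model is a separate matter.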

Ankh-L commented 3 months ago

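I bypassed this check by modifying the source code, but the server still fails at startup: image. Could you tell me which versions of paddlepaddle and paddlespeech you are using?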

jianghuakun commented 1 month ago

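I bypassed this check by modifying the source code, but the server still fails at startup: image. Could you tell me which versions of paddlepaddle and paddlespeech you are using?

Did you get yours working? I get the same error as you.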

jianghuakun commented 1 month ago

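[Screenshot: Snipaste_2024-03-06_10-05-55] With the same configuration I get the error shown. What is the cause?

You need to modify the source code: check the part that loads the AM model and add the model named in your config file there.

Thanks for the reply. From the source code, only fastspeech2_csmsc and fastspeech2_cnndecoder are supported. When you say "check the part that loads the AM model and add the model named in your config file", do you mean forcibly adding the required model to the if/else?

Where should the change be made? Which file, under which path?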

jianghuakun commented 1 month ago
hifigan_csmsc

Has yours been solved?

jianghuakun commented 1 month ago

image

jianghuakun commented 1 month ago

image

image

How should this be modified?

jianghuakun commented 4 weeks ago

Solved: the server now starts normally. Contact me if you need the fix.