PaddlePaddle / PaddleSpeech

Easy-to-use Speech Toolkit including Self-Supervised Learning model, SOTA/Streaming ASR with punctuation, Streaming TTS with text frontend, Speaker Verification System, End-to-End Speech Translation and Keyword Spotting. Won NAACL2022 Best Demo Award.
https://paddlespeech.readthedocs.io
Apache License 2.0

[TTS] No audio when running streaming inference with mixed Chinese-English speech synthesis #3799

Open jianghuakun opened 5 months ago

jianghuakun commented 5 months ago

The yaml configuration file is as follows:

```yaml
# This is the parameter configuration file for streaming tts server.

#################################################################################
#                             SERVER SETTING                                    #
#################################################################################
host: 0.0.0.0
port: 8090

# The task format in the engine_list is: <speech task>_<engine type>
# engine_list choices = ['tts_online', 'tts_online-onnx'], the inference speed of tts_online-onnx is faster than tts_online.
# protocol choices = ['websocket', 'http']
protocol: 'websocket'
#engine_list: ['tts_online-onnx']
engine_list: ['tts_online']

#################################################################################
#                                ENGINE CONFIG                                  #
#################################################################################

################################### TTS #########################################
################### speech task: tts; engine_type: online #######################
tts_online:
    # am (acoustic model) choices=['fastspeech2_csmsc', 'fastspeech2_cnndecoder_csmsc']
    # fastspeech2_cnndecoder_csmsc supports streaming am infer.
    am: 'fastspeech2_mix'
    am_config:  #'/home/PaddleSpeech/examples/zh_en_tts/tts3/data/fastspeech2_mix_ckpt_1.2.0/default.yaml' #/root/.paddlespeech/models/fastspeech2_csmsc-zh/1.0/fastspeech2_nosil_baker_ckpt_0.4/default.yaml
    am_ckpt: #'/home/PaddleSpeech/examples/zh_en_tts/tts3/data/fastspeech2_mix_ckpt_1.2.0/snapshot_iter_99200.pdz'
    am_stat: #'/home/PaddleSpeech/examples/zh_en_tts/tts3/data/fastspeech2_mix_ckpt_1.2.0/speech_stats.npy'
    phones_dict: #'/home/PaddleSpeech/examples/zh_en_tts/tts3/data/fastspeech2_mix_ckpt_1.2.0/phone_id_map.txt'
    tones_dict:
    speaker_dict: #'/home/PaddleSpeech/examples/zh_en_tts/tts3/data/fastspeech2_mix_ckpt_1.2.0/speaker_id_map.txt'
    #spk_id: 175

    # voc (vocoder) choices=['mb_melgan_csmsc, hifigan_csmsc']
    # Both mb_melgan_csmsc and hifigan_csmsc support streaming voc inference
    voc: 'hifigan_csmsc'
    voc_config: #'/home/PaddleSpeech/examples/zh_en_tts/tts3/data/hifigan_csmsc_ckpt_0.1.1/default.yaml'
    voc_ckpt: #'/home/PaddleSpeech/examples/zh_en_tts/tts3/data/hifigan_csmsc_ckpt_0.1.1/snapshot_iter_1000000.pdz'
    voc_stat: #'/home/PaddleSpeech/examples/zh_en_tts/tts3/data/hifigan_csmsc_ckpt_0.1.1/feats_stats.npy'

    # others
    lang: 'mix'
    device: 'cpu' # set 'gpu:id' or 'cpu'
    # am_block and am_pad are only used by the fastspeech2_cnndecoder_onnx model for streaming am infer;
    # when am_pad is set to 12, streaming synthetic audio is the same as non-streaming synthetic audio
    am_block: 72
    am_pad: 12
    # voc_block and voc_pad are used by the voc model for streaming voc infer;
    # when the voc model is mb_melgan_csmsc, setting voc_pad to 14 makes streaming synthetic audio the same as non-streaming synthetic audio; with the minimum pad value of 7, streaming synthetic audio still sounds normal
    # when the voc model is hifigan_csmsc, setting voc_pad to 19 makes streaming synthetic audio the same as non-streaming synthetic audio; with voc_pad set to 14, streaming synthetic audio sounds normal
    voc_block: 36
    voc_pad: 19

#################################################################################
#                                ENGINE CONFIG                                  #
#################################################################################

################################### TTS #########################################
############### speech task: tts; engine_type: online-onnx ######################
tts_online-onnx:
    # am (acoustic model) choices=['fastspeech2_csmsc_onnx', 'fastspeech2_cnndecoder_csmsc_onnx']
    # fastspeech2_cnndecoder_csmsc_onnx supports streaming am infer.
    am: 'fastspeech2_csmsc_onnx'
    # am_ckpt is a list; if am is fastspeech2_cnndecoder_csmsc_onnx, am_ckpt = [encoder model, decoder model, postnet model];
    # if am is fastspeech2_csmsc_onnx, am_ckpt = [ckpt model];
    #am_config: 'fastspeech2_csmsc_onnx/fastspeech2_nosil_baker_ckpt_0.4/'
    am_ckpt: #['/root/.paddlespeech/models/fastspeech2_csmsc_onnx-zh/1.0/fastspeech2_csmsc_onnx_0.2.0/fastspeech2_csmsc.onnx'] #['/home/PaddleSpeech/examples/zh_en_tts/tts3/fastspeech2_cnndecoder_csmsc_onnx/fastspeech2_cnndecoder_csmsc_streaming_onnx_1.0.0/fastspeech2_csmsc_am_encoder_infer.onnx',
    #'/home/PaddleSpeech/examples/zh_en_tts/tts3/fastspeech2_cnndecoder_csmsc_onnx/fastspeech2_cnndecoder_csmsc_streaming_onnx_1.0.0/fastspeech2_csmsc_am_decoder.onnx',
    #'/home/PaddleSpeech/examples/zh_en_tts/tts3/fastspeech2_cnndecoder_csmsc_onnx/fastspeech2_cnndecoder_csmsc_streaming_onnx_1.0.0/fastspeech2_csmsc_am_postnet.onnx'] #'fastspeech2_csmsc_onnx/fastspeech2_nosil_baker_ckpt_0.4/snapshot_iter_76000.pdz'    # list
    am_stat: #'/home/PaddleSpeech/examples/zh_en_tts/tts3/fastspeech2_cnndecoder_csmsc_onnx/fastspeech2_cnndecoder_csmsc_streaming_onnx_1.0.0/speech_stats.npy' #'fastspeech2_csmsc_onn2_cnndecoder_csmsc_onnx/fastspeech2_cnndecoder_csmsc_streaming_onnx_1.0.0/speech_stats.npy' #/fastspeech2_nosil_baker_ckpt_0.4/speech_stats.npy'
    phones_dict: #'/root/.paddlespeech/models/fastspeech2_csmsc_onnx-zh/1.0/fastspeech2_csmsc_onnx_0.2.0/phone_id_map.txt' #'/home/PaddleSpeech/examples/zh_en_tts/tts3/fastspeech2_cnndecoder_csmsc_onnx/fastspeech2_cnndecoder_csmsc_streaming_onnx_1.0.0/phone_id_map.txt' #'fastspeech2_csmsc_onnx/fastspeech2_nosil_baker_ckpt_0.4/phone_id_map.txt'
    tones_dict:
    speaker_dict: #'fastspeech2_csmsc_onnx/fastspeech2_nosil_baker_ckpt_0.4/'
    am_sample_rate: 24000
    am_sess_conf:
        device: "cpu" # set 'gpu:id' or 'cpu'
        use_trt: False
        cpu_threads: 12

    # voc (vocoder) choices=['mb_melgan_csmsc_onnx, hifigan_csmsc_onnx']
    # Both mb_melgan_csmsc_onnx and hifigan_csmsc_onnx support streaming voc inference
    voc: 'mb_melgan_csmsc_onnx'
    voc_ckpt:
    voc_sample_rate: 24000
    voc_sess_conf:
        device: "cpu" # set 'gpu:id' or 'cpu'
        use_trt: False
        cpu_threads: 12

    # others
    lang: 'zh'
    # am_block and am_pad are only used by the fastspeech2_cnndecoder_onnx model for streaming am infer;
    # when am_pad is set to 12, streaming synthetic audio is the same as non-streaming synthetic audio
    am_block: 72
    am_pad: 12
    # voc_block and voc_pad are used by the voc model for streaming voc infer;
    # when the voc model is mb_melgan_csmsc_onnx, setting voc_pad to 14 makes streaming synthetic audio the same as non-streaming synthetic audio; with the minimum pad value of 7, streaming synthetic audio still sounds normal
    # when the voc model is hifigan_csmsc_onnx, setting voc_pad to 19 makes streaming synthetic audio the same as non-streaming synthetic audio; with voc_pad set to 14, streaming synthetic audio sounds normal
    voc_block: 36
    voc_pad: 14
    # voc_upsample should be same as n_shift on voc config.
    voc_upsample: 300
```
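Not part of the issue, but a minimal sanity-check sketch for the file above: it assumes the config is saved as `conf/tts_online_application.yaml` (the path is a placeholder) and that PyYAML is installed, and simply prints the settings the server would actually pick up, so a mismatch between `engine_list` and the engine block that was edited is easy to spot.

```python
# Sanity-check sketch: parse the streaming TTS server config and print the
# settings of each engine listed in engine_list.
# "conf/tts_online_application.yaml" is a placeholder path; adjust as needed.
import yaml

CONFIG_PATH = "conf/tts_online_application.yaml"

with open(CONFIG_PATH, "r", encoding="utf-8") as f:
    cfg = yaml.safe_load(f)

print("protocol   :", cfg["protocol"])
print("engine_list:", cfg["engine_list"])

# Show the acoustic model, vocoder, and language of each active engine block.
for engine in cfg["engine_list"]:
    engine_cfg = cfg[engine]
    print(f"[{engine}] am={engine_cfg['am']}, voc={engine_cfg['voc']}, "
          f"lang={engine_cfg['lang']}, device={engine_cfg.get('device', 'n/a')}")
```

For the config above this should report the `tts_online` engine with `am=fastspeech2_mix` and `lang=mix`; if it reports the onnx engine instead, the fastspeech2_mix settings are not the ones being served.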

The mix branch added to tts_engine.py:

```python
        # am
        elif am_dataset == "mix":
            # hard-coded speaker id for the multi-speaker mix model;
            # note: spk_id is prepared here but not passed to am_inference below
            spk_id = 174
            spk_id = [spk_id]
            mel = self.executor.am_inference(part_phone_ids)
            if first_flag == 1:
                first_am_et = time.time()
                self.first_am_infer = first_am_et - frontend_et

            # voc streaming
            mel_chunks = get_chunks(mel, self.voc_block, self.voc_pad, "voc")
            voc_chunk_num = len(mel_chunks)
            voc_st = time.time()
            for i, mel_chunk in enumerate(mel_chunks):
                sub_wav = self.executor.voc_inference(mel_chunk)
                sub_wav = self.depadding(sub_wav, voc_chunk_num, i,
                                         self.voc_block, self.voc_pad,
                                         self.voc_upsample)
                if first_flag == 1:
                    first_voc_et = time.time()
                    self.first_voc_infer = first_voc_et - first_am_et
                    self.first_response_time = first_voc_et - frontend_st
                    first_flag = 0

                yield sub_wav
```

The mix model was also added to the other conditional checks.
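As an editorial aside rather than part of the original report: when the streaming client saves a wav but nothing is audible, it helps to check whether the file is genuinely silent or simply not playing back. A minimal diagnostic sketch, assuming the client wrote `output.wav` (placeholder filename) and that `numpy` and `soundfile` are installed:

```python
# Diagnostic sketch: check whether a synthesized wav actually contains audio.
# "output.wav" is a placeholder for whatever file the streaming client saved.
import numpy as np
import soundfile as sf

wav, sr = sf.read("output.wav")
wav = np.asarray(wav, dtype=np.float32)

duration = len(wav) / sr if sr else 0.0
peak = float(np.max(np.abs(wav))) if wav.size else 0.0
rms = float(np.sqrt(np.mean(wav ** 2))) if wav.size else 0.0

print(f"sample rate: {sr} Hz, duration: {duration:.2f} s, samples: {wav.size}")
print(f"peak amplitude: {peak:.6f}, RMS: {rms:.6f}")

# Rough interpretation:
# - zero samples or a near-zero peak  -> the server really produced empty/silent audio
# - a normal peak but nothing audible -> more likely a playback or sample-rate issue
```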

Ray961123 commented 5 months ago

Hello, thanks for your interest in the PaddleSpeech open-source project, and sorry for the poor development experience. The project currently has limited maintenance manpower, so you could try to resolve this yourself by modifying the PaddleSpeech source code, or ask other developers in the open-source community for help. PaddlePaddle open-source community channel: 飞桨AI Studio星河社区 (the PaddlePaddle AI Studio community for AI learning and hands-on training).

jianghuakun commented 5 months ago

> Hello, thanks for your interest in the PaddleSpeech open-source project, and sorry for the poor development experience. The project currently has limited maintenance manpower, so you could try to resolve this yourself by modifying the PaddleSpeech source code, or ask other developers in the open-source community for help. PaddlePaddle open-source community channel: 飞桨AI Studio星河社区 (the PaddlePaddle AI Studio community for AI learning and hands-on training).

No one over there has replied either.