espnet / espnet_onnx

ONNX wrapper for ESPnet inference models
MIT License

How to include stats.h5 of PWG Vocoder during ONNX conversion for TTS #94

Open anirpipi opened 1 year ago

anirpipi commented 1 year ago

Hi, I am trying to convert a pretrained LJSpeech TTS model, based on kan-bayashi/ljspeech_fastspeech2 and parallel_wavegan/ljspeech_parallel_wavegan.v1, to ONNX using the code below:

########################### ONNX Conversion ############################

from espnet2.bin.tts_inference import Text2Speech
from espnet_onnx.export import TTSModelExport

m = TTSModelExport()

# Acoustic model (FastSpeech2) checkpoint and training config
tag_exp = "exp/tts_train_fastspeech2_raw_phn_tacotron_g2p_en_no_space/train.loss.ave_5best.pth"
train_config = "exp/tts_train_fastspeech2_raw_phn_tacotron_g2p_en_no_space/config.yaml"

# Vocoder (Parallel WaveGAN) checkpoint and config
vocoder_tag = 'parallel_wavegan.v1/checkpoint-400000steps.pkl'
vocoder_config = 'parallel_wavegan.v1/config.yml'

text2speech = Text2Speech.from_pretrained(
    train_config=train_config,
    model_file=tag_exp,
    vocoder_file=vocoder_tag,
    vocoder_config=vocoder_config,
    speed_control_alpha=1.0,
    always_fix_seed=False,
)

# Export the model under this tag name, with quantization enabled
tag_name = 'ljspeech_pretrained'
m.export(text2speech, tag_name, quantize=True)

########################### Inference ############################

from espnet_onnx import Text2Speech
import soundfile
import numpy as np
import time

# tag_name is the tag used at export time ('ljspeech_pretrained')
text2speech = Text2Speech(tag_name)

text = 'hello world!'
# Run synthesis and take the waveform from the returned dict
wav = text2speech(text)['wav']

soundfile.write("ljspeech_pretrained_test.wav", wav, 22050, "PCM_16")

######################################################################

After synthesis, the audio quality is very low. I noticed that the exported ONNX folder does not contain the stats.h5 file from the PWG vocoder folder. Contents of ~/.cache/espnet_onnx/ljspeech_pretrained/: config.yaml feats_stats.npz full quantize
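The check described above can be scripted. The following is a minimal sketch that lists every file under an export directory whose name contains "stats"; the helper name `find_stats_files` is hypothetical and not part of espnet_onnx, and the directory path is the one quoted in this post.

```python
from pathlib import Path


def find_stats_files(export_dir):
    """Return relative paths of files under export_dir whose name contains 'stats'.

    Hypothetical helper for illustration; not part of espnet_onnx.
    """
    root = Path(export_dir).expanduser()
    if not root.is_dir():
        return []
    return sorted(
        str(p.relative_to(root))
        for p in root.rglob("*")
        if p.is_file() and "stats" in p.name
    )


# Prints an empty list if the directory does not exist on this machine.
print(find_stats_files("~/.cache/espnet_onnx/ljspeech_pretrained"))
```

If only feats_stats.npz shows up, the vocoder statistics were not copied during export.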

Can anyone please help me include stats.h5 during inference with espnet_onnx?
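For context, here is a rough sketch of what stats.h5 is typically used for. It assumes the file follows the ParallelWaveGAN convention of storing two datasets named "mean" and "scale", and that the vocoder expects mel features normalized as (mel - mean) / scale. Both helper functions are hypothetical and only illustrate the arithmetic; they do not show how espnet_onnx handles the file internally. If this normalization is skipped, degraded audio is one plausible outcome, although that has not been confirmed as the cause here.

```python
import numpy as np


def load_pwg_stats(stats_path):
    """Read mean and scale vectors from a ParallelWaveGAN-style stats.h5.

    Assumption: the file stores datasets named "mean" and "scale".
    """
    import h5py  # imported lazily so the rest of the sketch runs without it

    with h5py.File(stats_path, "r") as f:
        mean = f["mean"][()]
        scale = f["scale"][()]
    return mean.astype(np.float32), scale.astype(np.float32)


def normalize_mel(mel, mean, scale):
    """Normalize a (frames, n_mels) mel spectrogram per mel bin."""
    return (mel - mean) / scale


# Toy example with made-up statistics for a 3-bin "mel" (not real data).
mean = np.array([1.0, 2.0, 3.0], dtype=np.float32)
scale = np.array([2.0, 2.0, 2.0], dtype=np.float32)
mel = np.array([[1.0, 4.0, 3.0], [3.0, 2.0, 7.0]], dtype=np.float32)
normalized = normalize_mel(mel, mean, scale)
```

With a real model, `load_pwg_stats("parallel_wavegan.v1/stats.h5")` would supply `mean` and `scale`, assuming stats.h5 sits next to the vocoder checkpoint used above.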

Masao-Someki commented 1 year ago

Hi @anirpipi, sorry for the late reply, and thank you for reporting the issue. It may be a bug, so I would like to look into it. It seems you are using your own trained model. Can you confirm that the issue also happens with the published models? If it is reproducible, I will download the model and investigate.

anirpipi commented 1 year ago

Hi, thanks for the response. It is the same with the pre-trained models as well. VITS works fine, but the problem occurs with FastSpeech2 + PWG. Could you please look into it? Thanks in advance.