I am pretty new to ESPnet and I am attempting to perform inference using the vctk_tts_train_xvector_transformer_raw_phn_tacotron_g2p_en_no_space_train.loss.ave pretrained model.
Steps Taken:
I used speechbrain/spkrec-xvect-voxceleb to create speaker embeddings for specific voices.
I provided one of these embeddings to the pretrained TTS model.
The problem is that the generated audio clips are extremely short (0.125 s or 0.013 s) and sound like pure noise.
I am using the Python API. I only provided the text and spembs arguments when calling the Text2Speech instance. I have successfully used the Python API with other pretrained models that do not require speaker embeddings, so I am unsure whether additional arguments or steps are required when using this specific model with speaker embeddings.
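For reference, here is a minimal sketch of the pipeline I described above. The reference WAV file name and the exact embedding tensor shape are my assumptions; please correct me if I am passing spembs in the wrong form.

```python
# Sketch of my inference steps (model identifiers are the ones named above;
# the file name and tensor shapes are assumptions on my part).
import numpy as np
import torch


def prepare_spembs(xvector: torch.Tensor) -> np.ndarray:
    """Flatten speechbrain's batched x-vector (e.g. shape (1, 1, 512))
    into the flat (512,) numpy array passed to Text2Speech as `spembs`."""
    return xvector.squeeze().cpu().numpy()


if __name__ == "__main__":
    import soundfile as sf
    import torchaudio
    from speechbrain.pretrained import EncoderClassifier
    from espnet2.bin.tts_inference import Text2Speech

    # 1) Extract an x-vector from a reference utterance of the target speaker.
    classifier = EncoderClassifier.from_hparams(
        source="speechbrain/spkrec-xvect-voxceleb"
    )
    signal, fs = torchaudio.load("reference_speaker.wav")  # hypothetical file
    xvector = classifier.encode_batch(signal)

    # 2) Synthesize with the pretrained VCTK x-vector Transformer model,
    #    providing only text and the speaker embedding.
    tts = Text2Speech.from_pretrained(
        "kan-bayashi/vctk_tts_train_xvector_transformer_raw_phn_tacotron_g2p_en_no_space_train.loss.ave"
    )
    out = tts("Hello world", spembs=prepare_spembs(xvector))
    sf.write("out.wav", out["wav"].numpy(), tts.fs)
```

This is exactly the flow that produces the very short, noisy outputs for me.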
If more information is needed, I am happy to provide it. Has anyone experienced a similar issue or can provide guidance on how to resolve this?
Thank you for your assistance,