Problem with very short and noisy audio during inference when providing xvector embeddings

Hello,

I am fairly new to ESPnet and am trying to run inference with the pretrained model vctk_tts_train_xvector_transformer_raw_phn_tacotron_g2p_en_no_space_train.loss.ave.

Steps Taken:

I used speechbrain/spkrec-xvect-voxceleb to create speaker embeddings for specific voices.
I provided one of these embeddings to the pretrained TTS model.
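For anyone trying to reproduce this, here is a minimal sketch of how the embedding could be sanity-checked before it is passed as spembs. It rests on two assumptions that should be checked against the actual setup: that the extractor returns the embedding with extra batch dimensions (for example shape (1, 1, 512)), and that the TTS model expects a single 512-dimensional vector. The helper name prepare_spemb and the default expected_dim=512 are hypothetical and not part of ESPnet or SpeechBrain. Note that a matching shape does not guarantee that embeddings from a different extractor lie in the space the TTS model was trained on.

```python
import numpy as np


def prepare_spemb(embedding, expected_dim=512):
    """Flatten one speaker embedding to a 1-D float32 vector and check its size.

    expected_dim=512 is an assumption; use the value from the model's config.
    A torch tensor should be converted first, e.g. emb.detach().cpu().numpy().
    """
    emb = np.asarray(embedding, dtype=np.float32).squeeze()
    if emb.ndim != 1:
        # More than one non-singleton axis means several embeddings were passed.
        raise ValueError(f"expected a single embedding, got shape {emb.shape}")
    if emb.shape[0] != expected_dim:
        raise ValueError(
            f"embedding has {emb.shape[0]} dims, expected {expected_dim}"
        )
    return emb


# Stand-in for an extractor output with batch dimensions (assumed layout).
batch_like = np.zeros((1, 1, 512), dtype=np.float32)
print(prepare_spemb(batch_like).shape)  # (512,)
```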

The problem is that the generated audio is extremely short (0.125 or 0.013 seconds) and sounds noisy.
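To put those durations in perspective, they can be converted into a number of acoustic frames. The sketch below assumes a 24 kHz sampling rate and a hop size of 300 samples; both values are assumptions and should be replaced with the ones in the model's config. Under those assumptions the outputs are only about 10 frames and 1 frame long, which would be consistent with the decoder stopping almost immediately rather than with a vocoder or playback problem.

```python
def frames_from_duration(duration_s, fs=24000, hop=300):
    """Approximate number of feature frames behind a waveform of this length.

    fs and hop are assumed values, not read from the model config.
    """
    return round(duration_s * fs / hop)


for duration in (0.125, 0.013):
    print(duration, "s ->", frames_from_duration(duration), "frame(s)")
```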

I am using the Python API. I passed only the text and spembs arguments when calling the Text2Speech object. I have also used the Python API successfully with other pretrained models that do not require speaker embeddings, so I am unsure whether this specific model needs additional arguments or steps when speaker embeddings are involved.
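For reference, the call pattern described above looks roughly like the sketch below. The helper synthesize_with_spemb is hypothetical. It takes an already constructed Text2Speech object and assumes that the returned dictionary contains a "wav" entry and that the object exposes the sampling rate as fs. Reporting the duration this way makes it easy to confirm the problem on another machine.

```python
import numpy as np


def synthesize_with_spemb(tts, text, spemb):
    """Run TTS with a speaker embedding and return (waveform, duration in s).

    `tts` is assumed to behave like espnet2's Text2Speech: callable with
    text and spembs, returning a dict with "wav", and exposing `fs`.
    """
    out = tts(text, spembs=spemb)
    wav = out["wav"]
    if hasattr(wav, "detach"):
        # torch tensor -> NumPy array on CPU
        wav = wav.detach().cpu().numpy()
    wav = np.asarray(wav, dtype=np.float32).reshape(-1)
    return wav, len(wav) / tts.fs


# Usage sketch (not executed here; model_tag is whatever tag the model
# is published under in the model zoo):
# from espnet2.bin.tts_inference import Text2Speech
# tts = Text2Speech.from_pretrained(model_tag)
# wav, seconds = synthesize_with_spemb(tts, "Hello world.", spemb)
# print(seconds)
```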

If more information is needed, I am happy to provide it. Has anyone experienced a similar issue, or can anyone offer guidance on how to resolve it?

Thank you for your assistance,

espnet / espnet_model_zoo

Problem with very short and noisy audio during inference when providing xvector embeddings #80