espnet / espnet_model_zoo

ESPnet Model Zoo
Apache License 2.0

Problem with very short and noisy audio during inference when providing xvector embeddings #80

Open Phlayne opened 3 days ago

Phlayne commented 3 days ago

Hello,

I am pretty new to ESPnet and I am attempting to perform inference using the vctk_tts_train_xvector_transformer_raw_phn_tacotron_g2p_en_no_space_train.loss.ave pretrained model.

Steps Taken:

The problem is that the generated audio clips are extremely short (0.125 or 0.013 seconds) and sound noisy.

I am using the Python API, and I provided only the text and spembs fields when calling the Text2Speech class. I have also successfully used the Python API with other pretrained models that do not require speaker embeddings. I am unsure whether this specific model needs additional arguments or steps when it is used with speaker embeddings.
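
For reference, a call of the kind described above might look like the following sketch. It is not the exact code from this report. It assumes the ESPnet2 `Text2Speech.from_pretrained` interface, a `model_tag` string for the pretrained model (the full tag is not given here, so it is left as a parameter), and a single utterance-level embedding passed as a 1-D vector. The helpers `duration_seconds` and `embedding_shape` are hypothetical names added only to make the symptoms easy to check: the first converts a sample count into seconds, the second reports the shape of the embedding so a batched array is easy to spot.

```python
def synthesize_with_xvector(text, xvector, model_tag, vocoder_tag=None):
    """Synthesize `text` conditioned on one speaker embedding.

    Returns (wav, fs). Needs espnet, espnet_model_zoo and torch installed.
    """
    # Lazy imports so the helpers below also work without ESPnet installed.
    import torch
    from espnet2.bin.tts_inference import Text2Speech

    tts = Text2Speech.from_pretrained(model_tag=model_tag, vocoder_tag=vocoder_tag)
    # Assumption: the model expects one 1-D float vector per utterance.
    spembs = torch.as_tensor(xvector, dtype=torch.float32)
    with torch.no_grad():
        out = tts(text, spembs=spembs)
    return out["wav"], tts.fs


def duration_seconds(num_samples, fs):
    """Length of a waveform in seconds from its sample count and sampling rate."""
    if fs <= 0:
        raise ValueError("sampling rate must be positive")
    return num_samples / fs


def embedding_shape(xvector):
    """Shape of an array-like or nested list, as a tuple."""
    shape = getattr(xvector, "shape", None)
    if shape is not None:
        return tuple(shape)
    dims = []
    while isinstance(xvector, (list, tuple)):
        dims.append(len(xvector))
        xvector = xvector[0] if xvector else None
    return tuple(dims)
```

With these helpers, `duration_seconds(len(wav), fs)` gives the clip length to compare against the values reported above, and `embedding_shape(xvector)` shows whether the embedding is a flat vector or still carries extra batch dimensions from the extractor.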

If more information is needed, I am happy to provide it. Has anyone experienced a similar issue or can provide guidance on how to resolve this?

Thank you for your assistance,

sw005320 commented 3 days ago

Maybe I'm wrong, but the model might use the Kaldi xvector, not the SpeechBrain one.