NVIDIA / OpenSeq2Seq

Toolkit for efficient experimentation with Speech Recognition, Text2Speech and NLP
https://nvidia.github.io/OpenSeq2Seq
Apache License 2.0

Tacotron-2 GST expected result #505

Open · raymond00000 opened this issue 4 years ago

raymond00000 commented 4 years ago

Hi,

I downloaded the checkpoint from here: https://nvidia.github.io/OpenSeq2Seq/html/speech-synthesis.html#speech-synthesis and followed the tutorial to generate an example audio clip.

My understanding is that the checkpoint was trained on the M-AILABS dataset. According to Section 7.2 of the paper, "to synthesize with a specific speaker's voice, we can simply feed audio from that speaker as a reference signal." In other words, the GST acts as a speaker embedding, so at inference time I should be able to supply audio from a new English female speaker to clone her voice.
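
For context, here is a minimal numpy sketch of the GST mechanism as I understand it from the paper; it is not OpenSeq2Seq's implementation, and all names and dimensions (`num_tokens`, `token_dim`, the pooling stand-in for the reference encoder, etc.) are made up purely for illustration:

```python
# Toy sketch of Global Style Tokens (GST), not the OpenSeq2Seq code:
# a reference encoder summarizes a reference mel spectrogram into a query,
# the query attends over a bank of learned style tokens, and the
# attention-weighted sum is the style embedding that conditions the decoder.
import numpy as np

rng = np.random.default_rng(0)

num_tokens, token_dim, query_dim = 10, 256, 128

# Learned parameters (random here; learned jointly with Tacotron-2 in practice).
style_tokens = rng.standard_normal((num_tokens, token_dim))
W_query = rng.standard_normal((80, query_dim))   # stand-in for the reference encoder
W_key = rng.standard_normal((token_dim, query_dim))

def style_embedding(reference_mel):
    """reference_mel: [time, 80] mel spectrogram of the reference audio."""
    # Stand-in for the reference encoder: mean-pool over time, then project.
    query = reference_mel.mean(axis=0) @ W_query          # [query_dim]
    keys = style_tokens @ W_key                           # [num_tokens, query_dim]
    scores = keys @ query / np.sqrt(query_dim)            # [num_tokens]
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()                              # softmax attention
    return weights @ style_tokens                         # [token_dim]

# Different reference speakers should yield different attention weights,
# and therefore different style embeddings.
ref_mel = rng.standard_normal((200, 80))
print(style_embedding(ref_mel).shape)  # (256,)
```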

Here are my questions:

1. Is my understanding correct?
2. I supplied an English female audio clip, but the output is still a male voice. Is it because the female speaker was not seen during training?
3. What is the difference between the generated "infer_mag.wav" and "infer.wav"?

Thanks!

blisc commented 4 years ago

1) Our repo is a re-implementation of the paper, so I cannot speak to any claims made by the paper. In theory, that is how we hope Tacotron-2 GST should work; in practice, it is very dependent on your training data.

2) Yes, I highly doubt that our Tacotron-2 GST will generalize to speakers outside the training set.

3) infer_mag.wav is the Griffin-Lim reconstruction of the linear/magnitude spectrogram. infer.wav is the Griffin-Lim reconstruction of the mel spectrogram, which is first converted to a linear spectrogram via a matmul with the mel basis. infer_mag.wav should in general sound better than infer.wav.
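
Roughly, the difference looks like the sketch below. This is not the repo's exact code, it just uses librosa to illustrate the two paths; the pseudo-inverse of the mel basis and the `sr`/`n_fft`/`hop_length`/`n_mels` values are assumptions, not the checkpoint's actual settings.

```python
# Illustration of the two outputs:
#   infer_mag.wav -> Griffin-Lim on the predicted linear/magnitude spectrogram
#   infer.wav     -> predicted mel spectrogram mapped back to a linear
#                    spectrogram through the mel filterbank, then Griffin-Lim
import numpy as np
import librosa

sr, n_fft, hop_length, n_mels = 22050, 1024, 256, 80  # assumed values

def wav_from_linear(linear_mag):
    """linear_mag: [n_fft // 2 + 1, time] magnitude spectrogram (infer_mag.wav path)."""
    return librosa.griffinlim(linear_mag, n_iter=50, hop_length=hop_length)

def wav_from_mel(mel):
    """mel: [n_mels, time] mel spectrogram (infer.wav path)."""
    mel_basis = librosa.filters.mel(sr=sr, n_fft=n_fft, n_mels=n_mels)  # [n_mels, n_fft//2+1]
    # Approximate inverse of the mel projection; this loses spectral detail,
    # which is why infer.wav generally sounds worse than infer_mag.wav.
    linear_mag = np.maximum(1e-10, np.linalg.pinv(mel_basis) @ mel)
    return librosa.griffinlim(linear_mag, n_iter=50, hop_length=hop_length)
```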