NVIDIA / OpenSeq2Seq

Toolkit for efficient experimentation with Speech Recognition, Text2Speech and NLP
https://nvidia.github.io/OpenSeq2Seq
Apache License 2.0

Tacotron-2 GST expected result #505

Open · raymond00000 opened this issue 4 years ago

raymond00000 commented 4 years ago

Hi,

I downloaded the checkpoint from here: https://nvidia.github.io/OpenSeq2Seq/html/speech-synthesis.html#speech-synthesis and followed the tutorial to generate an example audio clip.

My understanding is that the checkpoint was trained on the M-AILABS dataset. According to Section 7.2 of the paper, "to synthesize with a specific speaker's voice, we can simply feed audio from that speaker as a reference signal." In other words, the GST acts as a speaker embedding, so at inference time I should be able to supply audio from a new English female speaker to clone her voice.
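
For context, here is a minimal numpy sketch of the GST mechanism as I understand it from the paper; it is not OpenSeq2Seq's implementation, and all names and dimensions (`num_tokens`, `token_dim`, the pooling stand-in for the reference encoder, etc.) are made up purely for illustration:

```python
# Toy sketch of Global Style Tokens (GST), not the OpenSeq2Seq code:
# a reference encoder summarizes a reference mel spectrogram into a query,
# the query attends over a bank of learned style tokens, and the
# attention-weighted sum is the style embedding that conditions the decoder.
import numpy as np

rng = np.random.default_rng(0)

num_tokens, token_dim, query_dim = 10, 256, 128

# Learned parameters (random here; learned jointly with Tacotron-2 in practice).
style_tokens = rng.standard_normal((num_tokens, token_dim))
W_query = rng.standard_normal((80, query_dim))   # stand-in for the reference encoder
W_key = rng.standard_normal((token_dim, query_dim))

def style_embedding(reference_mel):
    """reference_mel: [time, 80] mel spectrogram of the reference audio."""
    # Stand-in for the reference encoder: mean-pool over time, then project.
    query = reference_mel.mean(axis=0) @ W_query          # [query_dim]
    keys = style_tokens @ W_key                           # [num_tokens, query_dim]
    scores = keys @ query / np.sqrt(query_dim)            # [num_tokens]
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()                              # softmax attention
    return weights @ style_tokens                         # [token_dim]

# Different reference speakers should yield different attention weights,
# and therefore different style embeddings.
ref_mel = rng.standard_normal((200, 80))
print(style_embedding(ref_mel).shape)  # (256,)
```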

Here are my questions:

1. Is my understanding correct?
2. I supplied an English female audio clip, but the output is still a male voice. Is it because the female speaker was not seen during training?
3. What is the difference between the generated "infer_mag.wav" and "infer.wav"?

Thanks!

blisc commented 4 years ago

1) Our repo is a re-implementation of the paper, so I cannot speak to any claims made by the paper. In theory, that is how we hope Tacotron-2 GST should work; in practice, it is very dependent on your training data.

2) Yes, I highly doubt that our Tacotron-2 GST will generalize to speakers outside the training set.

3) infer_mag.wav is the Griffin-Lim reconstruction of the linear/magnitude spectrogram. infer.wav is the Griffin-Lim reconstruction of the mel spectrogram, which is first converted to a linear spectrogram via a matmul with the mel basis. infer_mag.wav should in general sound better than infer.wav.
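
Roughly, the difference looks like the sketch below. This is not the repo's exact code, it just uses librosa to illustrate the two paths; the pseudo-inverse of the mel basis and the `sr`/`n_fft`/`hop_length`/`n_mels` values are assumptions, not the checkpoint's actual settings.

```python
# Illustration of the two outputs:
#   infer_mag.wav -> Griffin-Lim on the predicted linear/magnitude spectrogram
#   infer.wav     -> predicted mel spectrogram mapped back to a linear
#                    spectrogram through the mel filterbank, then Griffin-Lim
import numpy as np
import librosa

sr, n_fft, hop_length, n_mels = 22050, 1024, 256, 80  # assumed values

def wav_from_linear(linear_mag):
    """linear_mag: [n_fft // 2 + 1, time] magnitude spectrogram (infer_mag.wav path)."""
    return librosa.griffinlim(linear_mag, n_iter=50, hop_length=hop_length)

def wav_from_mel(mel):
    """mel: [n_mels, time] mel spectrogram (infer.wav path)."""
    mel_basis = librosa.filters.mel(sr=sr, n_fft=n_fft, n_mels=n_mels)  # [n_mels, n_fft//2+1]
    # Approximate inverse of the mel projection; this loses spectral detail,
    # which is why infer.wav generally sounds worse than infer_mag.wav.
    linear_mag = np.maximum(1e-10, np.linalg.pinv(mel_basis) @ mel)
    return librosa.griffinlim(linear_mag, n_iter=50, hop_length=hop_length)
```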