raymond00000 opened 4 years ago
1) Our repo is a re-implementation of the paper, so I cannot speak to any claims made by the paper. In theory, that is how we hope Tacotron-2 GST should work. In practice, it is very dependent on your training data.
2) Yes, I highly doubt that our Tacotron-2 GST will generalize to speakers outside the training set.
3) infer_mag.wav is the Griffin-Lim reconstruction of the linear/magnitude spectrogram. infer.wav is the Griffin-Lim reconstruction of the mel spectrogram, which is first converted to a linear spectrogram via a matmul with the mel basis. infer_mag.wav should in general sound better than infer.wav.
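For anyone wondering what the mel-to-linear matmul looks like, here is a minimal NumPy sketch. The function name `mel_to_linear` and the use of a pseudo-inverse of the mel basis are my own illustration of the general idea (a random non-negative matrix stands in for a real mel filter bank), not necessarily how OpenSeq2Seq implements it internally:

```python
import numpy as np

def mel_to_linear(mel_spec, mel_basis):
    # mel_basis: (n_mels, n_freq) filter bank that projected linear -> mel.
    # Invert the projection approximately with the Moore-Penrose pseudo-inverse,
    # then clamp to a small positive floor since magnitudes cannot be negative.
    inv_basis = np.linalg.pinv(mel_basis)          # (n_freq, n_mels)
    return np.maximum(1e-10, inv_basis @ mel_spec)  # (n_freq, n_frames)

# Toy stand-in for a real mel filter bank (e.g. librosa.filters.mel would
# give a proper one); shapes chosen to mimic 80 mels over a 1024-point FFT.
rng = np.random.default_rng(0)
mel_basis = rng.random((80, 513))   # (n_mels, n_freq)
mel_spec = rng.random((80, 100))    # (n_mels, n_frames)
linear = mel_to_linear(mel_spec, mel_basis)
```

The recovered linear spectrogram is only an approximation (the mel projection discards information), which is one reason infer_mag.wav, reconstructed from the directly predicted magnitude spectrogram, tends to sound better.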
Hi,
I downloaded the checkpoint from here: https://nvidia.github.io/OpenSeq2Seq/html/speech-synthesis.html#speech-synthesis I followed the tutorial to generate an example audio.
My understanding is that the checkpoint was trained on the M-AILABS dataset. According to the paper, Section 7.2: "to synthesize with a specific speaker's voice, we can simply feed audio from that speaker as a reference signal." The GST then acts as the speaker embedding. So at inference time, I can supply audio from a new English female speaker to clone that speaker's voice.
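To make the mechanism concrete: in the GST paper, a reference encoder summarizes the reference audio into one vector, which then attends over a small bank of learned style tokens; the attention-weighted sum of the tokens is the style (here, speaker-like) embedding that conditions the decoder. A simplified single-head NumPy sketch, with hypothetical shapes (10 tokens, 256 dimensions) and random values standing in for learned parameters:

```python
import numpy as np

def softmax(x):
    # numerically stable softmax over a 1-D score vector
    e = np.exp(x - x.max())
    return e / e.sum()

rng = np.random.default_rng(0)
style_tokens = rng.standard_normal((10, 256))   # learned token bank (hypothetical shapes)
ref_encoding = rng.standard_normal(256)         # reference-encoder summary of the audio

# Attention of the reference encoding over the token bank:
scores = style_tokens @ ref_encoding            # (10,) similarity per token
weights = softmax(scores)                       # attention weights, sum to 1
style_embedding = weights @ style_tokens        # (256,) conditioning vector
```

Since the embedding is always a combination of tokens learned from the training speakers, a reference from an unseen speaker can only be expressed as a mixture of those training voices, which is consistent with the maintainer's answer above.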
Here are my questions: (1) Is my understanding correct? (2) I supplied an English female reference audio, but the output is still a male voice. Is it because the female speaker was not seen during training? (3) What is the difference between the generated "infer_mag.wav" and "infer.wav"?
Thanks!