NVIDIA / OpenSeq2Seq

Toolkit for efficient experimentation with Speech Recognition, Text2Speech and NLP
https://nvidia.github.io/OpenSeq2Seq
Apache License 2.0

Choppy generation using pre-trained tacotron-gst model checkpoint #536


astricks commented 4 years ago

Hi,

I am using the pre-trained tacotron-gst checkpoint for speech generation (mag output) and getting choppy generated audio, as someone else noted here. My inference output files are here.

I'm running inference in an NVIDIA TensorFlow Docker container. Here are my inference logs.

The text I am trying to generate is from the M-AILABS dataset itself. My inference file contains the one line below:

en_US/by_book/female/judy_bieber/the_master_key/wavs/the_master_key_10_f000002|UNUSED|How Rob Served a Mighty King.
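
For reference, here is my reading of that pipe-delimited line as a minimal sketch (not code from the OpenSeq2Seq data layer): the first field is the path of the source wav inside M-AILABS, the middle field appears unused for this inference mode, and the last field is the text to synthesize.

```python
# My reading of the three pipe-delimited fields; a sketch, not OpenSeq2Seq code.
line = ("en_US/by_book/female/judy_bieber/the_master_key/wavs/"
        "the_master_key_10_f000002|UNUSED|How Rob Served a Mighty King.")
wav_path, unused, text = line.split("|")
print(wav_path)  # source wav in M-AILABS (presumably the GST style reference)
print(text)      # the sentence to synthesize
```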

If I understand correctly, the provided checkpoint has been trained on the M-AILABS dataset, which means it has seen this particular sentence/audio pair.

  1. Is sample_step0_0_infer_mag.wav the quality to be expected?
  2. Can I swap out Griffin-Lim and use WaveNet to improve the audio quality? (See the Griffin-Lim sketch after this list.)
  3. Can you please share some Tacotron-GST audio samples you have generated (I found only the non-GST Tacotron samples in the docs), so that we know what to expect? My expectations are set by the Google Tacotron team's audio samples on their webpage.
  4. In short: is there any way to tell (from the output spectrogram image, perhaps) what is causing the low-quality generation, and what to change to improve it: the model, the vocoder, or both? (A quick inspection sketch follows this list.)
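
In the meantime, here is a minimal sketch of the two checks I mean in points 2 and 4: plotting the predicted magnitude spectrogram and re-running Griffin-Lim with more iterations. This is not OpenSeq2Seq code; the .npy file name, the (frames, freq_bins) orientation, the hop/window lengths, and the 22050 Hz sample rate are all assumptions that would need to match the model's data-layer config.

```python
# A minimal sketch, not OpenSeq2Seq code. Assumes the predicted magnitude
# spectrogram was dumped to a .npy file (the file name below is hypothetical)
# and that n_fft/hop_length/sample rate match the model's data-layer config.
import numpy as np
import librosa
import soundfile as sf
import matplotlib.pyplot as plt

mag = np.load("sample_step0_0_infer_mag.npy")  # hypothetical dump of the mag output

# Tacotron decoders typically emit (frames, freq_bins); librosa wants
# (freq_bins, frames). Drop the transpose if the dump is already oriented that way.
mag = mag.T

# Check 1 (question 4): look at the spectrogram. Smeared or repeated frames
# point at the model; clean harmonics with buzzy audio point at the vocoder.
plt.imshow(np.log1p(mag), origin="lower", aspect="auto")
plt.xlabel("frame")
plt.ylabel("frequency bin")
plt.title("predicted magnitude spectrogram")
plt.savefig("infer_mag.png", dpi=150)

# Check 2 (question 2): re-run Griffin-Lim with many more iterations than a
# default inference pass might use. If quality improves, the vocoder is at
# least part of the problem. If the model emits log-scale magnitudes, undo
# that scaling before calling griffinlim.
audio = librosa.griffinlim(mag, n_iter=100, hop_length=256, win_length=1024)
sf.write("infer_gl100.wav", audio, 22050)  # sample rate is an assumption
```

If higher n_iter does not help, more Griffin-Lim iterations cannot recover detail the spectrogram never had, which is why swapping in WaveNet (question 2) would be the natural next step.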
astricks commented 4 years ago

I'd really appreciate any advice on this.