[Question] Tacotron or Tacotron GST output spectrograms to condition Wavenet

Hi,

Has anyone tried using Wavenet instead of Griffin-Lim for waveform reconstruction during inference with the Tacotron or Tacotron GST models? I'm trying to figure out whether there is an easy way to do this, or whether one would need to implement a proper decoder for Wavenet to replace the default 'FakeDecoder'.

Outline of my recipe:

1) Train a Tacotron or Tacotron GST model to predict spectrograms from text or from phoneme sequences.
2) Train a separate Wavenet model conditioned on spectrograms.
3) (This is what I'm trying to figure out.) Feed the spectrogram outputs of (1) into the Wavenet model from (2) to synthesize speech.

It wasn't clear to me whether the current implementation supports (3) simply by changing config files, so any guidance or ideas would be highly appreciated.
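To make step (3) concrete, here is a minimal NumPy sketch of the shape problem involved: the spectrogram has one vector per frame, while an autoregressive vocoder needs one conditioning vector per audio sample. This is not OpenSeq2Seq code and does not use its API. The function name `upsample_conditioning`, the hop length of 256, and the 80-bin random "spectrogram" are all assumptions for illustration, and simple frame repetition stands in for whatever upsampling the Wavenet model actually uses.

```python
import numpy as np


def upsample_conditioning(spec, hop_length):
    """Repeat each spectrogram frame hop_length times along the time axis.

    spec: array of shape (n_frames, n_bins), e.g. a predicted spectrogram.
    Returns an array of shape (n_frames * hop_length, n_bins), i.e. one
    conditioning vector per audio sample.
    """
    if spec.ndim != 2:
        raise ValueError("expected a 2-D (n_frames, n_bins) spectrogram")
    return np.repeat(spec, hop_length, axis=0)


# Stand-in for a spectrogram predicted in step (1): random values, not real output.
rng = np.random.default_rng(0)
predicted_spec = rng.standard_normal((5, 80)).astype(np.float32)

# Hypothetical hop length; it has to match the one used to extract the
# spectrograms that the vocoder in step (2) was trained on.
HOP_LENGTH = 256

conditioning = upsample_conditioning(predicted_spec, HOP_LENGTH)
print(conditioning.shape)  # (1280, 80)
```

For this to work in practice, the features predicted in (1) would presumably need the same type, number of bins, and normalization as the spectrograms used to train the model in (2); otherwise the conditioning input at inference would not match what the vocoder saw during training.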

Thanks!

Oytun

NVIDIA / OpenSeq2Seq

[Question] Tacotron or Tacotron GST output spectrograms to condition Wavenet #395