Rayhane-mamah / Tacotron-2

DeepMind's Tacotron-2 Tensorflow implementation
MIT License

Is the mel spectrogram range right? #159

Closed Yeongtae closed 6 years ago

Yeongtae commented 6 years ago

When I looked at some mel spectrograms, I noticed something odd: the mel spectrogram values fall in the range [0, 4].

Here are 3 mel spectrogram files: 0819.zip

@Rayhane-mamah is this right? There seems to be a problem with the range of the predicted mel values, and it affects the predicted wav values at synthesis time.

Or is this only a problem on my end, caused by my own mistakes?

Rayhane-mamah commented 6 years ago

Hey @Yeongtae, yes, that's actually normal. Consult the TensorBoard histograms to see the distributions of the output mels and the target mels.

The main idea is that we scale our targets to [0, 4] and use a padding value of -0.1 to explicitly model padding as something different from normal silence, while still being plausible as just more silence (an energy lower than silence can only be interpreted as more silence). 4 is rarely hit even on real data, and some information is clipped at 0 when it falls below the minimum allowed energy (that part is most likely noise).
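To make the scaling concrete, here is a minimal sketch of the kind of normalization described above (not necessarily this repo's exact code); `min_level_db` and `max_abs_value` are assumed hyperparameter values:

```python
import numpy as np

# Assumed values, not necessarily the repo's defaults.
min_level_db = -100.0   # floor of the usable dB range
max_abs_value = 4.0     # upper bound of the target range

def normalize_mel(S_db):
    """Map a dB-scaled mel spectrogram to [0, 4].

    Energy below the minimum allowed level (most likely noise) would map
    below 0 and is clipped away; 4 corresponds to the maximum allowed energy.
    """
    scaled = max_abs_value * ((S_db - min_level_db) / (-min_level_db))
    return np.clip(scaled, 0.0, max_abs_value)
```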

During training, the model has no explicit limiter preventing it from producing values lower than 0 (or higher than 4, for that matter). The output projections are linear transformations, and the same applies to the post-processing network (5 conv layers with tanh in between and no activation on the last layer). The model therefore tends to make small negative predictions, which also play the role of silence.
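For illustration, a rough sketch of a post-processing network with the shape described above (5 conv layers, tanh on the first 4, no activation on the last); channel counts and kernel size are assumptions, not this repo's hyperparameters:

```python
import tensorflow as tf

def build_postnet(num_mels=80, channels=512, kernel_size=5):
    """Sketch of a 5-layer conv postnet; the last layer is purely linear,
    so nothing constrains its outputs to stay inside [0, 4]."""
    layers = [
        tf.keras.layers.Conv1D(channels, kernel_size, padding='same',
                               activation='tanh')
        for _ in range(4)
    ]
    # Final projection back to mel channels, no activation (linear output).
    layers.append(tf.keras.layers.Conv1D(num_mels, kernel_size,
                                         padding='same', activation=None))
    return tf.keras.Sequential(layers)
```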

As long as WaveNet is trained on those same unclipped GTA features, this causes no problem whatsoever during synthesis. Even if the model is trained on ground truth, some small negative values shouldn't be a problem if it is trained right; for safety, one might add explicit clipping to [0, 4] prior to synthesis when the WaveNet is trained on ground truth.
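If you do want that safety clipping before feeding predicted mels to a ground-truth-trained WaveNet, a minimal sketch (the file name is hypothetical):

```python
import numpy as np

mel = np.load('predicted_mel.npy')      # hypothetical predicted spectrogram
mel_clipped = np.clip(mel, 0.0, 4.0)    # force predictions back into [0, 4]
```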

As for my personal take, those small Tacotron "mistakes" (which are normal in any regression model) carry some information about the Tacotron model's behavior, so I prefer to keep them on the WaveNet side, to have a chance of adapting the upsampling network to Tacotron's behavior (reducing the cross-model accumulated error by actually training WaveNet on those errors, hoping it adapts to them). The best-case scenario would be to train T2 completely end to end (from text to speech), using the same back-propagation to update all parameters together. Sadly, GPUs can't hold such a big model (WaveNet would have to synthesize speech of multiple seconds instead of small parts of it). I believe Baidu discusses how this gives better results in their ClariNet paper. As a consequence, I prefer to keep a natural flow between the models with very minimal human intervention (at most, a rescaling between the supposedly [0, 4] range and [0, 1], without clipping).
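A minimal sketch of that rescaling step, assuming the vocoder expects features in [0, 1] (the file name is hypothetical):

```python
import numpy as np

# Rescale the supposedly-[0, 4] Tacotron outputs to [0, 1] without clipping,
# so small negative "silence" predictions survive and WaveNet can adapt to them.
mel = np.load('gta_mel.npy')   # hypothetical GTA feature file
mel_rescaled = mel / 4.0
```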

In conclusion, this shouldn't cause as big a problem as you think during synthesis (unless the Tacotron model is really off, making completely absurd predictions, which usually isn't the case). To verify this further, you can visualize both the original and the GTA mels using our plot functions, or check the distributions in TensorBoard. There shouldn't be many differences other than in the high frequencies and the silence ranges; the most important speech information should be modeled correctly. Overall, a well-trained Tacotron model should produce spectrograms that are visually consistent with the real ones.
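For a quick visual check outside TensorBoard, a hedged sketch for plotting a ground-truth mel next to its GTA counterpart (file names are hypothetical; the repo's own plot utilities can be used instead):

```python
import numpy as np
import matplotlib.pyplot as plt

gt = np.load('mel-ground_truth.npy')   # shape: (frames, num_mels), hypothetical file
gta = np.load('mel-gta.npy')           # matching GTA (teacher-forced) prediction

fig, axes = plt.subplots(2, 1, figsize=(10, 6), sharex=True)
axes[0].imshow(gt.T, aspect='auto', origin='lower')
axes[0].set_title('Ground truth mel')
axes[1].imshow(gta.T, aspect='auto', origin='lower')
axes[1].set_title('GTA mel (should look visually consistent)')
plt.tight_layout()
plt.show()
```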

Notes:

I hope this answers most of your questions. If not, please feel free to ask anything else that seems to be causing problems with synthesis. Of course, if you have a different insight, or ideas for improving the overall model quality, I am always open to improvements :)