Rayhane-mamah / Tacotron-2

DeepMind's Tacotron-2 Tensorflow implementation
MIT License

Is the mel spectrogram range right? #159

Closed Yeongtae closed 6 years ago

Yeongtae commented 6 years ago

When I looked at some mel spectrograms, I noticed something odd: the mel spectrogram values fall in the range [0, 4].

Here are 3 mel spectrogram files: 0819.zip

@Rayhane-mamah is this right? There seems to be a problem with the range of the predicted mel values, and it affects the predicted wav values at synthesis time.

Or is this only a problem on my end, caused by my own mistakes?

Rayhane-mamah commented 6 years ago

Hey @Yeongtae, yes, that's actually normal. Consult the TensorBoard histograms to see the distributions of the output mels and the target mels.

The main idea is that we scale our targets to [0, 4] and use a padding value of -0.1 to explicitly model padding as something different from normal silence, while still being plausible as just more silence (an energy lower than silence can only be interpreted as more silence). 4 is rarely hit even on real data, and some information is clipped at 0 when it falls below the minimum allowed energy (that part is most likely noise).
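To make the scaling concrete, here is a minimal sketch of the kind of normalization described above (not necessarily this repo's exact code); `min_level_db` and `max_abs_value` are assumed hyperparameter values:

```python
import numpy as np

# Assumed values, not necessarily the repo's defaults.
min_level_db = -100.0   # floor of the usable dB range
max_abs_value = 4.0     # upper bound of the target range

def normalize_mel(S_db):
    """Map a dB-scaled mel spectrogram to [0, 4].

    Energy below the minimum allowed level (most likely noise) would map
    below 0 and is clipped away; 4 corresponds to the maximum allowed energy.
    """
    scaled = max_abs_value * ((S_db - min_level_db) / (-min_level_db))
    return np.clip(scaled, 0.0, max_abs_value)
```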

During training, the model has no explicit limiter preventing it from producing values lower than 0 (or higher than 4, for that matter). The output projections are linear transformations, and the same applies to the post-processing network (5 conv layers with tanh in between and no activation on the last layer). The model therefore tends to make small negative predictions, which also play the role of silence.
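For illustration, a rough sketch of a post-processing network with the shape described above (5 conv layers, tanh on the first 4, no activation on the last); channel counts and kernel size are assumptions, not this repo's hyperparameters:

```python
import tensorflow as tf

def build_postnet(num_mels=80, channels=512, kernel_size=5):
    """Sketch of a 5-layer conv postnet; the last layer is purely linear,
    so nothing constrains its outputs to stay inside [0, 4]."""
    layers = [
        tf.keras.layers.Conv1D(channels, kernel_size, padding='same',
                               activation='tanh')
        for _ in range(4)
    ]
    # Final projection back to mel channels, no activation (linear output).
    layers.append(tf.keras.layers.Conv1D(num_mels, kernel_size,
                                         padding='same', activation=None))
    return tf.keras.Sequential(layers)
```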

As long as WaveNet is trained on those same unclipped GTA features, this causes no problem whatsoever during synthesis. Even if the model is trained on ground truth, some small negative values shouldn't be a problem if it is trained right; for safety, one might add explicit clipping to [0, 4] prior to synthesis when the WaveNet is trained on ground truth.
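If you do want that safety clipping before feeding predicted mels to a ground-truth-trained WaveNet, a minimal sketch (the file name is hypothetical):

```python
import numpy as np

mel = np.load('predicted_mel.npy')      # hypothetical predicted spectrogram
mel_clipped = np.clip(mel, 0.0, 4.0)    # force predictions back into [0, 4]
```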

As for my personal take, those small Tacotron "mistakes" (which are normal in any regression model) carry some information about the Tacotron model's behavior, so I prefer to keep them on the WaveNet side, to have a chance of adapting the upsampling network to Tacotron's behavior (reducing the cross-model accumulated error by actually training WaveNet on those errors, hoping it adapts to them). The best-case scenario would be to train T2 completely end to end (from text to speech), using the same back-propagation to update all parameters together. Sadly, GPUs can't hold such a big model (WaveNet would have to synthesize speech of multiple seconds instead of small parts of it). I believe Baidu discusses how this gives better results in their ClariNet paper. As a consequence, I prefer to keep a natural flow between the models with very minimal human intervention (at most, a rescaling between the supposedly [0, 4] range and [0, 1], without clipping).
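A minimal sketch of that rescaling step, assuming the vocoder expects features in [0, 1] (the file name is hypothetical):

```python
import numpy as np

# Rescale the supposedly-[0, 4] Tacotron outputs to [0, 1] without clipping,
# so small negative "silence" predictions survive and WaveNet can adapt to them.
mel = np.load('gta_mel.npy')   # hypothetical GTA feature file
mel_rescaled = mel / 4.0
```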

In conclusion, this shouldn't cause as big a problem as you think during synthesis (unless the Tacotron model is really off, making completely absurd predictions, which usually isn't the case). To verify this further, you can visualize both the original and the GTA mels using our plot functions, or check the distributions in TensorBoard. There shouldn't be many differences other than in the high frequencies and the silence ranges; the most important speech information should be modeled correctly. Overall, a well-trained Tacotron model should produce spectrograms that are visually consistent with the real ones.
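For a quick visual check outside TensorBoard, a hedged sketch for plotting a ground-truth mel next to its GTA counterpart (file names are hypothetical; the repo's own plot utilities can be used instead):

```python
import numpy as np
import matplotlib.pyplot as plt

gt = np.load('mel-ground_truth.npy')   # shape: (frames, num_mels), hypothetical file
gta = np.load('mel-gta.npy')           # matching GTA (teacher-forced) prediction

fig, axes = plt.subplots(2, 1, figsize=(10, 6), sharex=True)
axes[0].imshow(gt.T, aspect='auto', origin='lower')
axes[0].set_title('Ground truth mel')
axes[1].imshow(gta.T, aspect='auto', origin='lower')
axes[1].set_title('GTA mel (should look visually consistent)')
plt.tight_layout()
plt.show()
```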

Notes:

I hope this answers most of your questions. If not, please feel free to ask anything else that seems to be causing problems with synthesis. Of course, if you have a different insight, or ideas for improving the overall model quality, I am always open to improvements :)