Rayhane-mamah / Tacotron-2

DeepMind's Tacotron-2 Tensorflow implementation
MIT License

Rescaling WAV before training? #69

Closed ghost closed 6 years ago

ghost commented 6 years ago

Hi, I don't understand why the WAV is rescaled during preprocessing, like this:

if hparams.rescale:
    wav = wav / np.abs(wav).max() * hparams.rescaling_max

Could you tell me why, or is there any keyword I could search via Google? Thank you so much.

harlyh commented 6 years ago

I've been looking at your question for a while now, and since nobody has replied to you in the last two weeks, I'm gonna try my best to help you.

But first I just wanted to tell you that it's unclear to me whether your question is about rescaling in general or about the particular method above... I'm not a native English speaker, so I might be misinterpreting your question.

Anyway, if it's the general question about rescaling, my understanding is that we want all training data to be normalised so that it uses the maximum dynamic range and (I'm going to use an extremely simplified explanation here) so that during training the model "hears" the same word in different sentences at the same audio level. Normalisation and/or compression are among the most commonly used techniques for bringing varying speech phrases to the same loudness level, if you know what I mean.
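
To make that concrete, here is a minimal sketch of peak normalisation in the same spirit as the snippet from the question. The function name `peak_normalize`, the default `rescaling_max=0.999`, the guard for silent clips, and the synthetic sine clips are all assumptions for illustration, not code taken from the repo. It shows a quiet and a loud recording of the same signal ending up at the same level:

```python
import numpy as np

def peak_normalize(wav, rescaling_max=0.999):
    """Scale a waveform so its largest absolute sample equals rescaling_max.

    Hypothetical helper; rescaling_max=0.999 is an assumed default.
    """
    peak = np.abs(wav).max()
    if peak == 0:
        # Silent clip: nothing to scale, and this avoids dividing by zero.
        return wav
    return wav / peak * rescaling_max

# Two synthetic "recordings" of the same tone at different volumes.
t = np.linspace(0, 1, 16000, endpoint=False)
quiet = 0.1 * np.sin(2 * np.pi * 220 * t)
loud = 0.8 * np.sin(2 * np.pi * 220 * t)

quiet_n = peak_normalize(quiet)
loud_n = peak_normalize(loud)

# After normalisation both clips share the same peak level.
print(np.abs(quiet_n).max(), np.abs(loud_n).max())
```

Note that this only matches peak levels. It is not the same as matching perceived loudness, which is where compression or loudness normalisation would come in.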

Rayhane-mamah commented 6 years ago

Just going to add something to @harlyh's brilliant answer (thanks for that)

We rescale because WaveNet training assumes that wavs are in [-1, 1] when computing the mixture loss. This mainly comes from the PixelCNN implementation.
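
As a rough illustration of why a range assumption like this matters, the sketch below uses a simple uniform binning of [-1, 1]. This is a simplified stand-in and not the repo's actual loss; `to_bin_index` and `num_classes=256` are hypothetical names and values chosen for the example. Samples inside [-1, 1] land in valid bins, while samples outside the range fall off either end:

```python
import numpy as np

def to_bin_index(x, num_classes=256):
    """Map samples assumed to lie in [-1, 1] to integer bins 0..num_classes-1.

    Hypothetical uniform binning, used only to illustrate the range assumption.
    """
    return np.round((x + 1.0) / 2.0 * (num_classes - 1)).astype(np.int64)

in_range = np.array([-1.0, -0.5, 0.0, 0.5, 1.0])
out_of_range = np.array([-1.5, 2.0])

idx_ok = to_bin_index(in_range)
idx_bad = to_bin_index(out_of_range)

# Valid bins are 0..255; anything else means the range assumption was violated.
print("in range:    ", idx_ok)
print("out of range:", idx_bad)
```

Rescaling every wav by its own peak before training guarantees that no sample violates the assumed range.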

hope this answers your question :)