PaddlePaddle / Parakeet

PAddle PARAllel text-to-speech toolKIT (supporting Tacotron2, Transformer TTS, FastSpeech2/FastPitch, SpeedySpeech, WaveFlow and Parallel WaveGAN)

need help understanding waveflow: permutations on height dimension #41

Closed ghost closed 3 years ago

ghost commented 3 years ago

Hello, this is great work! I'm having trouble understanding the part of WaveFlow that applies a permutation on the height dimension. Doesn't this break the causality of the conv2d over the height dimension? Any insights would be appreciated!

iclementine commented 3 years ago

The WaveFlow model trains on folded audio clips. It reshapes audio of shape (T,) into shape (T/h, h) and transposes it into shape (h, w), where w = T/h.
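
The folding described above can be sketched in NumPy. This is a minimal illustration, not Parakeet's actual code: the helper names `fold` and `unfold` are hypothetical, and it assumes T is divisible by h.

```python
import numpy as np

def fold(audio, h):
    """Fold a 1-D clip of length T into an (h, w) array, w = T // h.

    Hypothetical helper for illustration; assumes T is divisible by h.
    """
    T = audio.shape[0]
    assert T % h == 0, "T must be divisible by h"
    w = T // h
    # (T,) -> (w, h) -> transpose -> (h, w)
    return audio.reshape(w, h).T

def unfold(z):
    """Inverse of fold: (h, w) -> (T,)."""
    return z.T.reshape(-1)

audio = np.arange(12)   # stand-in for a clip with T = 12
z = fold(audio, h=4)    # shape (4, 3)
print(z)
# [[ 0  4  8]
#  [ 1  5  9]
#  [ 2  6 10]
#  [ 3  7 11]]
```

Neighbouring samples in time end up adjacent along the height axis, so being causal in height means being causal over short-range time steps within each column.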

So the WaveFlow model synthesizes autoregressively, but it needs only h steps for each flow.

WaveFlow consists of several autoregressive flows. Every flow is autoregressive (causal) in height, so in this sense the permutation does not break causality.

flow[i] takes z[i] (shape (h, w)) as input and produces z[i+1] (shape (h, w)) as output. It is autoregressive in the height dimension.

As for the whole model, we transform z flow by flow: z[0] -> z[1] -> z[2] -> ... -> z[n]. At each flow, we transform z[i] into z[i+1] step by step in height. z[2] has to be generated after the whole of z[1] is generated. So for a WaveFlow model with N flows, you need h * N steps to generate the audio.
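
The h * N count can be checked with a toy loop. This is only a sketch under stated assumptions: `toy_flow` is a hypothetical stand-in in which a simple sum over already generated rows replaces the real conv2d network, and the permutation is assumed to be a reversal of the height axis after each flow.

```python
import numpy as np

def toy_flow(z_in, reverse=True):
    """One toy autoregressive flow over height (hypothetical stand-in).

    Row r of the output depends on row r of the input and on output
    rows < r only, so the flow is causal in height.
    """
    h, _ = z_in.shape
    z_out = np.zeros_like(z_in)
    steps = 0
    for r in range(h):
        context = z_out[:r].sum(axis=0)   # rows generated so far
        z_out[r] = z_in[r] + 0.5 * context
        steps += 1
    if reverse:
        z_out = z_out[::-1]               # assumed permutation: flip height
    return z_out, steps

def synthesize(z0, n_flows):
    """Run the flows one after another: z[0] -> z[1] -> ... -> z[n]."""
    z, total_steps = z0, 0
    for _ in range(n_flows):
        z, steps = toy_flow(z)
        total_steps += steps
    return z, total_steps

h, w, n_flows = 4, 3, 3
z0 = np.random.default_rng(0).standard_normal((h, w))
audio_folded, total_steps = synthesize(z0, n_flows)
print(total_steps)  # 12, i.e. h * n_flows
```

Each flow is causal on its own input, which is why permuting between flows leaves the per-flow causality intact.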

We do not go step by step in the height dimension across all flows, that is, generating the first row of every z, then the second row of every z, and so on. (If no permutation along the h dimension were used, WaveFlow could actually synthesize in this manner.)
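
One way to see why the permutation rules out the row-by-row schedule is to track when each row could be ready at the earliest. The function below is pure dependency bookkeeping, not a model. `finish_time` is a hypothetical helper, and it assumes that each row costs one step, that a row can start once its input row and the previous output row of the same flow are ready, and that the permutation is a reversal of the height axis.

```python
def finish_time(h, n_flows, reverse):
    """Earliest step at which the last row of the final z can be ready."""
    ready = [0] * h                  # all rows of z[0] are available at time 0
    for _ in range(n_flows):
        out, prev = [], 0
        for r in range(h):
            # needs input row r and output row r - 1 of this flow
            prev = max(ready[r], prev) + 1
            out.append(prev)
        # with reversal, the next flow's first input row is this flow's last output row
        ready = out[::-1] if reverse else out
    return max(ready)

print(finish_time(h=4, n_flows=3, reverse=True))   # 12 = h * N
print(finish_time(h=4, n_flows=3, reverse=False))  # 6  = h + N - 1
```

With the reversal, the first row a flow needs is the last row its predecessor produces, so each flow has to wait for the whole previous z. Without it, row r of one flow only needs row r of the previous flow, which is what makes the row-by-row order possible in principle.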

ClariNet uses a similar trick, which turns out to improve the quality.

Did I make it clear?

ghost commented 3 years ago

Crystal clear. Thank you!