Closed ghost closed 3 years ago
The WaveFlow model trains on folded audio clips. It folds audio of shape `(T,)` into shape `(T/h, h)` and transposes it into `(h, w)`, where `w = T/h`.
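The folding can be sketched with NumPy (a minimal toy illustration, assuming `T` is divisible by `h`):

```python
import numpy as np

T, h = 12, 3          # toy sizes; real models use much larger T
w = T // h            # width after folding

x = np.arange(T, dtype=np.float32)   # audio of shape (T,)
folded = x.reshape(w, h)             # fold into shape (T/h, h)
z = folded.T                         # transpose to (h, w)

print(z.shape)  # (3, 4)
```

Each row of `z` then contains every h-th sample, so moving down the height dimension moves through adjacent samples in time.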
So the WaveFlow model synthesizes in an autoregressive manner, but needs only `h` steps for each flow.
WaveFlow consists of several autoregressive flows. Every flow is autoregressive (causal) along the height dimension, so in this sense causality is not broken.
Flow `i` takes input `z[i]` of shape `(h, w)` and produces output `z[i+1]` of shape `(h, w)`; it is autoregressive in the height dimension.
As for the whole model, we transform z flow by flow: z[0] -> z[1] -> z[2] -> ... -> z[n]. Within each flow, we transform z[i] into z[i+1] step by step along the height dimension, so z[2] can only be generated after the whole of z[1] has been generated. Therefore, for a WaveFlow model with N flows, you need `h * N` steps to generate the audio.
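The step count above can be seen from the loop structure alone. Here is a schematic sketch, where `flow_step` is a hypothetical stand-in for one flow's per-row transform (not the actual WaveFlow computation):

```python
import numpy as np

def flow_step(z, row):
    """Hypothetical placeholder for one flow's per-row transform.
    A real WaveFlow flow would condition row `row` on rows 0..row-1."""
    out = z.copy()
    out[row] = z[row]  # identity placeholder for the real transform
    return out

h, w, N = 4, 8, 3               # toy height, width, and number of flows
z = np.zeros((h, w), dtype=np.float32)

steps = 0
for i in range(N):              # transform z flow by flow: z[0] -> ... -> z[N]
    for row in range(h):        # each flow is autoregressive along height
        z = flow_step(z, row)
        steps += 1

print(steps)  # h * N = 12
```

The key point is the nesting: the row loop for flow `i+1` cannot start until flow `i` has produced its entire output, which is why the total is `h * N` rather than `h`.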
To be clear, we are not interleaving the flows row by row, i.e., first generating the first row of every z, then the second row of every z, and so on. We do not work in that manner. (If no permutation along the height dimension were used, WaveFlow could actually synthesize in that interleaved manner.)
ClariNet uses a similar trick, which has been shown to improve quality.
Did I make it clear?
Crystal clear. Thank you!
Hello, this is great work! I'm having trouble understanding the part of WaveFlow where it uses a permutation on the height dimension. Doesn't this break the causality of the conv2d over the height dimension? Any insights would be appreciated!