NVIDIA / waveglow

A Flow-based Generative Network for Speech Synthesis
BSD 3-Clause "New" or "Revised" License

Aligning conditioning information and squeezed input waveform #193

Open adrienchaton opened 4 years ago

adrienchaton commented 4 years ago

Hello, I read your paper and code, as well as the WaveFlow paper, and I have a question please.

Let's consider for simplicity a batch of squeezed audio x=(N,H,W) and a corresponding upsampled conditioning signal c of the same shape.

x is passed through the invertible convolution, then split into (x_a, x_b), and the affine transform of x_b is predicted from (x_a, c).
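To make the shapes concrete, here is a minimal numpy sketch of that step: an orthogonal H x H matrix stands in for the invertible 1x1 convolution, and `wn_stub` is a hypothetical placeholder for the WN network that produces the affine parameters (log_s, t); none of these names come from the actual repo.

```python
import numpy as np

rng = np.random.default_rng(0)
N, H, W = 2, 8, 100                      # batch, squeeze (height), time frames
x = rng.standard_normal((N, H, W))       # squeezed audio
c = rng.standard_normal((N, H, W))       # upsampled conditioning, same shape

# invertible 1x1 convolution: one H x H weight shared across all time steps
Q = np.linalg.qr(rng.standard_normal((H, H)))[0]   # orthogonal, hence invertible
z = np.einsum('ij,njw->niw', Q, x)

# split along the squeeze (height) dimension
z_a, z_b = z[:, :H // 2], z[:, H // 2:]

def wn_stub(x_a, cond):
    # hypothetical stand-in for WN: anything mapping (x_a, c) -> (log_s, t)
    h = np.tanh(x_a + cond[:, H // 2:])
    return 0.1 * h, 0.2 * h

log_s, t = wn_stub(z_a, c)
z_b_out = np.exp(log_s) * z_b + t        # affine transform of x_b
```

Because the transform of z_b is affine with parameters that depend only on (z_a, c), it is exactly invertible: z_b = (z_b_out - t) * exp(-log_s).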

In WaveFlow, they use a reverse operation instead of mixing channels with the invertible convolution. And they note that the conditioning should be permuted and split accordingly in order to keep the height dimension aligned with x.

Does it make sense to you to pass c through the same invertible convolution and split it into (c_a, c_b), so that the affine transform would be predicted from (x_a, c_a)? And for inference, likewise pass the conditioning signal through the inverted convolution?

Otherwise, it appears to me that the conditioning is more or less "averaged" over the squeeze dimension (height).

Thanks for your hints and work

rafaelvalle commented 4 years ago

In the setup you propose, the affine parameters that modify c are independent of x. This is not very different from what is done in WaveGlow, which is a more "expressive" transformation than permuting the dimensions.

In WaveGlow, the fold/unfold operation is applied to both the conditioning signal and the audio, as you can see here. This guarantees that the conditioning signal and the audio are temporally aligned.
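The point can be checked with a small numpy sketch (the `squeeze` helper below is a hypothetical reimplementation of PyTorch's `unfold(1, n_group, n_group)` plus permute, not code from the repo): applying the identical reshape to the waveform and to the sample-aligned upsampled conditioning puts sample t and its conditioning at the same (height, width) slot.

```python
import numpy as np

n_group = 8
T = 4 * n_group
audio = np.arange(T, dtype=float)[None, :]   # (N=1, T), values = sample indices
cond = 10.0 * audio                           # toy sample-aligned upsampled conditioning

def squeeze(x, n_group):
    # (N, T) -> (N, n_group, T // n_group); mimics unfold(1, n_group, n_group) + permute
    N, T = x.shape
    return x.reshape(N, T // n_group, n_group).transpose(0, 2, 1)

a = squeeze(audio, n_group)
c = squeeze(cond, n_group)
# slot (h, w) holds audio sample w * n_group + h, and the matching conditioning value
```

Since both tensors go through the same operation, the alignment holds element-wise: c == 10 * a everywhere.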

adrienchaton commented 4 years ago

Thank you for your comments!

Yes, the unfold operation on both the audio x and the conditioning c ensures that the squeeze channels are aligned for both. I was wondering about the alignment between individual samples within the squeeze dimension. It is not explicit in the way WaveFlow proposes it, but for every invertible convolution mixing x there is a cond_layer that likewise mixes c across the following residual convolution layers, all along the aligned squeeze dimension. So it makes sense!
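A last sketch of why the cond_layer preserves that alignment, assuming it is a kernel-size-1 convolution (the `conv1x1` helper below is illustrative, not the repo's code): it mixes the conditioning across channels but touches each time step independently, so perturbing c at one time step only affects the output at that same time step.

```python
import numpy as np

rng = np.random.default_rng(1)
C_in, C_out, W = 8, 16, 50
c = rng.standard_normal((1, C_in, W))        # conditioning: (batch, channels, time)
weight = rng.standard_normal((C_out, C_in))  # kernel-size-1 conv weight

def conv1x1(x, w):
    # channel mixing applied pointwise in time, like a Conv1d with kernel_size=1
    return np.einsum('oc,ncw->now', w, x)

out = conv1x1(c, weight)

# zero the conditioning at a single time step
c_pert = c.copy()
c_pert[..., 10] = 0.0
out_pert = conv1x1(c_pert, weight)
# out and out_pert differ only at time step 10
```

This locality in time is what keeps the mixed conditioning aligned with the corresponding audio samples along the squeeze dimension.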