Hi Shreyas,
Our WaveNet takes a mel-spectrogram representation of the audio that is put through an upsampling layer (we are using a TacoTron2-like architecture for the complete system). The upsampling layer takes the mel-spectrogram representation and extends it to num_samples length. That input is then put through each of the conditional convolutions (one per dilation layer), which results in the 2xR by layers by batch_size by sample_len tensor stored in cond_input.pt. All of these calculations can be done quickly in parallel, and the nv-wavenet code doesn't deal with them.
For generating your own audio, if you have a trained WaveNet you'll already have upsampling layers and conditional convolutions for incorporating your input at each layer. So you'll need to calculate from your features a similar 2xR, layers, batch_size, sample_len tensor.
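For illustration, here is a rough sketch of how such a tensor could be assembled (the module names, hop length, and channel sizes below are placeholder assumptions, not nv-wavenet's actual code): a transposed convolution upsamples the mel-spectrogram to audio rate, and one 1x1 convolution per dilation layer produces that layer's 2*R conditioning channels.

import torch
import torch.nn as nn

# Illustrative sizes (assumptions, not the real hyperparameters).
n_mels, R, num_layers, hop_length = 80, 64, 16, 256

# Upsample mel frames to num_samples length.
upsample = nn.ConvTranspose1d(n_mels, n_mels, kernel_size=2 * hop_length,
                              stride=hop_length, padding=hop_length // 2)
# One 1x1 conditional convolution per dilation layer, each producing
# 2*R channels (R for the tanh filter, R for the sigmoid gate).
cond_convs = nn.ModuleList(
    [nn.Conv1d(n_mels, 2 * R, kernel_size=1) for _ in range(num_layers)])

mel = torch.randn(1, n_mels, 100)                     # [batch, n_mels, frames]
upsampled = upsample(mel)                             # [batch, n_mels, num_samples]
per_layer = [conv(upsampled) for conv in cond_convs]  # each [batch, 2R, num_samples]
cond_input = torch.stack(per_layer, dim=1)            # [batch, layers, 2R, num_samples]
cond_input = cond_input.permute(2, 1, 0, 3)           # [2R, layers, batch, num_samples]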
If you want to generate new samples with our WaveNet we'd need to include the upsampling and conditional convolution parameters, as well as some audio processing code for generating the mel-spectrograms we use. Then you'd be able to generate wavefiles from new mel-spectrograms with our WaveNet. But we're working on open-sourcing our full PyTorch WaveNet code that works with our already open sourced TacoTron2 code and nv-wavenet. This will allow you to go from text to audio with fast inference.
Concerning [2xR, layers, batch_size, sample_length], I am assuming the first R chunk conditions the tanh while the last R chunk conditions the sigmoid, is that right?
For local and global conditioning, I'd need to do something like:
global_features.shape # [R, batch_size, signal_length]
local_features.shape # [R, batch_size, signal_length]
num_layers # 24
# Interweave features; the first ``R`` chunk conditions the nonlinearity
# while the last ``R`` chunk conditions the ``sigmoid`` gate.
global_features_left, global_features_right = tuple(torch.chunk(global_features, 2, dim=0))
local_features_left, local_features_right = tuple(torch.chunk(local_features, 2, dim=0))
# conditional_features [2 * R, batch_size, signal_length]
conditional_features = torch.cat([global_features_left, local_features_left,
                                  global_features_right, local_features_right], dim=0)
# [2 * R, batch_size, signal_length] →
# [2 * R, batch_size, 1, signal_length]
conditional_features = conditional_features.unsqueeze(2)
# [2 * R, batch_size, 1, signal_length] →
# [2 * R, batch_size, L, signal_length]
conditional_features = conditional_features.repeat(1, 1, num_layers, 1)
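(One caveat: this produces a [2 * R, batch_size, L, signal_length] layout, while the shape quoted earlier in the thread is [2 * R, layers, batch_size, sample_length]. If that ordering is what nv-wavenet expects, a final swap of the batch and layer axes may be needed; this is an assumption about the expected layout.)

# Assumed final step: swap batch and layer axes to get
# [2 * R, num_layers, batch_size, signal_length].
conditional_features = conditional_features.permute(0, 2, 1, 3)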
@PetrochukM Excellent question. Yes, just like you said, the first R channels are for the tanh and the next R channels are for the sigmoid. I'll add that to the README.
For your code it looks correct if you're using the same conditional convolutions at every layer of WaveNet (which result in global_features and local_features at the top).
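Side note for other readers: a minimal sketch of how those first/last R channels might be consumed inside one dilated layer. dilated_out and all shapes here are illustrative assumptions, not the actual nv-wavenet kernels.

import torch

R = 64
dilated_out = torch.randn(1, 2 * R, 16000)  # a layer's dilated conv output: [batch, 2R, num_samples]
cond = torch.randn(1, 2 * R, 16000)         # that layer's slice of the conditioning tensor

filter_in = dilated_out[:, :R] + cond[:, :R]  # first R channels condition the tanh
gate_in = dilated_out[:, R:] + cond[:, R:]    # last R channels condition the sigmoid
activation = torch.tanh(filter_in) * torch.sigmoid(gate_in)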
@RPrenger Yup! Awesome! The idea is to use the same conditioning at every layer.
I believe the WaveNet paper uses the same local and global conditioning on every layer, is that right?
@PetrochukM We don't have a reference implementation of the original WaveNet, but the paper notation implies they had a different weight matrix V_k at every layer, where k is the layer index.
The equations under equation 3 here: https://arxiv.org/pdf/1609.03499.pdf
That would cause the activations to be different at every layer. Our WaveNet implementation uses different weight matrices at each layer as well.
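For reference, the conditioning activations written under equation 3 of the paper, where the V_{f,k} and V_{g,k} matrices carry the layer index k (a LaTeX transcription):

% Global conditioning (h is a single conditioning vector):
z = \tanh\left(W_{f,k} \ast x + V_{f,k}^{\top} h\right) \odot \sigma\left(W_{g,k} \ast x + V_{g,k}^{\top} h\right)

% Local conditioning (y = f(h) is the upsampled conditioning time series):
z = \tanh\left(W_{f,k} \ast x + V_{f,k} \ast y\right) \odot \sigma\left(W_{g,k} \ast x + V_{g,k} \ast y\right)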
@RPrenger Gotcha. Thanks! Those small details are easy to miss.
Hello,
What does the cond_input.pt file from the Google Drive link you provided actually represent?
We understand that the dimensions are 2xR, layers, batch_size, sample_len.
How can we create our own inputs for conditioning?
Thanks, Shreyas