@Rayhane-mamah I was trying to understand the WaveNet vocoder implementation, and some of the layer dimensions don't seem to match what I understood from the WaveNet paper.
Could you shed some light on these dimensions, in case I'm missing something?
1) The input convolution kernel is printed with shape 1x1x128. Isn't the input to the input_convolution layer the mel-spectrogram frames, i.e. 80 float values per frame for up to 10,000 frames, so in_channels for this conv1d layer should be 80 instead of 1?
(10,000 is the maximum number of decoder steps, defined as max_iters in hparams.py.)
2) Is there a reason the upsampling stride values are [11, 25]? Are the specific numbers 11 and 25 special, or do they affect other shapes/dimensions?
3) Why is the number of input channels 128 in residual_block_causal_conv and 80 in residual_block_cin_conv? What exactly are their inputs (e.g. the mel-spectrogram, or just a raw floating-point value)? Does the WaveNet vocoder generate just 1 float value per input mel-spectrogram frame of 80 floats?
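To make question 2 concrete, here is the arithmetic I have in mind, as a minimal sketch. It assumes (I have not confirmed this against the code) that the upsampling strides are meant to multiply out to the number of audio samples per mel frame. The value `assumed_hop_size = 275` is my guess rather than something I read from hparams.py, and `samples_per_frame` and `upsampled_length` are hypothetical helper names, not functions from the repo.

```python
from math import prod

# Strides quoted in question 2.
upsample_scales = [11, 25]

# ASSUMPTION: audio samples between consecutive mel frames.
# This is a guess for illustration, not a value read from hparams.py.
assumed_hop_size = 275


def samples_per_frame(scales):
    """Total upsampling factor: the product of the per-layer strides."""
    return prod(scales)


def upsampled_length(num_frames, scales):
    """Number of time steps after upsampling `num_frames` mel frames."""
    return num_frames * samples_per_frame(scales)


if __name__ == "__main__":
    factor = samples_per_frame(upsample_scales)
    print("total upsampling factor:", factor)
    print("matches assumed hop size:", factor == assumed_hop_size)
    print("time steps for 4 mel frames:", upsampled_length(4, upsample_scales))
```

If that assumption holds, each 80-float mel frame would be stretched to cover many audio samples instead of one, which would also bear on question 3.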
The printout of the whole WaveNet network that I see is shown below: