NVIDIA / waveglow

A Flow-based Generative Network for Speech Synthesis
BSD 3-Clause "New" or "Revised" License

Questions about mel params and training #82

Closed: KeniMardira closed 5 years ago

KeniMardira commented 5 years ago
  1. In your experience, what do you look out for when deciding how to scale the mel-spectrogram amplitude? Specifically, why did you decide not to normalise the mel-spectrogram in your experiments?

  2. In the paper you mention that you trained the network with a batch size of 24 for 580,000 iterations. Is this also the case for the checkpoint provided?

  3. With the GPUs that I have at the moment (K80 and P100), and using a single GPU for each experiment, I can only manage a batch size of around 3 to 5, depending on the parameters of the WN network. In your opinion, would I be able to achieve similar results with a smaller batch size?

  4. If the answer to Question 3 is no, is there a way for me to get better results without having to use multiple GPUs?

rafaelvalle commented 5 years ago
  1. WaveGlow is an extremely large and non-linear function that should be able to learn the map from audio to z and vice versa regardless of whether the spectrogram amplitude is normalized. Normalizing the amplitude probably makes it easier for the model to learn the map.

  2. Yes, the checkpoint provided is at ~580k iterations, if not exactly.

  3. In expectation, you should be able to achieve similar results by using a proper learning rate and training for longer.
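
As a rough illustration of what "training for longer" could mean at a smaller batch size, the sketch below counts how many iterations it would take to see the same number of training examples as the run described in the question (batch size 24, 580,000 iterations). Matching the total number of examples seen is only an assumed proxy here, not something stated in this thread, and the function name is hypothetical.

```python
def iterations_for_same_examples(ref_batch_size, ref_iterations, new_batch_size):
    """Iterations needed at new_batch_size to see as many training examples
    as a reference run (assumed proxy for "training for longer")."""
    total_examples = ref_batch_size * ref_iterations
    # Integer ceiling division, so a partial final batch still counts.
    return -(-total_examples // new_batch_size)


# Reference run taken from the question: batch size 24, 580,000 iterations.
REF_BATCH_SIZE = 24
REF_ITERATIONS = 580_000

for batch_size in (3, 4, 5):
    needed = iterations_for_same_examples(REF_BATCH_SIZE, REF_ITERATIONS, batch_size)
    print(f"batch size {batch_size}: about {needed:,} iterations")
```

This says nothing about which learning rate is appropriate at the smaller batch size; that would still need to be tuned separately.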

KeniMardira commented 5 years ago

@rafaelvalle Thanks for your answer. Recently I was able to train a model with less ringing, simply by replacing the upsampling method with nearest-neighbour upsampling.

I found that the upsampling method used in this repo is somewhat arbitrary: the conv transpose used for upsampling outputs some "residual" samples at the end. Was there any reason for using conv transpose?
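
To make the "residual" samples concrete, here is a minimal length-only sketch in plain Python. It assumes an unpadded transposed convolution with kernel size 1024 and stride 256 (values I believe match the repo's defaults, but treat them as assumptions) and compares its output length with nearest-neighbour upsampling by the same factor. The helper names are hypothetical.

```python
def conv_transpose_length(n_frames, kernel_size, stride):
    """Output length of a 1-D transposed convolution with
    padding=0, dilation=1 and output_padding=0."""
    return (n_frames - 1) * stride + kernel_size


def nearest_neighbour_upsample(frames, factor):
    """Repeat every frame `factor` times along the time axis."""
    return [value for value in frames for _ in range(factor)]


KERNEL_SIZE = 1024  # assumed upsampling kernel size
STRIDE = 256        # assumed hop length (upsampling factor)

n_frames = 10
ct_len = conv_transpose_length(n_frames, KERNEL_SIZE, STRIDE)
nn_len = len(nearest_neighbour_upsample(list(range(n_frames)), STRIDE))

# The transposed convolution overshoots by kernel_size - stride samples,
# independent of the number of input frames.
residual = ct_len - nn_len
print(ct_len, nn_len, residual)
```

Under these assumptions the transposed convolution always produces `KERNEL_SIZE - STRIDE` extra samples that have to be trimmed, whereas nearest-neighbour upsampling yields exactly `n_frames * STRIDE` samples.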

rafaelvalle commented 5 years ago

Can you share with us samples from your nearest neighbor upsampling approach?

rafaelvalle commented 5 years ago

Closing due to inactivity.