KeniMardira closed this issue 5 years ago.
WaveGlow is an extremely large, non-linear function that should be able to learn the map from audio to z (and vice versa) regardless of whether the spectrogram amplitude is normalized. Normalizing the amplitude probably just makes that map easier to learn.
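For reference, a common alternative to min-max normalization in Tacotron-style mel pipelines is log-based dynamic range compression, which clamps tiny amplitudes before taking the log. A minimal sketch (the function name and clip value here are illustrative, not necessarily this repo's exact API):

```python
import math

def dynamic_range_compression(x, clip_val=1e-5):
    # Clamp the mel amplitude to avoid log(0), then log-compress it.
    # clip_val=1e-5 is a commonly used choice, not a requirement.
    return math.log(max(x, clip_val))

# An amplitude of 1.0 maps to 0.0; anything at or below clip_val
# maps to the same floor value.
print(dynamic_range_compression(1.0))
```

This keeps relative amplitudes intact (unlike per-utterance min-max scaling) while compressing the dynamic range the network has to model.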
Yes, the checkpoint provided is at ~580k iterations, if not exactly.
In expectation, you should be able to achieve similar results by using a proper learning rate and training for longer.
@rafaelvalle Thanks for your answer. Recently I was able to train a model with less ringing simply by replacing the upsampling method with nearest-neighbour upsampling.
I found the upsampling method used in this repo somewhat arbitrary: the transposed convolution used for upsampling outputs some "residual" samples at the end. Was there any reason for using a transposed convolution?
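To make the "residual samples" concrete: a strided transposed convolution produces more output samples than frames × hop size whenever its kernel is longer than its stride, so the tail has to be trimmed, whereas nearest-neighbour upsampling repeats each frame exactly hop-size times. A sketch of the length arithmetic (stride/kernel values follow the common 256-hop, 1024-kernel configuration; treat them as assumptions, not the repo's exact settings):

```python
def conv_transpose_output_len(n_frames, stride=256, kernel_size=1024):
    # Output length of an unpadded 1-D transposed convolution:
    # (n_frames - 1) * stride + kernel_size
    return (n_frames - 1) * stride + kernel_size

def nearest_neighbor_output_len(n_frames, hop_size=256):
    # Nearest-neighbour upsampling just repeats each mel frame hop_size times.
    return n_frames * hop_size

frames = 100
ct = conv_transpose_output_len(frames)    # 26368 samples
nn = nearest_neighbor_output_len(frames)  # 25600 samples
print(ct - nn)  # 768 extra "residual" samples that must be trimmed
```

With nearest-neighbour upsampling the output length lines up with the audio exactly, which is one reason it is a popular fix for transposed-convolution artifacts.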
Can you share with us samples from your nearest neighbor upsampling approach?
Closing due to inactivity.
In your experience, what do you look out for when deciding how to scale the mel-spectrogram amplitude? Specifically, why did you decide not to normalize the mel spectrogram in your experiments?
In the paper you mentioned that you trained the network with a batch size of 24 for 580,000 iterations. Can you confirm that this is also the case for the checkpoint provided?
With the GPUs I have at the moment (a K80 and a P100), using a single GPU per experiment, I can only manage a batch size of around 3 to 5, depending on the parameters of the WN network. In your opinion, could I achieve similar results with a smaller batch size?
If the answer to the previous question is no, is there a way for me to get better results without having to use multiple GPUs?
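One common single-GPU workaround (not something the source confirms for WaveGlow specifically) is gradient accumulation: averaging the gradients of several micro-batches before each optimizer step reproduces the gradient of one larger batch, at the cost of more wall-clock time. A minimal pure-Python sketch with a linear model and mean-squared loss, where the names and data are hypothetical:

```python
def grad_mse(w, xs, ys):
    # Gradient of mean((w*x - y)^2) with respect to w:
    # mean(2 * (w*x - y) * x)
    return sum(2 * (w * x - y) * x for x, y in zip(xs, ys)) / len(xs)

xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.0, 4.0, 6.0, 8.0]
w = 0.5

# Gradient over the full batch of 4...
full = grad_mse(w, xs, ys)
# ...equals the average of two equal-sized micro-batch gradients.
micro = (grad_mse(w, xs[:2], ys[:2]) + grad_mse(w, xs[2:], ys[2:])) / 2
print(full, micro)
```

In a PyTorch training loop this corresponds to calling `loss.backward()` on each micro-batch (gradients accumulate in `.grad` by default) and invoking `optimizer.step()` only every k micro-batches, with the loss scaled by 1/k.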