NVIDIA / waveglow

A Flow-based Generative Network for Speech Synthesis
BSD 3-Clause "New" or "Revised" License

How long did you train the model #75

Closed li-xx-5 closed 5 years ago

li-xx-5 commented 5 years ago

Hello, everyone. I have spent many days training the model. How long did it take you? Is there any other way to shorten the training time?

Yeongtae commented 5 years ago

You need about 870 epochs to train the WaveGlow model. In the WaveGlow paper, they train the network on 8 V100 GPUs with a batch size of 24 for 580K iterations. 24 * 580K / 16K (the number of samples in LJSpeech) = 870 epochs.

To reduce training time, you can reduce n_channels in wn_config from 512 to 256, increase the batch size (if you train on a V100, you can raise it from 3 to 10), or use more GPUs with distributed.py.
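For reference, here is a small sketch of the epoch arithmetic above (the 16K figure follows the comment; the function and variable names are just illustrative):

```python
def iterations_to_epochs(iterations, batch_size, num_clips=16_000):
    """Convert optimizer iterations to dataset epochs."""
    return iterations * batch_size / num_clips

# Paper setting: batch size 24 for 580K iterations on LJSpeech
print(iterations_to_epochs(580_000, 24))  # -> 870.0 epochs
```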

li-xx-5 commented 5 years ago

thank you very much @Yeongtae

truemanlee commented 5 years ago

I have 2x1080ti, and the training process takes at least 5 days to get a decent result.

yxt132 commented 5 years ago

I have 2x1080ti, and the training process takes at least 5 days to get a decent result.

Did you use LJSpeech? Did you change the model size (i.e., the number of nodes)?

We have 6 1080 Tis, and the results are still noisy even after a week.

truemanlee commented 5 years ago

I have 2x1080ti, and the training process takes at least 5 days to get a decent result.

Did you use LJSpeech? Did you change the model size (i.e., the number of nodes)?

We have 6 1080 Tis, and the results are still noisy even after a week.

Yes, I used both LJSpeech and other audio (with the volume normalized to the same level). I found that the total duration of the training data (34 hours in my case) played an important role. First, I normalized the mel spectrogram to the range [0, 1]. Second, I didn't change the size of the network, but I used gradient accumulation to raise the effective batch size to 8. Finally, I don't know how noisy your results are.
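For what it's worth, a minimal sketch of that accumulated-gradient trick (names like accum_steps and the loader/criterion are illustrative, not from the WaveGlow training script):

```python
accum_steps = 8  # effective batch size = per-step batch size * accum_steps

optimizer.zero_grad()
for i, (mel, audio) in enumerate(train_loader):
    outputs = model((mel, audio))
    loss = criterion(outputs) / accum_steps  # scale so the summed gradient matches one big batch
    loss.backward()                          # gradients accumulate across the small batches
    if (i + 1) % accum_steps == 0:
        optimizer.step()
        optimizer.zero_grad()
```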

yxt132 commented 5 years ago

Thanks for your response! So did you figure out the cutoff point of total duration below which the model may fail to converge? Is 20 hours of data long enough? How much improvement did you see from normalizing the mel spectrogram? Do you mind sharing the code for gradient accumulation?

Yeongtae commented 5 years ago

When you synthesize from validation text manually, you can check the difference between the target audio and the generated audio by plotting the audio values.
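One possible way to do that comparison plot (a sketch assuming both waveforms are 1-D arrays at the same sample rate):

```python
import matplotlib.pyplot as plt

def plot_audio_comparison(target, generated, path="comparison.png"):
    """Stack target and generated waveforms for a quick visual check."""
    fig, axes = plt.subplots(2, 1, sharex=True, figsize=(12, 4))
    axes[0].plot(target)
    axes[0].set_title("target audio")
    axes[1].plot(generated)
    axes[1].set_title("generated audio")
    fig.tight_layout()
    fig.savefig(path)
    plt.close(fig)
```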

13,000 sentences (8 hours) were enough in my case. You can check the samples in https://github.com/NVIDIA/waveglow/issues/71.

It makes the model easier to converge. Here is my implementation of normalization to [-4, 4]:

https://github.com/Yeongtae/tacotron2/blob/prosody_encoder_test/layers.py
https://github.com/Yeongtae/tacotron2/blob/prosody_encoder_test/audio_processing.py
https://github.com/Yeongtae/tacotron2/blob/prosody_encoder_test/inference.py

I referred to the code of https://github.com/Rayhane-mamah/Tacotron-2

truemanlee commented 5 years ago

Thanks for your response! So did you figure out the cutoff point of total duration below which the model may fail to converge? Is 20 hours of data long enough? How much improvement did you see from normalizing the mel spectrogram? Do you mind sharing the code for gradient accumulation?

Actually, I didn't do many comparison experiments because of machine limitations. The following is mostly based on my intuition.

I believe that the more training data there is, the better the generated results. In fact, the generated audio is conditioned on the mel spectrogram, so there are few constraints on the training data. You can just collect plenty of audio and use it for training.

As for the normalization of the mel spectrogram, it mostly helps the training process. As mentioned in another issue, I also ran into unstable training with a small batch size (in some cases the loss exploded and the model couldn't converge again). Whether a bigger batch size can stabilize training remains uncertain to me. The normalization formula is torch.clamp((20 * log10(torch.clamp(mel, min=1e-5)) + 80) / 100, min=0, max=1), and I also clip the loss at a maximum value of 1000 to keep the training loss from exploding.
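As a sketch, the normalization and loss clipping described above might look like this (the function name and surrounding training-loop variables are mine, not from the repo):

```python
import torch

def normalize_mel(mel):
    """Map a linear-magnitude mel spectrogram to dB and rescale to [0, 1]."""
    mel_db = 20 * torch.log10(torch.clamp(mel, min=1e-5))      # floor at -100 dB
    return torch.clamp((mel_db + 80) / 100, min=0.0, max=1.0)

# in the training loop: clip the loss so one bad batch can't blow up training
loss = criterion(outputs)
loss = torch.clamp(loss, max=1000.0)
loss.backward()
```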

Sorry that I can't upload the generated audio because I resigned from the job and stopped the related work. But you have 6 1080 Tis, so you can use gradient accumulation to raise the effective batch size to 24 (about 4 times slower than the original paper's setup) and keep the other parameters unchanged.

Finally, the noise may be an inherent property of glow-based models. If you listen very carefully to the demo audio that NVIDIA released, there is still a little noise.

yxt132 commented 5 years ago

https://github.com/Yeongtae/tacotron2/blob/prosody_encoder_test/inference.py

Thank you very much! I am looking at the normalizing function in your code:

def mel_normalize(x, max_abs_value=4.0, min_level_db=-100): return torch.clamp((2 * max_abs_value) * (x - min_level_db) / (-min_level_db) - max_abs_value, min=-max_abs_value, max=max_abs_value)

Based on this function, if x = min_level_db, the rescaled value is equal to -max_abs_value; and if x = 0, the rescaled value is equal to max_abs_value. Anything bigger than 0 db will be clamped to max_abs_value. It seems a little strange to me. Is this correct? Or did I understand it wrong? Thanks.
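A quick numeric check of that boundary behavior, using the mel_normalize definition quoted above with its default arguments:

```python
import torch

def mel_normalize(x, max_abs_value=4.0, min_level_db=-100):
    return torch.clamp((2 * max_abs_value) * (x - min_level_db) / (-min_level_db) - max_abs_value,
                       min=-max_abs_value, max=max_abs_value)

x = torch.tensor([-100.0, -50.0, 0.0, 20.0])  # mel levels in dB
print(mel_normalize(x))                       # tensor([-4., 0., 4., 4.])
```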

yxt132 commented 5 years ago

Thanks for your response! So did you figure out the cutoff point of total duration below which the model may fail to converge? Is 20 hours of data long enough? How much improvement did you see from normalizing the mel spectrogram? Do you mind sharing the code for gradient accumulation?

Actually, I didn't do many comparison experiments because of machine limitations. The following is mostly based on my intuition.

I believe that the more training data there is, the better the generated results. In fact, the generated audio is conditioned on the mel spectrogram, so there are few constraints on the training data. You can just collect plenty of audio and use it for training.

As for the normalization of the mel spectrogram, it mostly helps the training process. As mentioned in another issue, I also ran into unstable training with a small batch size (in some cases the loss exploded and the model couldn't converge again). Whether a bigger batch size can stabilize training remains uncertain to me. The normalization formula is torch.clamp((20 * log10(torch.clamp(mel, min=1e-5)) + 80) / 100, min=0, max=1), and I also clip the loss at a maximum value of 1000 to keep the training loss from exploding.

Sorry that I can't upload the generated audio because I resigned from the job and stopped the related work. But you have 6 1080 Tis, so you can use gradient accumulation to raise the effective batch size to 24 (about 4 times slower than the original paper's setup) and keep the other parameters unchanged.

Finally, the noise may be an inherent property of glow-based models. If you listen very carefully to the demo audio that NVIDIA released, there is still a little noise.

thank you very much

Yeongtae commented 5 years ago

https://github.com/Yeongtae/tacotron2/blob/prosody_encoder_test/inference.py

Thank you very much! I am looking at the normalizing function in your code:

def mel_normalize(x, max_abs_value=4.0, min_level_db=-100): return torch.clamp((2 * max_abs_value) * (x - min_level_db) / (-min_level_db) - max_abs_value, min=-max_abs_value, max=max_abs_value)

Based on this function, if x = min_level_db, the rescaled value is equal to -max_abs_value; and if x = 0, the rescaled value is equal to max_abs_value. Anything bigger than 0 db will be clamped to max_abs_value. It seems a little strange to me. Is this correct? Or did I understand it wrong? Thanks.

Check the mel_spectrogram code in layers.py. [screenshot of the mel_spectrogram code omitted]

In my test, it can generate an intelligible voice using griffin_lim(melspectrogram(audio)). My guess is that the clipped values are already large or small enough.
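The screenshot isn't preserved here. Roughly, the mel_spectrogram path in Tacotron 2's layers.py takes an STFT, applies the mel filterbank, and log-compresses the result; a sketch of that shape (not the exact code from the screenshot) is:

```python
import torch

def dynamic_range_compression(x, C=1, clip_val=1e-5):
    # log-compress magnitudes, flooring them so log() never sees zero
    return torch.log(torch.clamp(x, min=clip_val) * C)

def mel_spectrogram(y, stft_fn, mel_basis):
    # y: audio in [-1, 1]; stft_fn and mel_basis come from the TacotronSTFT module
    magnitudes, _ = stft_fn.transform(y)
    mel = torch.matmul(mel_basis, magnitudes)
    return dynamic_range_compression(mel)
```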

yxt132 commented 5 years ago

https://github.com/Yeongtae/tacotron2/blob/prosody_encoder_test/inference.py

Thank you very much! I am looking at the normalizing function in your code: def mel_normalize(x, max_abs_value=4.0, min_level_db=-100): return torch.clamp((2 * max_abs_value) * (x - min_level_db) / (-min_level_db) - max_abs_value, min=-max_abs_value, max=max_abs_value) Based on this function, if x = min_level_db, the rescaled value is equal to -max_abs_value; and if x = 0, the rescaled value is equal to max_abs_value. Anything bigger than 0 db will be clamped to max_abs_value. It seems a little strange to me. Is this correct? Or did I understand it wrong? Thanks.

Check the mel_spectrogram code in layers.py. [screenshot of the mel_spectrogram code omitted]

In my test, it can generate an intelligible voice using griffin_lim(melspectrogram(audio)). My guess is that the clipped values are already large or small enough.

Thank you for your quick response, Yeongtae! I really appreciate it. How does Griffin-Lim compare to WaveGlow in terms of generated audio quality and synthesis speed?

Yeongtae commented 5 years ago

Lower quality and slower speed, but it can generate an intelligible voice. I just use it for debugging and monitoring Tacotron 2 results.
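For that kind of quick Griffin-Lim check, something like the following works (a sketch assuming librosa >= 0.7 and a linear-magnitude mel spectrogram; the STFT parameters should match the ones used to extract the mels):

```python
import librosa

def mel_to_audio_griffin_lim(mel, sr=22050, n_fft=1024, hop_length=256, n_iter=60):
    """Invert a mel spectrogram with Griffin-Lim for a rough listening test."""
    stft = librosa.feature.inverse.mel_to_stft(mel, sr=sr, n_fft=n_fft)
    return librosa.griffinlim(stft, n_iter=n_iter, hop_length=hop_length)
```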

yxt132 commented 5 years ago

Thanks. I will try to do some debugging, but I am still puzzled about the mel_normalize function in your audio_processing.py file. Any positive x value (mel dB) leads to a scaled value larger than max_abs_value and is therefore clamped to max_abs_value. I am wondering if that results in information loss.

Yeongtae commented 5 years ago

@yxt132 No worries. This normalization makes your model easier to converge. Humans perceive audio magnitude on a log scale, and the values in the clipped range preserve the important information.

fakufaku commented 5 years ago

I am having some speed problems running the training. I run the training on a P100 GPU, and one iteration takes around 8 seconds with a batch size of three (single GPU). This doesn't seem consistent with the training times reported above (I feel one iteration should be faster).

How long does one iteration take in your case?

truemanlee commented 5 years ago

I am having some speed problems running the training. I run the training on a P100 GPU, and one iteration takes around 8 seconds with a batch size of three (single GPU). This doesn't seem consistent with the training times reported above (I feel one iteration should be faster).

How long does one iteration take in your case?

About 0.7 seconds per iteration with a single 1080 Ti (batch size 1). With multiple 1080 Tis, about 1 second per iteration.

fakufaku commented 5 years ago

Thanks. Is this with the original number of channels (n_channels=512) and all the other parameters as provided in the original WaveGlow?

Also, which versions of PyTorch, CUDA, and cuDNN are you using?

Sorry for the trouble, and thanks for your help!

truemanlee commented 5 years ago

Thanks. Is this with the original number of channels (n_channels=512) and all the other parameters as provided in the original WaveGlow?

Also, which versions of PyTorch, CUDA, and cuDNN are you using?

Sorry for the trouble, and thanks for your help!

Yes, I didn't change the size of the network. The PyTorch version is 0.4.1 and CUDA is 8.0. Did you use fp16 on the P100?

Yeongtae commented 5 years ago

If you change n_channels to 256, you can increase the batch size and make each iteration faster, without losing too much quality.

fakufaku commented 5 years ago

Yes, I didn't change the size of the network. The PyTorch version is 0.4.1 and CUDA is 8.0. Did you use fp16 on the P100?

Are you using fp16? I haven't tried that yet. I saw there is an option for inference, but it doesn't seem completely straightforward to do for training. I gave it a go but ran into some errors in the optimizer.

If you change n_channels to 256, you can increase the batch size and make each iteration faster, without losing too much quality.

I had also previously reduced n_channels to 128 and increased the batch size to 12 (with a single GPU). The strange part is that, in my effort to improve the speed, I recompiled PyTorch. When compiled with cuDNN, one iteration (with batch size 12) takes ~5 s, and when compiled without it, ~1.4 s.

Is it just that the P100 is that much slower?

fakufaku commented 5 years ago

I just managed to get the training running with fp16, but now the loss is NaN and there is close to no speed gain.

rafaelvalle commented 5 years ago

@fakufaku try the new fp16 setup using amp.
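For reference, the apex amp pattern being suggested is roughly the following (a sketch; the opt_level and the rest of the training loop are illustrative):

```python
from apex import amp

model, optimizer = amp.initialize(model, optimizer, opt_level="O1")

for mel, audio in train_loader:
    optimizer.zero_grad()
    loss = criterion(model((mel, audio)))
    # amp scales the loss so fp16 gradients don't underflow, which helps avoid NaN losses
    with amp.scale_loss(loss, optimizer) as scaled_loss:
        scaled_loss.backward()
    optimizer.step()
```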