NVIDIA / waveglow

A Flow-based Generative Network for Speech Synthesis
BSD 3-Clause "New" or "Revised" License

Gradient overflow when training #205

Open EuphoriaCelestial opened 4 years ago

EuphoriaCelestial commented 4 years ago

I am training a WaveGlow model from scratch on an 8 kHz sampling-rate dataset and got this error after 1 epoch: [image]

Since PyTorch doesn't fully support RTX GPUs yet, I have to use my old 1050 Ti and set batch_size to 1; this is my config.json. Is the batch_size being too small what causes the problem, or am I using the wrong audio params?

{ "train_config": { "fp16_run": true, "output_directory": "checkpoints", "epochs": 100000, "learning_rate": 1e-4, "sigma": 1.0, "iters_per_checkpoint": 2000, "batch_size": 1, "seed": 1234, "checkpoint_path": "", "with_tensorboard": true }, "data_config": { "training_files": "train_files.txt", "segment_length": 16000, "sampling_rate": 8000, "filter_length": 1024, "hop_length": 256, "win_length": 1024, "mel_fmin": 0.0, "mel_fmax": 4000.0 }, "dist_config": { "dist_backend": "nccl", "dist_url": "tcp://localhost:54321" },

"waveglow_config": {
    "n_mel_channels": 80,
    "n_flows": 18,
    "n_group": 8,
    "n_early_every": 4,
    "n_early_size": 2,
    "WN_config": {
        "n_layers": 4,
        "n_channels": 256,
        "kernel_size": 3
    }
}

}

CookiePPP commented 4 years ago

@EuphoriaCelestial With a sampling_rate of 8000, your segment_length of 16000 is 2 seconds of audio. I don't know what dataset you use, but you might be training on a lot of padded data (files under 2 seconds long will be padded with zeros). You should probably decrease segment_length to 6144 or something along those lines and increase batch_size to 4.


"sampling_rate": 8000,
"filter_length": 1024,
"hop_length": 256,
"win_length": 1024,
"mel_fmin": 0.0,
"mel_fmax": 4000.0

These just need to match the values your Tacotron2 was trained with and you're good.
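
For example, something like this is what "matching" means in practice (a sketch only; the field names follow NVIDIA/tacotron2's hparams.py, and the values are the 8 kHz ones from this thread, not the repo defaults):

```python
# The audio / STFT parameters that have to agree between the Tacotron2 that
# produces the mels and the WaveGlow that consumes them. Values below are the
# 8 kHz settings from this thread, not the repo defaults.
shared_audio_params = {
    "sampling_rate": 8000,
    "filter_length": 1024,
    "hop_length": 256,
    "win_length": 1024,
    "n_mel_channels": 80,
    "mel_fmin": 0.0,
    "mel_fmax": 4000.0,
}

# e.g. sanity-check a loaded Tacotron2 hparams object against it:
#   for name, value in shared_audio_params.items():
#       assert getattr(tacotron2_hparams, name) == value, name
```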


    "n_flows": 18,
    "n_group": 8,
    "n_early_every": 4,
    "n_early_size": 2,

You have 18 flows. You start with 8 channels (n_group), and every 4 flows (n_early_every) you output 2 of the channels (n_early_size).

At flow 0 you have 8 channels, at flow 4 you have 6, at flow 8 you have 4, at flow 12 you have 2, and at flow 16 you would be down to 0.

You cannot have 0 channels...

I'm pretty sure this is not the config you actually used, as I don't think this config can even start.
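
To make the bookkeeping concrete, here's a rough sketch of the counting (not the repo's actual code, just the early-output arithmetic):

```python
# With n_flows=18, n_group=8, n_early_every=4, n_early_size=2 the channel
# count reaches zero before the last flows, so this config cannot work.
n_flows, n_group, n_early_every, n_early_size = 18, 8, 4, 2

remaining = n_group
for k in range(n_flows):
    if k % n_early_every == 0 and k > 0:
        remaining -= n_early_size   # n_early_size channels are output early
    assert remaining > 0, f"flow {k}: no channels left to transform"
    print(f"flow {k:2d}: {remaining} channels")
# 8 -> 6 -> 4 -> 2 -> 0: the assertion fires at flow 16.
```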

EuphoriaCelestial commented 4 years ago

@CookiePPP oh yes, I forgot, I just changed n_flows and n_layers after it broke, as this comment suggests, but I haven't tried it yet: https://github.com/NVIDIA/waveglow/issues/54#issuecomment-444383088. I was using the default config file when the gradient overflow happened.

EuphoriaCelestial commented 4 years ago

I don't know what dataset you use but you might be training on a lot of padded data (files under 2 seconds long will be padded with zeros).

No, I am sure all my audio files are above 2 seconds, even after I trimmed all the silence at the beginning and end of each file.

CookiePPP commented 4 years ago

@EuphoriaCelestial No problem. I'm not sure about the gradient overflow.

Try a lower learning rate for now. (The original WaveGlow used a segment_length of 16000 with batch_size 3 on 8 GPUs = 384000 samples per iteration.) You are using far fewer samples, so your gradients might be quite noisy.

edit: I've used a 1.2e-3 LR before when I had 576000 samples per iter, so I know there's some room to increase/decrease safely.
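
Just to put numbers on the difference:

```python
# Samples seen per optimizer step = segment_length * batch_size * n_gpus.
original_recipe = 16000 * 3 * 8   # 384000 samples/iter (original WaveGlow setup)
this_thread = 6144 * 1 * 1        # 6144 samples/iter (the 1050 Ti run above)
print(original_recipe / this_thread)  # 62.5x fewer samples per step
```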

EuphoriaCelestial commented 4 years ago

okay, now I am using these params:

"learning_rate": 5e-5, "segment_length": 6144, "n_flows": 12, "n_layers": 8,

but is it okay to use batch_size=1?

EuphoriaCelestial commented 4 years ago

well, just a few steps in and it's already messy: [image]

CookiePPP commented 4 years ago

@EuphoriaCelestial A loss scale around 256 is normal. I'd only worry once it goes under 64.
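
(For context: with dynamic loss scaling the scale is halved whenever a step overflows and grown back after a stretch of clean steps, so occasional overflow messages are expected. Here's a generic torch.cuda.amp sketch of that behaviour; it's for illustration only and is not necessarily the fp16 path this repo uses.)

```python
import torch

# Generic dynamic loss scaling, for illustration only; the repo's own
# fp16_run code path may differ.
scaler = torch.cuda.amp.GradScaler(init_scale=2 ** 15)

# Inside the training loop you would do something like:
#   with torch.cuda.amp.autocast():
#       loss = criterion(model(mel), audio)
#   scaler.scale(loss).backward()
#   scaler.step(optimizer)     # skips the step if grads contain inf/nan
#   scaler.update()            # halves the scale on overflow, grows it after
#                              # a streak of clean steps
#   print(scaler.get_scale())  # a scale collapsing toward 1.0 means almost
#                              # every step is overflowing
```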

EuphoriaCelestial commented 4 years ago

okay, I will wait and report later

CookiePPP commented 4 years ago

[Screenshot from 2020-06-05 04-41-53] Even better, the model I'm training right now is running at a loss scale of 64 :smile:

EuphoriaCelestial commented 4 years ago

@CookiePPP One question: how did you know the segment_length needed to be 6144 for an 8k sampling rate? I want to know how to calculate it, so I can train another model at 16k if the audio quality at 8k is too bad.

CookiePPP commented 4 years ago

@EuphoriaCelestial I picked something similar to the original; the original config uses a segment_length of 16000 for 22.05 kHz audio.

So I know that the original WaveGlow worked well with segments roughly three quarters of a second long (16000 / 22050 ≈ 0.73 s). 6144 samples at 8000 Hz is a similar duration (≈ 0.77 s), and 6144 is divisible by the hop_length of 256, so there's no extra padding.

It does not have to be exact, but a segment that is too short makes it hard to learn low frequencies, and a multiple of the hop_length wastes less compute.
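
In other words, aim for roughly the same duration as the original and round to a multiple of hop_length; a tiny helper just to illustrate (not something from the repo):

```python
# Pick a segment_length of roughly `seconds` of audio, rounded down to a
# multiple of hop_length so segments line up with whole mel frames.
def pick_segment_length(sampling_rate, hop_length=256, seconds=0.75):
    return (int(sampling_rate * seconds) // hop_length) * hop_length

print(pick_segment_length(8000))   # 5888  (6144 used above is one hop longer)
print(pick_segment_length(16000))  # 11776 (anything in the 8192-12288 range works too)
```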

EuphoriaCelestial commented 4 years ago

so, if I use a 16k sample rate, segment_length will be somewhere around 15872, or lower, right? Or maybe 8192? Should it be close to the sample rate or to half of the sample rate?

CookiePPP commented 4 years ago

@EuphoriaCelestial I would again aim for somewhere between half and three quarters of a second (and a multiple of hop_length); anything between 8192 and 12288 would be fine. You can use whatever you want, the main thing is just not to make it too small.


https://github.com/NVIDIA/DeepLearningExamples/blob/master/PyTorch/SpeechSynthesis/Tacotron2/waveglow/arg_parser.py#L53

The DeepLearningExamples version uses a segment_length of 4000 for 22.05 kHz, so it seems to be all over the place.

...

https://github.com/NVIDIA/DeepLearningExamples/blob/master/PyTorch/SpeechSynthesis/Tacotron2/scripts/train_waveglow.sh

or does it use a segment_length of 8000?

I have no idea what's considered normal. I use about half a second with my models and it works well enough for me.

EuphoriaCelestial commented 4 years ago

@CookiePPP cool, thank you for the knowledge!

EuphoriaCelestial commented 4 years ago

update:

the training on the 1050 Ti got its loss scale down to 1.0; it is starting to generate some audible voice, but it's still as noisy as a broken radio: [image]

I also started a new training run on a 2080 Ti with the same config except batch_size=24, but it also got gradient overflow and the loss scale dropped very low: [image] After 5 epochs, no voice yet, only noise: [image]

CookiePPP commented 4 years ago

@rafaelvalle any experience with this?

EuphoriaCelestial commented 4 years ago

it crashed again :( I decided to use fp32, which seems to have solved the problem; no gradient overflow so far.
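
For anyone else who hits this, that's just turning off the flag in config.json's train_config:

```
"fp16_run": false
```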

EuphoriaCelestial commented 4 years ago

I noticed that sigma is set to 1 in config.json, but it is 0.666 in the tacotron2 inference file. Is it supposed to be like that, or do they have to be the same value? Moreover, what is sigma?