Open EuphoriaCelestial opened 4 years ago
@EuphoriaCelestial
With a sampling_rate of 8000, your segment_length of 16000 is 2 seconds of audio.
I don't know what dataset you use, but you might be training on a lot of padded data (files under 2 seconds long will be padded with zeros).
You should probably decrease segment_length to 6144 or something along those lines and increase batch_size to 4.
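For reference, a quick back-of-the-envelope sketch of that math (values taken from the config quoted further down; the script itself is only illustrative):

```python
# How long is one training segment, and which files get zero-padded?
sampling_rate = 8000       # from data_config
segment_length = 16000     # from data_config (the default)

print(segment_length / sampling_rate)   # 2.0 seconds per segment
# Any clip shorter than 2.0 s is padded with zeros up to segment_length;
# at 6144 samples (~0.77 s at 8 kHz), far fewer files would need padding.
```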
"sampling_rate": 8000,
"filter_length": 1024,
"hop_length": 256,
"win_length": 1024,
"mel_fmin": 0.0,
"mel_fmax": 4000.0
just need to match Tacotron2 and you're good.
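If it helps, here is a minimal sketch of what "match Tacotron2" means in practice; the tacotron2_audio values below are assumed to mirror your Tacotron2 hparams, so substitute whatever you actually trained with:

```python
import json

# Assumed audio settings from the Tacotron2 side (hypothetical values, fill in your own)
tacotron2_audio = {
    "sampling_rate": 8000,
    "filter_length": 1024,
    "hop_length": 256,
    "win_length": 1024,
    "mel_fmin": 0.0,
    "mel_fmax": 4000.0,
}

# Compare against WaveGlow's data_config
with open("config.json") as f:
    data_config = json.load(f)["data_config"]

for key, expected in tacotron2_audio.items():
    actual = data_config[key]
    assert actual == expected, f"{key} mismatch: WaveGlow={actual}, Tacotron2={expected}"
print("audio params match")
```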
"n_flows": 18,
"n_group": 8,
"n_early_every": 4,
"n_early_size": 2,
You have 18 flows. You start with 8 channels (n_group), and every 4 flows you output 2 of the channels early.
At flow 0 you have 8 channels; after the early outputs at flows 4, 8, 12 and 16 you are down to 6, 4, 2 and finally 0 channels.
You cannot have 0 channels, so the last flows have nothing left to transform.
I'm pretty sure this config is not the config you used, as I don't think this config can even start.
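A minimal sketch of that channel bookkeeping, mirroring (as far as I understand it) the early-output logic in glow.py:

```python
# n_group channels enter the flows; every n_early_every flows, n_early_size
# channels are split off to the output and no longer transformed.
n_flows, n_group, n_early_every, n_early_size = 18, 8, 4, 2

remaining = n_group
for k in range(n_flows):
    if k % n_early_every == 0 and k > 0:
        remaining -= n_early_size
    print(f"flow {k:2d}: {remaining} channels")
# remaining drops to 0 at flow 16, so flows 16 and 17 have no channels left.
```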
@CookiePPP oh yes, I forgot; I just changed n_flows and n_layers after it broke, as this comment suggests, but I haven't tried it yet: https://github.com/NVIDIA/waveglow/issues/54#issuecomment-444383088. I was using the default config file when the gradient overflow happened.
> I don't know what dataset you use but you might be training on a lot of padded data (files under 2 seconds long will be padded with zeros).
No, I am sure all my audio files are above 2 seconds, even after I trimmed all silence at the beginning and end of each file.
@EuphoriaCelestial No problem. I'm not sure about the gradient overflow.
Try a lower learning rate for now.
(Original WaveGlow used a 16000 segment_length with batch_size 3 on 8 GPUs = 384000 samples per iteration.)
You are using far fewer samples, so gradients might be quite noisy.
edit: I've used 1.2e-3 LR before when I had 576000 samples per iter, so I know there's some space to increase/decrease safely.
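As a rough sketch of that comparison (using the figures quoted above):

```python
# samples seen per optimizer step = segment_length * batch_size * n_gpus
original = 16000 * 3 * 8   # original WaveGlow recipe -> 384000
current  = 6144 * 1 * 1    # a single GPU with batch_size=1 -> 6144
print(original, current, original / current)   # ~62x fewer samples per step
```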
okay, now I am using those params:
"learning_rate": 5e-5, "segment_length": 6144, "n_flows": 12, "n_layers": 8,
but is it okay to use batch_size=1?
well, just a few steps in and it's already messy
@EuphoriaCelestial loss scale around 256 is normal. I worry once it goes under 64.
okay, I will wait and report later
Even better, the model I'm training right now is using 64 loss scale :smile:
@CookiePPP one question: how did you know segment_length needed to be 6144 for an 8k sampling rate? I want to know how to calculate it, so I can train another one with 16k if the audio quality at 8k is too bad.
@EuphoriaCelestial
I picked something similar to the original; the original is a 16000 segment_length for a 22.05 kHz sample rate audio file.
So I know that the original WaveGlow worked well with segments a little over half a second long.
6144 is a little over half of 8000, and 6144 can be divided by the hop_length of 256, so there's no extra padding.
It does not have to be perfect, but too small makes it hard to learn low frequencies, and multiples of the hop_length waste less compute.
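A minimal sketch of that rule of thumb; the ~0.77 s target below is just my reading of the numbers in this thread, not a fixed constant:

```python
def pick_segment_length(sampling_rate, target_seconds=0.77, hop_length=256):
    """Largest multiple of hop_length not exceeding the target duration."""
    return int(sampling_rate * target_seconds) // hop_length * hop_length

print(pick_segment_length(8000))    # 6144  (~0.77 s at 8 kHz)
print(pick_segment_length(16000))   # 12288 (~0.77 s at 16 kHz)
print(pick_segment_length(22050))   # 16896 (close to the original 16000)
```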
so, if I use a 16k sample rate, segment_length should be somewhere around 15872, or lower, right? or maybe 8192? should it be near the sample rate or half the sample rate?
@EuphoriaCelestial I would use a little over half again (and a multiple of hop_length); between 8192 and 12288 would be cool. You can use anything you want, but just don't make it too small is the main thing.
The DeepLearningExamples version uses a segment_length of 4000 for 22.05 kHz, so it seems to be all over the place.
...
or does it use an 8000 segment_length?
I have no idea what's considered normal. I use half a second with my models and it works good enough for me.
@CookiePPP cool, thank you for the knowledge!
the training on the 1050 Ti got its loss scale reduced to 1.0; it's starting to generate some audible voice, but it's still as noisy as a broken radio
I also started a new training on a 2080 Ti with the same config except batch_size=24, but it also hit gradient overflow and the loss got scaled very small; after 5 epochs there's no voice yet, only noise:
@rafaelvalle any experience with this?
it crashed again :( I decided to use fp32, which seems to have solved the problem; no gradient overflow so far
I noticed that sigma is set to 1 in config.json, but it is 0.666 in the Tacotron2 inference file. Is it supposed to be like that, or do they have to be the same value?
Moreover, what is sigma?
I am training a WaveGlow model from scratch with an 8k sampling rate dataset and got this error after 1 epoch.
Since PyTorch hasn't fully supported RTX GPUs yet, I have to use my old 1050 Ti and set batch_size to 1; this is my config.json. Is batch_size being too small causing the problem, or am I using the wrong audio params?

{
    "train_config": {
        "fp16_run": true,
        "output_directory": "checkpoints",
        "epochs": 100000,
        "learning_rate": 1e-4,
        "sigma": 1.0,
        "iters_per_checkpoint": 2000,
        "batch_size": 1,
        "seed": 1234,
        "checkpoint_path": "",
        "with_tensorboard": true
    },
    "data_config": {
        "training_files": "train_files.txt",
        "segment_length": 16000,
        "sampling_rate": 8000,
        "filter_length": 1024,
        "hop_length": 256,
        "win_length": 1024,
        "mel_fmin": 0.0,
        "mel_fmax": 4000.0
    },
    "dist_config": {
        "dist_backend": "nccl",
        "dist_url": "tcp://localhost:54321"
    },
}