NVIDIA / waveglow

A Flow-based Generative Network for Speech Synthesis
BSD 3-Clause "New" or "Revised" License

Loss instability during training #123

Closed julianzaidi closed 5 years ago

julianzaidi commented 5 years ago

I have already trained a WaveGlow model from scratch on the LJ Speech dataset and everything worked well during training.

I am now trying to train a new model on a private dataset that contains only 2 hours of speech. Some audio clips are shorter than segment_length=16000 (approximately 10 clips in a dataset of 2300). This training is performed in FP32 and, except for batch_size=24, I use the same hyper-parameters as the ones in config.json.

Training loss decreases slowly during 120k iterations (which represents a lot of epochs for my small dataset), but further iterations lead to two types of errors: sudden spikes in the loss, and eventually NaN values.

I have already used this private dataset with WaveNet and everything worked well, which suggests that the dataset is not corrupted.

I tried to decrease the learning rate but instability still persists. Any insight or help to understand these problems would be greatly appreciated.

emanzanoaxa commented 5 years ago

I have the same problem: after ~30k iterations the loss starts to jump to positive values and finally ends up as NaN. I'm using sigma = sqrt(0.5) and learning rate = 1e-4 (I also tried lowering the learning rate to 5e-5 and using sigma = 1.0).

My dataset has more than 10 hours of speech (9000 audios), but none of my audio clips is longer than 16000 milliseconds. Also, I have removed empty audio files and the audios are amplified/normalized using ffmpeg-normalize before training.

I don't know what the problem could be. Do all audio files need to be longer than segment_length?

rafaelvalle commented 5 years ago

Can you check, while computing the loss during training, whether any of the predicted z values are equal to zero?

emanzanoaxa commented 5 years ago

Not sure, but if some z value is equal to zero, what could be the cause?

julianzaidi commented 5 years ago

I can check that, @rafaelvalle. I have already restarted the training from scratch with a smaller batch_size, and I have modified the DataLoader to ignore audio clips that are shorter than segment_length=16000. I will see how the loss behaves with this new setting, and as soon as I see a new anomaly I will start tracking the predicted z values. I will keep you informed.

@emanzanoaxa, all I know is that a z value equal to 0 suggests that the model fully contracted the input data (audio) into the highest density region of the latent space (i.e., 0), which is bad for reconstruction. The second term of the loss (the log-determinant of the Jacobian of the transform) is there to penalize such a contraction: in the case of a contraction, the determinant of the Jacobian would be 0, which would cause an infinite error. So a predicted z value equal to 0 would suggest that this second term of the loss was "ignored" during training. Why? Good question... maybe an outlier batch, or a learning rate that is too big. It needs to be investigated.
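
For reference, here is a minimal sketch of the kind of check suggested above, assuming the model output is the (z, log_s_list, log_det_W_list) triple that the loss consumes; the helper name check_outputs is made up for illustration and could be called on the model output right before the loss is computed:

import torch

def check_outputs(model_output):
    """Hypothetical debugging helper: flag z values that collapse to zero and
    non-finite log_s / log_det_W terms before they reach the loss."""
    z, log_s_list, log_det_W_list = model_output
    n_zero = (z == 0).sum().item()
    if n_zero > 0:
        print("warning: {} predicted z values are exactly zero".format(n_zero))
    for i, (log_s, log_det_W) in enumerate(zip(log_s_list, log_det_W_list)):
        if not torch.isfinite(log_s).all() or not torch.isfinite(log_det_W).all():
            print("warning: non-finite log_s or log_det_W in flow {}".format(i))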

emanzanoaxa commented 5 years ago

@julianzaidi about the learning rate, I've tried multiple values and it always ends up the same: the loss does not go below ~-5.3 and finally it goes NaN on my custom dataset. About segment_length I'm confused: the LJ Speech dataset (the one suggested for use with WaveGlow) has an average audio duration of 10 seconds per clip, so the majority of the clips won't be longer than 16000 ms (16 s). So why does the default config have segment_length=16000 if the dataset clips are shorter than that? I also don't know the effect of lowering the segment_length value on the final quality of a well-trained model; do you know if I can reduce it without losing quality in the result?

nshmyrev commented 5 years ago

so the majority of the clips won't be longer than 16000 ms (16 s). So why does the default config have segment_length=16000

16000 is in samples, not milliseconds. So the minimum size is 1 second, or the audio will be padded with zeros.

Consider also https://github.com/NVIDIA/waveglow/issues/95 on how segment_length causes NaNs

julianzaidi commented 5 years ago

I also confirm that since I modified mel2samp to skip audio clips that are shorter than segment_length, the loss is stable. As @nshmyrev mentioned in #95, padding with zeros in mel2samp might be the problem.

rafaelvalle commented 5 years ago

Thanks for sharing this, @julianzaidi.

rafaelvalle commented 5 years ago

Beware that the default sampling rate in the config is 22050, so 16000 samples is shorter than 1 second.
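
To make the units concrete, segment_length counts samples, so the segment duration follows directly from the sampling rate (values below are the defaults quoted in this thread):

sampling_rate = 22050    # default sampling rate in config.json
segment_length = 16000   # default segment_length, in samples
print(segment_length / sampling_rate)  # ~0.73 seconds per training segment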

emanzanoaxa commented 5 years ago

Oh I see, I thought the 16000 value was in ms, not samples. @julianzaidi, can you share the modified code to ignore audios shorter than segment_length=16000? I did this dummy fix:

class Mel2Samp(torch.utils.data.Dataset):
    """
    This is the main class that calculates the spectrogram and returns the
    spectrogram, audio pair.
    """

    def __init__(self, training_files, segment_length, filter_length,
                 hop_length, win_length, sampling_rate, mel_fmin, mel_fmax):
        self.audio_files = files_to_list(training_files)
        for file in self.audio_files:
            audio_data, sample_r = load_wav_to_torch(file)
            if audio_data.size(0) < segment_length:
                self.audio_files.remove(file)
        random.seed(1234)

The loss seems to be more stable now.

rafaelvalle commented 5 years ago

You can do something of this sort to create an audio file list that includes only files larger than 0.5 MB:

find . -name "*.wav" -size +0.5M > audio_filelist.txt

Change that size to the desired file size.

acrosson commented 5 years ago

I removed any audio clips that were under one second. I'm still seeing instability during training: spikes in the loss, and then it crashes with NaN.

@rafaelvalle any other suggestions?

rafaelvalle commented 5 years ago

Check if you have files with lots of silence and remove them. You can also trim silence from the beginning and end of files.
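
One way to do that trimming offline, as a sketch assuming librosa and soundfile are installed (neither is part of this repo; file names are placeholders):

import librosa
import soundfile as sf

# Load at the original sampling rate, trim leading/trailing audio quieter than
# 40 dB below the peak, and write the result back out.
audio, sr = librosa.load("clip.wav", sr=None)
trimmed, _ = librosa.effects.trim(audio, top_db=40)
sf.write("clip_trimmed.wav", trimmed, sr)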

puppyapple commented 4 years ago

@rafaelvalle hello, and thanks for your great work! I'm trying to train a model on my custom dataset (48 kHz sampling rate). I made sure that all my audio samples are longer than the segment length (which I set to 16000), so there is no 'padding with zeros' problem, and I also followed the advice in https://github.com/NVIDIA/waveglow/issues/155 by @jongwook to use std() to avoid zero sequences. But I still get NaN loss after several epochs. Any idea what else could cause this issue? Thanks

MuyangDu commented 4 years ago

But I still get NaN loss after several epochs. Any idea what else could cause this issue?

Same here. I have used VAD to remove all the silent parts in the wavs and made sure all the audio is longer than segment_length. I have also used .std() to make sure there are no silent pieces in the training data, and I have replaced torch.logdet(W) with torch.det(W).abs().log(). None of the above helps: the loss becomes NaN when it gets to around -4.6. Did you find a way to solve this?
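
As a toy illustration of the torch.logdet vs torch.det(W).abs().log() substitution mentioned above: torch.logdet returns NaN for a weight matrix whose determinant is negative, while the rewritten form stays finite (at the cost of dropping the sign):

import torch

W = torch.tensor([[0., 1.],
                  [1., 0.]])        # det(W) = -1
print(torch.logdet(W))              # nan
print(torch.det(W).abs().log())     # tensor(0.) = log(|det W|)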

rafaelvalle commented 4 years ago

For the people getting NaNs, please let us know whether you're training in FP16 or FP32 and whether you have NaNs in your log_s.

randombrein commented 3 years ago

@julianzaidi, can you share the modified code to ignore audios shorter than segment_length=16000? I did this dummy fix: [...]

You should avoid modifying a list while iterating over it; use a copy of the list instead (list(self.audio_files)), or better yet a list comprehension:

# remove audio files less than segment_length
print("#audio_files={}".format(len(self)))
self.audio_files[:] = [f for f in self.audio_files if
                       load_wav_to_torch(f)[0].size(0) >= segment_length]
print("#audio_files after removal={}".format(len(self)))