fatchord / WaveRNN

WaveRNN Vocoder + TTS
https://fatchord.github.io/model_outputs/
MIT License

Evaluation deterioration during training #87

Closed begeekmyfriend closed 5 years ago

begeekmyfriend commented 5 years ago

The attachment is my test with a fixed learning rate of 1e-4. We can see the evaluation deteriorate from 275k to 300k steps. It seems the learning failed to converge. fixed_lr.zip

After adding a rough learning rate schedule as follows, this issue never happened again.

# Decay the learning rate by 10x every 2000 epochs.
lr = learning_rate * (0.1 ** ((model.get_step() // total_iters + 1) // 2000))
for p in optimiser.param_groups:
    p['lr'] = lr

More elegant scheduling methods can be found here. However, 1e-3 seems too large for the initial rate.
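For reference, a minimal sketch of one such built-in scheduler, assuming a PyTorch model and a per-epoch training loop (the Linear model and the commented-out train_one_epoch call are placeholders, not part of this repo). StepLR reproduces the same multiply-by-0.1-every-2000-epochs rule as the manual code above without the bookkeeping.

import torch

model = torch.nn.Linear(10, 1)  # stand-in for the WaveRNN model
optimiser = torch.optim.Adam(model.parameters(), lr=1e-4)

# Multiply the learning rate by 0.1 every 2000 epochs.
scheduler = torch.optim.lr_scheduler.StepLR(optimiser, step_size=2000, gamma=0.1)

for epoch in range(6000):
    # train_one_epoch(model, optimiser)  # placeholder for the actual training step
    scheduler.step()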

fatchord commented 5 years ago

@begeekmyfriend The reason there isn't a learning rate schedule is that the number of steps needed to train WaveRNN varies wildly depending on whether it uses 8-bit/9-bit or mixture-of-logistics output. For example, an 8-bit model will converge easily within 300,000 steps, whereas mixture of logistics is closer to one million steps. If you change the sampling rate, this will again change the number of training steps needed. So a one-size-fits-all approach is tricky.

My best advice for anyone training WaveRNN is to train at an lr of 1e-4 for as long as it takes for the audio to sound good. Then spend a couple of hours converging the model with lrs of 5e-5 and 1e-5. You can do this easily with python train_wavernn.py --lr 5e-5

As for the audio deteriorating in the samples you posted: this is just the model slipping into a crappy local minimum - in my experience it only takes an epoch or two to bounce out of it and audio quality will drastically improve.

begeekmyfriend commented 5 years ago

Thank you for your response. Closing it. Please feel free to reopen this issue.

begeekmyfriend commented 5 years ago

Hey, I think I have found the reason that might lead to deterioration during training. It might be because the GTA mel sequence lengths do not match the quantised sample lengths. So I have written a script to check for this issue. Hope it is useful for other people. It probably has nothing to do with the learning rate.

May I reopen this issue for information?

import os
import numpy as np

hop = 275  # hop length in samples
for root, _, files in os.walk('gta'):
    for f in files:
        gta = os.path.join(root, f)
        quant = os.path.join('quant', f)
        len1 = np.load(gta).shape[1] * hop  # expected sample count from mel frames
        len2 = np.load(quant).shape[0]      # actual quantised sample count
        diff = np.abs(len2 - len1)
        if diff >= hop:                     # flag files off by a full hop length or more
            print(diff)

CorentinJ commented 5 years ago

I've found something on my own. It seems that stopping and restarting the training is the cause of the artifacts. Take a look at these samples. I kept the training running until 91k steps, stopped to do something else on my GPU, then restarted from 91k. You can hear artifacts in the samples generated at 94k steps that weren't there before. Additionally, I notice that on restarting, the initial loss is higher than where it was left off. I think this is because you do not save the optimizer state with the model. I think you should; I usually do it with Adam:

import torch

# Saving
torch.save({
    "model_state": model.state_dict(),
    "optimizer_state": optimizer.state_dict(),
}, path)

# Loading
checkpoint = torch.load(path)
model.load_state_dict(checkpoint["model_state"])
optimizer.load_state_dict(checkpoint["optimizer_state"])

CorentinJ commented 5 years ago

You're right @begeekmyfriend, that may actually be the cause of my problem too. I have varying differences in length between the mel spectrograms and the audios. This happens whether I use the dsp.melspectrogram function from the repo or my own spectrograms generated by another implementation of Tacotron. It also happens with both voc_mode = 'MOL' and voc_mode = 'RAW'. Is this something you expected and that the model already handles, @fatchord?

In the meantime, I assume you can pad or trim the audio to match the spectrogram. librosa.stft is the function that adds the padding (by default, equally on both sides) to the signal before computing the spectrogram. I would either add zero padding at the start and end of the audio after computing the spectrogram, pad at the start or the end before computing it, or simply trim the audio before computing the spectrogram.

edit: I can confirm the artifacts have disappeared after I made the fix
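
For anyone wanting to apply the same kind of fix, here is a minimal sketch of the pad-or-trim idea, assuming mel arrays shaped (n_mels, frames) as in the check script above (match_audio_to_mel and the default hop of 275 are illustrative, not part of the repo):

import numpy as np

def match_audio_to_mel(wav, mel, hop_length=275):
    # Pad with zeros or trim so that len(wav) == mel frames * hop_length.
    target_len = mel.shape[1] * hop_length
    if len(wav) < target_len:
        wav = np.pad(wav, (0, target_len - len(wav)), mode='constant')
    return wav[:target_len]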

begeekmyfriend commented 5 years ago

In fact I do the preprocessing of GTA mel spectrograms and quantised labels in my own Tacotron code; I do not use the preprocess.py provided by this project. It ensures that the length difference between the GTA mel spectrograms and the quantised labels stays within one hop length.

oytunturk commented 5 years ago

A good way to resolve mismatches in total samples vs total frames might be to pad each waveform with inaudible white noise of sufficient length.
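
A minimal sketch of that suggestion, assuming float waveforms normalised to [-1, 1] (pad_with_noise and the noise amplitude are illustrative choices, not from the repo):

import numpy as np

def pad_with_noise(wav, target_len, noise_amp=1e-4):
    # Pad the waveform up to target_len with very-low-amplitude white noise
    # instead of pure silence; assumes a float waveform in [-1, 1].
    pad = max(target_len - len(wav), 0)
    noise = np.random.randn(pad).astype(wav.dtype) * noise_amp
    return np.concatenate([wav, noise])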

begeekmyfriend commented 5 years ago

No, the thing is we have to ensure the locations of every phoneme in the GTA mel spectrograms and the ground-truth wav clips match each other, so padding might not be a good way. In fact, from my observation with spectrogram tools such as Adobe Audition, the Tacotron 2 prediction can ensure this, because the stop-token prediction forces alignment with the target length.

oytunturk commented 5 years ago

@begeekmyfriend, I thought you were seeing a mismatch towards the end of each recording. If the mismatch is all along the waveform, it's definitely more serious. It would be great if you could share some samples/figures.

begeekmyfriend commented 5 years ago

Sure, I am glad to share my samples. They are clips of ground-truth audio files and files converted from GTA mel spectrograms. You can see the frames are almost aligned. gta_vs_gt_wav.zip

oytunturk commented 5 years ago

Yes, these are looking good. Do you have unaligned samples though, i.e. the ones which made you think that there is an alignment issue going on?

begeekmyfriend commented 5 years ago

In the first post there are deteriorated evaluations produced with unaligned training samples, which I have since deleted. You can use my script from the fourth post above to check whether any sample lengths are inconsistent.

oytunturk commented 5 years ago

Yes, thanks! That would be useful. I'd like to understand why misalignment occurs and fix it instead of skipping training samples. I'll share any insights I might have.

fatchord commented 5 years ago

@begeekmyfriend It doesn't matter if the lengths of the mel/sample sequences don't match up exactly, since the raw samples are padded by two hop lengths before they enter WaveRNN. You can't misalign samples that are never seen by the model.
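
Roughly, the idea is the following (a sketch only, not the repo's actual dataset code; the hop length and the placement of the padding are assumptions):

import numpy as np

hop_length = 275            # assumed hop length, as in the check script above
pad = 2 * hop_length        # "two hop lengths" of extra samples

def pad_raw_samples(quant):
    # Zero-pad the raw sample sequence so that a small length mismatch at the
    # edges never reaches the model (exact placement in the repo may differ).
    return np.pad(quant, (0, pad), mode='constant')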