DigitalPhonetics / IMS-Toucan

Controllable and fast Text-to-Speech for over 7000 languages!
Apache License 2.0

Training Issue: Reconstruction Loss Resets After Some Epochs #170

Closed lpscr closed 3 months ago

lpscr commented 3 months ago

Hi,

Thank you very much for the great repo. I loaded a small dataset of about 12 hours and 7670 datapoints. I tried to load more, but I ran out of memory, as mentioned here: https://github.com/DigitalPhonetics/IMS-Toucan/issues/169#issue-2354213588

So, for testing I loaded this smaller dataset of about 12 hours and 7600 datapoints, and I want to train from scratch. Training started very well and the model was learning quickly at first. After some epochs the reconstruction loss was about 0.4083 at step 7887. At the next step, however, it suddenly jumped to a very high value of about 16.00, as if training had started over from the beginning, and it then stayed around 16.01.

I also tested best.pt to hear how it sounds, and it was completely wrong: only noise, nothing usable. Does this mean the model is now learning from the beginning again? Is this normal? I need to understand what is happening. Should I keep training? The loss is going down very slowly now; after 30 minutes it is still around 7.0 and the quality is very bad. I will leave it running to see what happens, but please let me know whether this is normal.

Here is also a picture for reference:

[screenshot of the training loss curve]

The training call I use:

    train_loop(net=model,
               datasets=[train_data],
               device=device,
               save_directory=save_dir,
               batch_size=32,
               eval_lang="eng",
               warmup_steps=4000,
               lr=1e-3,
               fine_tune=False,  # training from scratch, not fine-tuning
               resume=resume,
               steps=1000000,
               use_wandb=False,
               train_samplers=[torch.utils.data.RandomSampler(train_data)],
               gpu_count=0)
lpscr commented 3 months ago

The training seems to be working now. After one day, the loss is about 0.32 at 40k steps. I guess the jump is normal the first time it happens: the loss goes up and then slowly comes down again. Thank you very much for the amazing repo; it looks very cool. Now I just need to know whether it is possible to train on a larger dataset with multiple speakers. Please also let me know if this is the correct way to train from scratch, so I can be sure I am doing everything right. I want to train on a big dataset, but I am running out of memory. Any solution would be greatly appreciated.

Flux9665 commented 3 months ago

You have 4000 warmup steps in your config, which means the learning rate increases slowly for the first 4000 steps, so the model does not make large changes based on its random initialization. In the same spirit there is a second warmup period, which is not written explicitly in the config because it is simply 2 * warmup_steps. Before that point, the model predicts the spectrogram directly, so the reconstruction loss is the only one that matters.

After this second warmup period, a second decoder comes into play, which is built as a normalizing flow. This decoder is much better at handling small details. When it takes over, the loss increases and the spectrograms look bad for a while, because this better decoder only starts training at 2 * warmup_steps. So this is not unexpected, all good. In the next version, the normalizing flow will be replaced with a conditional flow matching model, which is even better at the fine details, but that will take a bit more time.
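
This also matches the numbers in the original post: with warmup_steps=4000, the jump appeared around step 7887, which is close to 2 * warmup_steps = 8000. For intuition, here is a minimal, self-contained sketch of such a two-phase schedule. This is not the actual IMS-Toucan training code; the stand-in model, dummy data, and loss names are placeholders for illustration. It shows a linear learning-rate warmup over the first warmup_steps updates, and a second loss term that is only switched on after 2 * warmup_steps, which is where the logged loss jumps.

    # Minimal sketch of the two-phase schedule; NOT the actual IMS-Toucan code.
    # The stand-in model, dummy data, and loss names are placeholders.
    import torch

    warmup_steps = 4000
    base_lr = 1e-3

    model = torch.nn.Linear(80, 80)  # stand-in for the TTS model
    optimizer = torch.optim.Adam(model.parameters(), lr=base_lr)

    # Linear learning-rate warmup over the first warmup_steps updates.
    scheduler = torch.optim.lr_scheduler.LambdaLR(
        optimizer, lr_lambda=lambda step: min(1.0, (step + 1) / warmup_steps))

    for step in range(10 * warmup_steps):
        spec_target = torch.randn(16, 80)              # dummy "spectrogram" batch
        spec_pred = model(spec_target)
        reconstruction_loss = torch.nn.functional.l1_loss(spec_pred, spec_target)
        flow_loss = 2.0 * reconstruction_loss          # stand-in for the flow decoder's loss

        if step < 2 * warmup_steps:
            # Before 2 * warmup_steps: only the direct spectrogram prediction
            # is trained, so the reconstruction loss is the only one that counts.
            loss = reconstruction_loss
        else:
            # After 2 * warmup_steps: the flow-style decoder joins in; its loss
            # starts high, which is why the logged loss jumps before dropping again.
            loss = reconstruction_loss + flow_loss

        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        scheduler.step()

In the real training loop the losses and their weights are of course different, but the shape of the curve is the same: a ramp-up, a jump near 2 * warmup_steps, and then a slow decline again.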