evonneng / learning2listen

Official pytorch implementation for Learning to Listen: Modeling Non-Deterministic Dyadic Facial Motion (CVPR 2022)

Restarting VQ-GAN training from a checkpoint breaks the training loss #9

Closed Daksitha closed 1 year ago

Daksitha commented 1 year ago

Hi @evonneng, sometimes the VQ-GAN training stops midway and I have to restart it due to technical issues with our server. When I restart the training from the checkpoint, the training loss goes haywire, as shown by the green training-loss graph in my previous issue.

Have you come across this issue? I was wondering whether one should also save the loss in the checkpoint and load it when restarting. My checkpoint currently contains:

```python
checkpoint = {
    'config': args.config,
    'state_dict': generator.state_dict(),
    'optimizer': {
        'optimizer': g_optimizer._optimizer.state_dict(),
        'n_steps': g_optimizer.n_steps,
    },
    'epoch': epoch,
}
```

evonneng commented 1 year ago

Hi! Thanks for pointing this out! Yes, this is a common behaviour that I saw as well. Saving the loss in the checkpoint and reloading it when you resume training does help with this problem.