evonneng / learning2listen

Official pytorch implementation for Learning to Listen: Modeling Non-Deterministic Dyadic Facial Motion (CVPR 2022)

Restarting VQ-GAN training from a checkpoint breaks the training loss #9

Closed Daksitha closed 1 year ago

Daksitha commented 1 year ago

Hi @evonneng, sometimes the VQ-GAN training stops midway and I have to restart it due to technical issues with our server. When I restart the training from the checkpoint, the training loss goes haywire, as shown by the green training-loss graph in my previous issue.

Have you come across this issue? I was wondering whether one should also save the loss in the checkpoint and load it when restarting. My checkpoint currently contains:

```python
checkpoint = {
    'config': args.config,
    'state_dict': generator.state_dict(),
    'optimizer': {
        'optimizer': g_optimizer._optimizer.state_dict(),
        'n_steps': g_optimizer.n_steps,
    },
    'epoch': epoch,
}
```

evonneng commented 1 year ago

Hi! Thanks for pointing this out! Yes, this is a common behaviour that I saw as well. Saving the loss in the checkpoint and reloading it when you resume training does help with this problem.