Hi @evonneng,
Sometimes the VQ-GAN training stops midway and I have to restart it due to technical issues with our server. When I restart the training from the checkpoint, the training loss goes haywire, as shown in the green training-loss curve from my previous issue.
Have you come across this issue? I was wondering whether one should also save the loss in the checkpoint and load it when restarting. For reference, this is what I currently save:
checkpoint = {
    'config': args.config,                                  # experiment config
    'state_dict': generator.state_dict(),                   # generator weights
    'optimizer': {
        'optimizer': g_optimizer._optimizer.state_dict(),   # inner optimizer state
        'n_steps': g_optimizer.n_steps,                     # step counter for the LR schedule
    },
    'epoch': epoch,                                         # epoch to resume from
}
Hi! Thanks for pointing this out! Yes, this is a common behaviour that I saw as well. Saving the loss in the checkpoint and reloading it when you resume training does help with this problem.