MichailChatzianastasis opened this issue 2 years ago
A simple solution is to restart the job and load from the saved GHN checkpoint. I created a pull request https://github.com/facebookresearch/ppuda/pull/5, where I added the code to load the existing GHN checkpoint and resume training.
Let me know if this does not help. Otherwise, feel free to close the issue.
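In case it helps before the PR is merged, here is a minimal sketch of what resuming looks like. It assumes the checkpoint is a dict with 'state_dict', 'optimizer' and 'epoch' keys and lives at a hypothetical path; the actual layout and path used in the PR may differ.

```python
import torch

def resume_from_checkpoint(ghn, optimizer, ckpt_path='checkpoints/ghn.pt'):
    """Restore GHN and optimizer state from a saved checkpoint.

    Assumed checkpoint layout: {'state_dict': ..., 'optimizer': ..., 'epoch': ...}.
    Returns the epoch to continue training from.
    """
    checkpoint = torch.load(ckpt_path, map_location='cpu')
    ghn.load_state_dict(checkpoint['state_dict'])        # restore GHN weights
    optimizer.load_state_dict(checkpoint['optimizer'])   # restore optimizer state
    return checkpoint['epoch'] + 1                       # next epoch to run

# Usage inside the training script, after ghn and optimizer are built:
# start_epoch = resume_from_checkpoint(ghn, optimizer)
```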
Hey, while I was training the GHN and MLP models, at around epoch 220 I got the following error: RuntimeError: the loss is nan, unable to proceed. Do you have any solution for this?
Error message:

error <class 'RuntimeError'> the loss is nan, unable to proceed
error <class 'RuntimeError'> the loss is nan, unable to proceed
error <class 'RuntimeError'> the loss is nan, unable to proceed
error <class 'RuntimeError'> the loss is nan, unable to proceed
error <class 'RuntimeError'> the loss is nan, unable to proceed
error <class 'RuntimeError'> the loss is nan, unable to proceed
Out of patience (after 15 attempts to continue), please restart the job with another seed !!!
Traceback (most recent call last):
  File "/ppuda/experiments/train_ghn.py", line 168, in <module>
    main()
  File "/ppuda/experiments/train_ghn.py", line 105, in main
    loss = trainer.update(nets_torch, images, targets, ghn=ghn, graphs=graphs)
  File "/ppuda/../ppuda/ppuda/utils/trainer.py", line 101, in update
    raise RuntimeError('the loss is {}, unable to proceed'.format(loss))
RuntimeError: the loss is nan, unable to proceed
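For context, the behaviour visible in the log (tolerate a few NaN steps, then give up after 15 attempts) follows roughly the pattern sketched below. This is only an illustration of that pattern, not the actual code in ppuda/utils/trainer.py; the function name and counter variable are made up.

```python
import math

MAX_ATTEMPTS = 15  # matches "Out of patience (after 15 attempts to continue)" in the log

def guard_loss(loss_value, bad_steps):
    """Skip a limited number of consecutive NaN losses, then abort.

    loss_value: scalar loss for the current step.
    bad_steps:  count of consecutive steps with a non-finite loss so far.
    """
    if math.isnan(loss_value):
        bad_steps += 1
        if bad_steps > MAX_ATTEMPTS:
            raise RuntimeError('the loss is {}, unable to proceed'.format(loss_value))
    else:
        bad_steps = 0  # reset the counter once a valid loss is seen again
    return bad_steps
```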