facebookresearch / ppuda

Code for Parameter Prediction for Unseen Deep Architectures (NeurIPS 2021)
MIT License

Runtime Error: Loss is nan #4

Open MichailChatzianastasis opened 2 years ago

MichailChatzianastasis commented 2 years ago

Hey, while I was training the GHN and MLP models, at around 220 epochs I got the following error: error <class 'RuntimeError'> the loss is nan, unable to proceed. Do you have a solution for this?

Error Message:

error <class 'RuntimeError'> the loss is nan, unable to proceed
error <class 'RuntimeError'> the loss is nan, unable to proceed
error <class 'RuntimeError'> the loss is nan, unable to proceed
error <class 'RuntimeError'> the loss is nan, unable to proceed
error <class 'RuntimeError'> the loss is nan, unable to proceed
error <class 'RuntimeError'> the loss is nan, unable to proceed
Out of patience (after 15 attempts to continue), please restart the job with another seed !!!
Traceback (most recent call last):
  File "/ppuda/experiments/train_ghn.py", line 168, in <module>
    main()
  File "/ppuda/experiments/train_ghn.py", line 105, in main
    loss = trainer.update(nets_torch, images, targets, ghn=ghn, graphs=graphs)
  File "/ppuda/../ppuda/ppuda/utils/trainer.py", line 101, in update
    raise RuntimeError('the loss is {}, unable to proceed'.format(loss))
RuntimeError: the loss is nan, unable to proceed

bknyaz commented 2 years ago

A simple solution is to restart the job and load from the saved GHN checkpoint. I created a pull request https://github.com/facebookresearch/ppuda/pull/5, where I added the code to load the existing GHN checkpoint and resume training.
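
For readers who want to see the general pattern, here is a minimal sketch of checkpoint-and-resume in a plain PyTorch loop. It is not the code from the pull request: the helper names `save_checkpoint`, `load_checkpoint`, and `train`, the checkpoint keys, and the MSE loss are all assumptions made for illustration. The idea is to save the model and optimizer state at the end of every epoch, so that a job killed by a NaN loss can restart from the last good epoch rather than from scratch.

```python
import math
import os

import torch
import torch.nn as nn


def save_checkpoint(path, model, optimizer, epoch):
    """Store what is needed to resume: weights, optimizer state, epoch.

    The dictionary keys are hypothetical, chosen for this sketch only.
    """
    torch.save({"state_dict": model.state_dict(),
                "optimizer": optimizer.state_dict(),
                "epoch": epoch}, path)


def load_checkpoint(path, model, optimizer):
    """Restore model and optimizer in place; return the epoch to resume from."""
    if not os.path.exists(path):
        return 0  # no checkpoint yet, so start from the first epoch
    state = torch.load(path, map_location="cpu")
    model.load_state_dict(state["state_dict"])
    optimizer.load_state_dict(state["optimizer"])
    return state["epoch"] + 1  # the saved epoch was already completed


def train(model, optimizer, data, epochs, ckpt_path):
    """Train for `epochs` epochs, resuming from `ckpt_path` if it exists.

    Returns the epoch index at which this run started.
    """
    start_epoch = load_checkpoint(ckpt_path, model, optimizer)
    for epoch in range(start_epoch, epochs):
        for x, y in data:
            optimizer.zero_grad()
            loss = nn.functional.mse_loss(model(x), y)
            # Stop before a non-finite loss can corrupt the weights.
            if not math.isfinite(loss.item()):
                raise RuntimeError(
                    "the loss is {}, unable to proceed".format(loss.item()))
            loss.backward()
            optimizer.step()
        # Only reached after a clean epoch, so the file always holds good state.
        save_checkpoint(ckpt_path, model, optimizer, epoch)
    return start_epoch
```

Because the checkpoint is written only after an epoch finishes without error, rerunning the same call with the same `ckpt_path` picks up at the epoch after the last successful one. If the NaN reappears at the same point, restarting with a different seed, as the log message suggests, may still be needed.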

Let me know if this does not help. Otherwise, feel free to close the issue.