[Closed] niddal-imam closed this issue 5 years ago
Hi, have you set `dealwith_lossnone = True` in `params.py`? It is meant to solve this problem.
Yes, I did set `dealwith_lossnone = True`, but I am still getting a NaN loss.
Hi, can you try adding the following to `model/crnn.py`:

```python
# replace all nan/inf in gradients with zero
if params.dealwith_lossnone:
    self.register_backward_hook(self.backward_hook)
```
and add this to `train.py`. NOTE THE LOCATION!

```python
if params.cuda and torch.cuda.is_available():
    crnn.cuda()
    if params.multi_gpu:
        crnn = torch.nn.DataParallel(crnn, device_ids=range(params.ngpu))
    image = image.cuda()
    criterion = criterion.cuda()
image = Variable(image)
text = Variable(text)
length = Variable(length)

# new ----------------------------------------------------
if params.dealwith_lossnone:
    crnn.register_backward_hook(crnn.backward_hook)
# new ----------------------------------------------------

# loss averager
loss_avg = utils.averager()
```
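For context, the hook being registered presumably replaces any NaN/inf values in the incoming gradients with zero so that a single bad batch cannot corrupt the weights. Here is a minimal, self-contained sketch of such a hook; the `TinyNet` module and the exact body of `backward_hook` are illustrative assumptions based on this thread, not the repo's actual code:

```python
import torch
import torch.nn as nn

class TinyNet(nn.Module):
    def __init__(self):
        super().__init__()
        self.fc = nn.Linear(4, 2)

    def forward(self, x):
        return self.fc(x)

    @staticmethod
    def backward_hook(module, grad_input, grad_output):
        # Replace every NaN/inf in the incoming gradients with zero,
        # leaving None entries (unused inputs) untouched.
        cleaned = []
        for g in grad_input:
            if g is not None:
                g = torch.where(torch.isfinite(g), g, torch.zeros_like(g))
            cleaned.append(g)
        return tuple(cleaned)

net = TinyNet()
net.register_backward_hook(net.backward_hook)
```

Note that `register_backward_hook` is deprecated in newer PyTorch releases in favor of `register_full_backward_hook`, which has better-defined `grad_input`/`grad_output` semantics.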
Thank you for your help, but the loss is still NaN.
Hello, have you solved the problem yet? I have run into the same issue.
Hi, please try replacing `train.py` and `model/crnn.py` with the files in the zip folder. Many people have hit the same problem:

https://github.com/pytorch/pytorch/issues/14401
https://github.com/pytorch/pytorch/issues/14335
Thanks for your quick reply. I have replaced the files with the ones in the zip folder, but the issue hasn't been resolved yet. I studied the code you provided in the zip folder and noticed the `register_backward_hook` section you added. Now I think there may be something wrong with my training dataset, so I have a question: how can I judge whether a picture is dirty or not?
Can you give me an example of a dirty picture? We can work out a solution together.
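Not from the repo, just a hedged guess at what "dirty" can mean here: with CTC loss, a classic bad sample is an image whose label is longer than the network's output sequence, which drives the loss to inf/NaN. Assuming a standard CRNN that downsamples the image width by a factor of 4 (an assumption, check your architecture), a rough pre-training filter might look like:

```python
def max_ctc_label_len(image_width, downsample=4):
    """Approximate number of CTC time steps a CRNN produces for an image.

    downsample=4 is an assumption: the standard CRNN halves the width twice.
    """
    return image_width // downsample

def is_dirty(image_width, label):
    """A sample is 'dirty' for CTC if its label cannot fit into the output
    sequence (label longer than the number of time steps). Repeated
    characters need extra blank steps, so this check is optimistic."""
    return len(label) > max_ctc_label_len(image_width)

# A 100px-wide image yields ~25 time steps, so a 30-char label is dirty.
print(is_dirty(100, "a" * 30))  # → True
print(is_dirty(100, "hello"))   # → False
```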
Hi Holmeyoung,
I have replaced the files, but I am still getting a NaN loss. Although I have reinstalled CUDA, I still can't solve the problem. Thank you very much for your help; it is probably something wrong with my laptop.
@Holmeyoung I am sorry, I haven't found the dirty picture yet, but I have narrowed it down to a range. Thank you for your kindness. If I have any more questions, I will ask you.
I am sorry to hear that, @niddal-imam. Many people run into this problem when using PyTorch, so I am also unsure of the cause. There does, however, seem to be a bug in PyTorch's CTC loss; anyway, I am really sorry I can't solve it for you all.
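For anyone landing here on a newer PyTorch: `torch.nn.CTCLoss` accepts a `zero_infinity` flag (since PyTorch 1.1) that zeroes infinite losses and their gradients, which targets exactly this failure mode. A small sketch (the shapes and values below are made up for illustration):

```python
import torch

# A target longer than the input sequence makes the CTC loss infinite;
# zero_infinity=True replaces that inf (and its gradient) with zero.
ctc = torch.nn.CTCLoss(blank=0, zero_infinity=True)

T, C, N = 5, 10, 1                       # time steps, classes, batch size
log_probs = torch.randn(T, N, C).log_softmax(2).requires_grad_()
targets = torch.randint(1, C, (N, 8))    # target length 8 > T=5
input_lengths = torch.full((N,), T, dtype=torch.long)
target_lengths = torch.full((N,), 8, dtype=torch.long)

loss = ctc(log_probs, targets, input_lengths, target_lengths)
print(loss.item())  # 0.0 instead of inf
```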
Hi, please try the latest code; I think I have fixed it. Thank you~~~
Hey @Holmeyoung, I'm using the latest version of the repo and am still facing the issue. Is there a way to fix it?
Hi,
I have used `train.py` many times without any issues. However, now the loss is always NaN whenever `cuda` is True. I think the problem is with my laptop, so do you have any idea how to solve this? Thanks