Holmeyoung / crnn-pytorch

Pytorch implementation of CRNN (CNN + RNN + CTCLoss) for all language OCR.
MIT License

Loss turns into 'nan' when cuda is True #20

Closed niddal-imam closed 5 years ago

niddal-imam commented 5 years ago

Hi,

I have used train.py many times without any issues. However, now the loss is always nan when cuda is True. I think the problem is on my laptop, so any idea how to solve this issue? Thanks

Holmeyoung commented 5 years ago

Hi, have you set dealwith_lossnone = True in params.py? It's meant to solve this problem.

niddal-imam commented 5 years ago

Yes, I did set 'dealwith_lossnone = True', but I am still getting loss nan.

Holmeyoung commented 5 years ago

Hi, can you try adding the following in model/crnn.py

        # replace all nan/inf in gradients to zero
        if params.dealwith_lossnone:
            self.register_backward_hook(self.backward_hook)

and add this to train.py. NOTE THE LOCATION!

if params.cuda and torch.cuda.is_available():
    crnn.cuda()
    if params.multi_gpu:
        crnn = torch.nn.DataParallel(crnn, device_ids=range(params.ngpu))
    image = image.cuda()
    criterion = criterion.cuda()
image = Variable(image)
text = Variable(text)
length = Variable(length)

# new ----------------------------------------------------
if params.dealwith_lossnone:        
    crnn.register_backward_hook(crnn.backward_hook)
# new ----------------------------------------------------

# loss averager
loss_avg = utils.averager()   
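The `backward_hook` method being registered above is never shown in this thread. A minimal sketch of what such a hook might look like, assuming (as the comment in crnn.py says) that it simply replaces NaN/Inf values in the incoming gradients with zero — the names and exact logic here are illustrative, not copied from the repo:

```python
import torch

def backward_hook(module, grad_input, grad_output):
    # Replace any NaN/Inf entries in each incoming gradient tensor with zero,
    # so a single bad batch cannot poison the weights with non-finite updates.
    cleaned = []
    for g in grad_input:
        if g is None:
            cleaned.append(g)
        else:
            cleaned.append(torch.where(torch.isfinite(g), g, torch.zeros_like(g)))
    return tuple(cleaned)

# Registered the same way the snippet above does it:
layer = torch.nn.Linear(4, 2)
layer.register_backward_hook(backward_hook)
```

Note that this only hides the symptom: the forward pass can still produce an inf loss, but the hook keeps that batch from wrecking the model parameters.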
niddal-imam commented 5 years ago

Thank you for your help, but the loss is still nan.

AnneYanggg commented 5 years ago

Hello, have you solved the problem yet? I'm facing the same issue.

Holmeyoung commented 5 years ago

Hi, please try replacing train.py and model/crnn.py with the files in the zip below. Many people have met the same problem.

https://github.com/pytorch/pytorch/issues/14401 https://github.com/pytorch/pytorch/issues/14335

mayfix.zip

AnneYanggg commented 5 years ago

Thanks for your quick reply. I have replaced the files from the zip, but the issue still isn't solved. I studied the code you provided in the zip and noticed the register_backward_hook section you added. Now I think maybe there is something wrong with my training dataset, so I have a question: how do I judge whether a picture is dirty or not?

Holmeyoung commented 5 years ago

Can you give me an example of a dirty picture? We can work out a solution together.

niddal-imam commented 5 years ago

Hi Holmeyoung,

I have replaced the files, but I am still getting loss nan. I have even reinstalled CUDA, but I still can't solve the problem. Thank you very much for your help. It is probably something wrong with my laptop.

AnneYanggg commented 5 years ago

@Holmeyoung I am sorry, I haven't found the dirty picture yet, but I have narrowed it down to a range. Thank you for your kindness. If I have any more questions, I will ask you.

Holmeyoung commented 5 years ago

I am sorry to hear that @niddal-imam. Many people meet this problem when using pytorch, so I am also confused about what the problem is. But there actually is a bug in pytorch's ctcloss; anyway, I am really sorry I can't solve it for you all.
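For anyone hitting this thread today: the CTC loss issues linked above were addressed upstream, and since PyTorch 1.1 `torch.nn.CTCLoss` accepts `zero_infinity=True`, which zeroes infinite losses (and their gradients) instead of letting them turn into nan. A minimal demonstration with a deliberately impossible sample:

```python
import torch

criterion = torch.nn.CTCLoss(zero_infinity=True)

# Target longer than the input sequence: no valid CTC alignment exists,
# so the loss would be inf (and its gradient nan) without zero_infinity.
log_probs = torch.randn(2, 1, 10).log_softmax(2)  # (T=2, N=1, C=10)
targets = torch.tensor([[1, 2, 3, 4, 5]])         # label of length 5 > T
input_lengths = torch.tensor([2])
target_lengths = torch.tensor([5])

loss = criterion(log_probs, targets, input_lengths, target_lengths)
print(loss.item())  # 0.0 — the impossible sample is zeroed out
```

This is the built-in equivalent of the backward-hook workaround in this repo, and it avoids patching the model at all.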

Holmeyoung commented 5 years ago

Hi, please try the latest code; I think I have fixed it, thank you~~~

SriramPingali commented 4 years ago

Hey @Holmeyoung .. I'm using the latest version of the repo and am still facing the issue. Is there a way to correct it? (screenshot: loss)