Holmeyoung / crnn-pytorch

Pytorch implementation of CRNN (CNN + RNN + CTCLoss) for all language OCR.
MIT License
377 stars · 105 forks

loss is inf #21

Closed SreenijaK closed 5 years ago

SreenijaK commented 5 years ago

Why is my loss always inf, like below?

    [806/1000][400/410] Loss: inf 0|train
    [807/1000][100/410] Loss: inf 0|train
    [807/1000][200/410] Loss: inf 0|train
    [807/1000][300/410] Loss: inf 0|train
    [807/1000][400/410] Loss: inf 0|train
    [808/1000][100/410] Loss: inf 0|train
    [808/1000][200/410] Loss: inf 0|train
    [808/1000][300/410] Loss: inf 0|train
    [808/1000][400/410] Loss: inf 0|train
    [809/1000][100/410] Loss: inf 0|train
    [809/1000][200/410] Loss: inf 0|train
    [809/1000][300/410] Loss: inf 0|train
    [809/1000][400/410] Loss: inf 0|train
    [810/1000][100/410] Loss: inf 0|train
    [810/1000][200/410] Loss: inf

SreenijaK commented 5 years ago

I was able to rectify the issue: whenever the label of a training sample was longer than 26 characters, I got the inf loss above.

Training has now reached epoch 1000, but no model has been saved in expr. Do you by chance know the reason?
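For context on why over-long labels blow up: CTC cannot align a target sequence that is longer than the network's output sequence, and PyTorch's `nn.CTCLoss` returns inf for such samples. The sketch below is an illustration, not this repo's code; the 26-step output length is an assumption matching the reported 26-character limit:

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Assumed shapes: T = 26 time steps out of the CNN, batch of 1, 10 classes.
T, N, C = 26, 1, 10
log_probs = torch.randn(T, N, C).log_softmax(2)
input_lengths = torch.full((N,), T, dtype=torch.long)

# A 30-character target cannot fit into 26 output steps, so the loss is inf.
target_len = 30
targets = torch.randint(1, C, (N, target_len), dtype=torch.long)
target_lengths = torch.full((N,), target_len, dtype=torch.long)

loss = nn.CTCLoss()(log_probs, targets, input_lengths, target_lengths)
print(loss)  # inf

# zero_infinity=True replaces the inf loss (and its gradient) with zero,
# so a few over-long samples no longer poison training.
safe_loss = nn.CTCLoss(zero_infinity=True)(
    log_probs, targets, input_lengths, target_lengths)
print(safe_loss)  # 0.
```

So the two usual fixes are filtering out labels longer than the model's output width, or constructing the loss with `zero_infinity=True`.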

SreenijaK commented 5 years ago

I haven't changed anything in params.py, but I still don't understand the reason. Here's a snippet of my params.py:

    nc = 1
    pretrained = ''  # path to pretrained model (to continue training)
    expr_dir = 'expr'  # where to store samples and models
    dealwith_lossnone = True  # whether to replace all nan/inf in gradients with zero

    cuda = True  # enables cuda
    multi_gpu = False  # whether to use multi gpu
    ngpu = 1  # number of GPUs to use. Do remember to set multi_gpu to True!
    workers = 0  # number of data loading workers

    # training process
    displayInterval = 100  # interval to print the train loss
    valInterval = 1000  # interval to validate the model loss and accuracy
    saveInterval = 1000  # interval to save the model
    n_test_disp = 10  # number of samples to display during validation
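On the `dealwith_lossnone` flag: it is described as replacing all nan/inf entries in the gradients with zero. A minimal sketch of that idea (the helper name is hypothetical, not the repo's actual function):

```python
import torch

def sanitize_gradients(model: torch.nn.Module) -> None:
    """Replace nan/inf entries in parameter gradients with zero,
    mirroring what a dealwith_lossnone-style option is meant to do."""
    for p in model.parameters():
        if p.grad is not None:
            # In-place: nan -> 0, +inf -> 0, -inf -> 0; finite values untouched.
            torch.nan_to_num_(p.grad, nan=0.0, posinf=0.0, neginf=0.0)
```

Called between `loss.backward()` and `optimizer.step()`, this keeps a single bad batch from wiping out the weights, though it only masks the symptom; the over-long labels still contribute nothing to learning.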

Holmeyoung commented 5 years ago

Hi, sorry for the late reply. You can check the code:

            # do checkpointing
            if i % params.saveInterval == 0:
                torch.save(crnn.state_dict(), '{0}/netCRNN_{1}_{2}.pth'.format(params.expr_dir, epoch, i))

Here `i` is the batch index, so `i < len(train_loader)`, and `len(train_loader)` depends on the batch_size. If `len(train_loader)` is smaller than `params.saveInterval`, the condition `i % params.saveInterval == 0` is never true, so nothing is ever saved.

Of course, we needn't make it so complicated; simply changing `params.saveInterval` to a smaller value is enough.
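The arithmetic above can be sketched directly. With the values from the log (410 batches per epoch, saveInterval = 1000, both taken from this thread), the modulo condition never fires:

```python
def checkpoints_per_epoch(num_batches: int, save_interval: int) -> int:
    """Count how many times the checkpoint branch `i % save_interval == 0`
    runs in one epoch, for batch indices i = 1 .. num_batches."""
    return sum(1 for i in range(1, num_batches + 1) if i % save_interval == 0)

print(checkpoints_per_epoch(410, 1000))  # 0 -- no model is ever saved
print(checkpoints_per_epoch(410, 100))   # 4 -- saves at i = 100, 200, 300, 400
```

So lowering saveInterval below the number of batches per epoch (here, below 410) guarantees at least one checkpoint per epoch.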

SreenijaK commented 5 years ago

Thank you, it's working now. You've been a great help.

Holmeyoung commented 5 years ago

Hi, please try the latest code; I think I have fixed it. Thank you~~~