MichalBusta / E2E-MLT

E2E-MLT - an Unconstrained End-to-End Method for Multi-Language Scene Text
MIT License

The OCR loss becomes NaN #53

Open GlowingHorse opened 5 years ago

GlowingHorse commented 5 years ago

I use my own Japanese dataset, with all words cropped into single-word images. I then use train_ocr to train the OCR network, using e2e-mltrctw.h5 as the pretrained model but changing the model's output size from 7500 to 4748, which is the number of word types in my dataset. However, the loss becomes NaN very quickly. Is there a reason for this? Thanks!

These are the training losses:

683464 training images in data/crop_train_images/crop_trainkuzushi.txt
683464 training images in data/crop_train_images/crop_trainkuzushi.txt
683464 training images in data/crop_train_images/crop_trainkuzushi.txt
683464 training images in data/crop_train_images/crop_trainkuzushi.txt
epoch 0[0], loss: 55.214, lr: 0.00010
epoch 0[500], loss: 54.610, lr: 0.00010
epoch 0[1000], loss: 14.609, lr: 0.00010
epoch 0[1500], loss: 7.219, lr: 0.00010
epoch 0[2000], loss: 6.109, lr: 0.00010
epoch 0[2500], loss: 5.536, lr: 0.00010
epoch 0[3000], loss: 4.826, lr: 0.00010
epoch 0[3500], loss: 4.030, lr: 0.00010
epoch 0[4000], loss: 3.301, lr: 0.00010
epoch 0[4500], loss: nan, lr: 0.00010
epoch 1[5000], loss: nan, lr: 0.00010
save model: backup2/E2E_5000.h5
epoch 1[5500], loss: nan, lr: 0.00010
epoch 1[6000], loss: nan, lr: 0.00010
epoch 1[6500], loss: nan, lr: 0.00010
epoch 1[7000], loss: nan, lr: 0.00010
epoch 1[7500], loss: nan, lr: 0.00010
epoch 1[8000], loss: nan, lr: 0.00010
epoch 1[8500], loss: nan, lr: 0.00010
epoch 1[9000], loss: nan, lr: 0.00010
epoch 1[9500], loss: nan, lr: 0.00010
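For anyone reproducing this setup: below is a minimal sketch of loading the pretrained checkpoint while leaving the resized output layer at its fresh random initialization. It assumes the .h5 file stores one dataset per parameter name (as the repo's h5py-based save format suggests); the helper name load_pretrained is hypothetical, not E2E-MLT's actual loader.

import h5py
import numpy as np
import torch

def load_pretrained(fname, net):
    # copy every parameter whose name and shape match; skip the rest,
    # e.g. the OCR head whose output size changed from 7500 to 4748
    with h5py.File(fname, mode="r") as h5f:
        for name, param in net.state_dict().items():
            if name not in h5f:
                continue
            saved = torch.from_numpy(np.asarray(h5f[name]))
            if saved.shape != param.shape:
                print(f"skipping {name}: checkpoint {tuple(saved.shape)} "
                      f"vs model {tuple(param.shape)}")
                continue
            param.copy_(saved)

Force-copying a 7500-way layer into a 4748-way one would either fail on the shape mismatch or load garbage, so skipping it and fine-tuning the new head from scratch is the usual approach.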

GlowingHorse commented 5 years ago

That's weird. When I increase the network output size from 4748 to 4900, the network trains for longer; so far, NaN has not appeared. I will report back tomorrow.
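A side note while debugging: if the training script uses PyTorch's built-in torch.nn.CTCLoss (an assumption; E2E-MLT of this era used the warp-ctc binding, which has no such option), zero_infinity=True is a common guard. It zeroes the infinite losses produced by unalignable samples instead of letting one bad batch turn the running average to nan:

import torch
import torch.nn as nn

# zero_infinity replaces the inf loss of an unalignable sample with 0
# so a single bad batch cannot poison the running mean with nan
ctc = nn.CTCLoss(blank=0, zero_infinity=True)

T, N, C, S = 50, 2, 4900, 10   # time steps, batch, classes, target length
log_probs = torch.randn(T, N, C).log_softmax(2)
targets = torch.randint(1, C, (N, S))        # valid labels: 1 .. C-1
loss = ctc(log_probs, targets,
           torch.full((N,), T, dtype=torch.long),
           torch.full((N,), S, dtype=torch.long))
print(loss.item())  # finite

Note that this only masks unalignable samples; it does not fix out-of-range label indices, which appears to be the actual culprit in this issue.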

GlowingHorse commented 5 years ago

It seems to have been solved. I just set the number of output channels to be larger than the target count (don't just add one; try about one hundred more than the target number).
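This workaround is consistent with some encoded labels not fitting in the original 4748-channel head: with CTC, one channel is reserved for the blank, so valid label indices are 1 .. num_channels - 1, and an index outside that range tends to produce nan (or a device-side assert) rather than a clear error. A minimal, hypothetical pre-flight check along these lines (not part of E2E-MLT) would catch the problem before training:

def check_labels(encoded_labels, num_output_channels, blank=0):
    """Verify every encoded label fits inside the OCR head's output dim.

    With CTC, one index (here 0) is reserved for the blank, so valid
    label indices are 1 .. num_output_channels - 1.
    """
    max_idx = max(max(seq) for seq in encoded_labels)
    assert max_idx < num_output_channels, (
        f"label index {max_idx} needs at least {max_idx + 1} output "
        f"channels, got {num_output_channels}: enlarge the output layer")
    assert all(blank not in seq for seq in encoded_labels), \
        "encoded labels must not contain the blank index"

# an alphabet of 4748 symbols plus the blank needs at least 4749 channels
check_labels([[1, 42, 4747]], num_output_channels=4749)

Extra unused channels (4900 instead of the minimum 4749) are harmless beyond a slightly larger final layer, which is why over-provisioning the output size makes the nan disappear.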