EdjeElectronics / TensorFlow-Object-Detection-API-Tutorial-Train-Multiple-Objects-Windows-10

How to train a TensorFlow Object Detection Classifier for multiple object detection on Windows
Apache License 2.0

Explosion in loss value #481

Closed: rggs closed this issue 4 years ago

rggs commented 4 years ago

When I train my model, it runs normally for a while, and then the loss suddenly explodes to values exceeding 1e19. Has anyone else had this issue? The loss occasionally falls back down to about 0.05, only to shoot back up. Here is an example of the console output:

INFO:tensorflow:global step 1300: loss = 6936561161601024.0000 (6.596 sec/step)
INFO:tensorflow:global step 1300: loss = 6936561161601024.0000 (6.596 sec/step)
INFO:tensorflow:global step 1301: loss = 0.0784 (6.244 sec/step)
INFO:tensorflow:global step 1301: loss = 0.0784 (6.244 sec/step)
INFO:tensorflow:global step 1302: loss = 0.0913 (6.704 sec/step)
INFO:tensorflow:global step 1302: loss = 0.0913 (6.704 sec/step)
INFO:tensorflow:global step 1303: loss = 3369519150006272.0000 (6.697 sec/step)
INFO:tensorflow:global step 1303: loss = 3369519150006272.0000 (6.697 sec/step)
INFO:tensorflow:global step 1304: loss = 0.0614 (6.560 sec/step)
INFO:tensorflow:global step 1304: loss = 0.0614 (6.560 sec/step)
INFO:tensorflow:global step 1305: loss = 0.0839 (6.677 sec/step)
INFO:tensorflow:global step 1305: loss = 0.0839 (6.677 sec/step)
INFO:tensorflow:global step 1306: loss = 0.0503 (6.977 sec/step)
INFO:tensorflow:global step 1306: loss = 0.0503 (6.977 sec/step)
INFO:tensorflow:global step 1307: loss = 9263617392246784.0000 (6.134 sec/step)
INFO:tensorflow:global step 1307: loss = 9263617392246784.0000 (6.134 sec/step)

What is going on here? I'm using the faster_rcnn_inception_v2_coco model. I should add that I've seen similar issues where the number of classes is wrong in the config file, but that doesn't seem to be the case here. As far as I can tell, my number of classes is correct.
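For anyone who wants to rule out the class-count mismatch mentioned above, here is a rough sketch that compares the `num_classes` value in a pipeline config against the number of `item` blocks in a label map. It uses plain regular expressions instead of the Object Detection API's own parsers, and the sample config and label map text are made up for illustration, so treat it as a quick sanity check, not an official tool.

```python
import re


def count_label_map_items(pbtxt_text):
    """Count the `item { ... }` blocks in the text of a label map (.pbtxt)."""
    return len(re.findall(r"\bitem\s*\{", pbtxt_text))


def read_num_classes(config_text):
    """Return every `num_classes` value found in the text of a pipeline config."""
    return [int(n) for n in re.findall(r"\bnum_classes\s*:\s*(\d+)", config_text)]


# Hypothetical sample inputs; in practice, read the text of your own
# pipeline .config file and labelmap .pbtxt with open(...).read().
sample_config = """
model {
  faster_rcnn {
    num_classes: 3
  }
}
"""
sample_label_map = """
item {
  id: 1
  name: 'cat'
}
item {
  id: 2
  name: 'dog'
}
"""

declared = read_num_classes(sample_config)
found = count_label_map_items(sample_label_map)
for n in declared:
    if n != found:
        print(f"Mismatch: config declares {n} classes, label map has {found} items")
```

If the two numbers disagree, the config and the label map describe different sets of classes, which is the situation the earlier reports point to.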

satyamedh commented 4 years ago

Same here.

rggs commented 4 years ago

I'm putting this here because this fixed it for me: https://github.com/tensorflow/models/issues/8423#issuecomment-620188942

Tylersuard commented 4 years ago

You may have labels in your .pbtxt that differ from the labels in your .csv files. Try checking to make sure that the capitalization is the same for all the labels.
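A minimal sketch of that check, assuming the CSV files have a `class` column (as the ones produced by this tutorial's `xml_to_csv.py` should) and that the label map uses `name: '...'` entries. The function names and sample data are hypothetical, and the label map is parsed with a regular expression instead of the Object Detection API, so adapt it to your own files.

```python
import csv
import io
import re


def labels_from_pbtxt(pbtxt_text):
    """Return the set of `name` values in a label map (skips `display_name`)."""
    return set(re.findall(r'\bname\s*:\s*["\']([^"\']+)["\']', pbtxt_text))


def labels_from_csv(csv_file, column="class"):
    """Return the set of labels in the given column of an open CSV file."""
    return {row[column] for row in csv.DictReader(csv_file)}


def compare_labels(pbtxt_labels, csv_labels):
    """Return (labels missing from the label map, those differing only by case)."""
    missing = csv_labels - pbtxt_labels
    lowered = {label.lower() for label in pbtxt_labels}
    case_only = {label for label in missing if label.lower() in lowered}
    return missing, case_only


# Hypothetical sample data; in practice, pass open("your_labels.csv", newline="")
# and the text of your own labelmap .pbtxt.
sample_pbtxt = "item {\n  id: 1\n  name: 'Ace'\n}\nitem {\n  id: 2\n  name: 'king'\n}\n"
sample_csv = io.StringIO("filename,class\na.jpg,ace\nb.jpg,king\n")

missing, case_only = compare_labels(
    labels_from_pbtxt(sample_pbtxt), labels_from_csv(sample_csv)
)
for label in sorted(missing):
    hint = " (differs only in capitalization)" if label in case_only else ""
    print(f"CSV label {label!r} is not in the label map{hint}")
```

Run it once for each of the train and test CSV files; any label it reports needs to be renamed in the CSV or in the label map so the two match exactly.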