experiencor / keras-yolo3

Training and Detecting Objects with YOLO3

How to execute eagerly to find nan-loss cause? #306

Open · fredrikorn opened this issue 3 years ago

fredrikorn commented 3 years ago

Hi! I've been using this repo on my own dataset and I've run into the problem of the loss suddenly hitting NaN, even though it was converging nicely before (as in #198). After printing some values in the TensorFlow graph I'm fairly sure the error comes from odd values for box width and height, but I haven't managed to pinpoint it.

To check it, I thought I'd try running the program eagerly with tf.compat.v1.enable_eager_execution(), but that results in the error 'get_session' is not available when TensorFlow is executing eagerly.

Is it possible to run it eagerly in some way, or has anyone figured out the reason for the sudden NaN loss?
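
In case it helps anyone else debugging this: one way to narrow down the source while staying in graph mode (so no eager execution needed) is to wrap suspect tensors in tf.debugging.check_numerics, which makes the run fail with your label as soon as a NaN or Inf appears in that tensor. A rough sketch, with made-up tensor names rather than the actual variables from this repo:

import tensorflow as tf

def checked(tensor, label):
    # Fails the run with `label` in the error message the moment `tensor`
    # contains a NaN or Inf, so the first bad value can be localized.
    return tf.debugging.check_numerics(tensor, message=label)

# Illustrative use inside a width/height loss term (names are made up):
def wh_term(true_wh, pred_wh):
    true_wh = checked(true_wh, "true_wh")
    pred_wh = checked(pred_wh, "pred_wh")
    diff = checked(tf.sqrt(true_wh) - tf.sqrt(pred_wh), "sqrt_wh_diff")
    return tf.reduce_sum(tf.square(diff))

Note that check_numerics only inspects forward values; a gradient that blows up usually shows as NaN weights and activations on the following step, which still narrows down where things first break.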

fredrikorn commented 3 years ago

In case someone else runs into this issue: I found that the NaN loss comes from the gradient of tf.sqrt diverging close to zero (see this post). I tackled this by adding a small epsilon value (1e-7) in dummy_loss in yolo.py.
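
For reference, a rough sketch of the kind of change I mean (the function and variable names here are illustrative, not the exact ones in yolo.py): the gradient of sqrt(x) is 1/(2*sqrt(x)), which diverges as x approaches 0, and a small epsilon keeps it finite.

import tensorflow as tf

EPS = 1e-7  # small constant to keep the sqrt gradient finite at zero

def safe_sqrt(x):
    # d/dx sqrt(x) = 1 / (2 * sqrt(x)) diverges as x -> 0, which is what
    # turns the loss into NaN once a width/height term hits zero.
    return tf.sqrt(x + EPS)

# Illustrative width/height term of a YOLO-style loss:
def wh_loss(true_wh, pred_wh, object_mask):
    diff = safe_sqrt(true_wh) - safe_sqrt(pred_wh)
    return tf.reduce_sum(object_mask * tf.square(diff))

Clipping with tf.maximum(x, EPS) works just as well; either way the point is only to keep the denominator of the gradient away from zero.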

Regarding eager execution, I haven't solved that part.