I am using TensorFlow 0.12 on the augmented dataset. The loss always skyrockets to Inf and then NaN, even when I set the learning rate to effectively zero (1e-15). I even tested with the "debug.txt" file and the misc folder images, and I get the following output for the first 5 steps of train.py, using deeplab_resnet.ckpt:
step 0 loss = 1.664, (3.926 sec/step)
step 1 loss = 3494010119258112.000, (0.555 sec/step)
step 2 loss = 7812047707534524416.000, (0.256 sec/step)
step 3 loss = 22405511500261228544.000, (0.255 sec/step)
step 4 loss = 35339726484766982144.000, (0.256 sec/step)
Running fine_tune.py returns similar results.
deeplab_resnet_init.ckpt behaves exactly the same:
step 0 loss = 4.739, (3.675 sec/step)
step 1 loss = 14367365719148986368.000, (0.560 sec/step)
step 2 loss = 43311187985166237696.000, (0.256 sec/step)
step 3 loss = 72587321148004368384.000, (0.260 sec/step)
step 4 loss = 94695676436719599616.000, (0.262 sec/step)
My batch size is 1 to avoid the OOM error.
Any ideas what is causing this mystery? I am sure there is nothing wrong with the dataset, as the same happens with the debug images.
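For anyone comparing runs, here is a minimal sketch that finds the first step at which the loss blows up, assuming log lines in the format shown above. The function name `first_divergent_step` and the `threshold` value are hypothetical, chosen only for illustration; they are not part of train.py.

```python
import math
import re

# Matches lines like "step 1 loss = 3494010119258112.000, (0.555 sec/step)"
LOG_LINE = re.compile(r"step (\d+)\s+loss = ([^,\s]+),")


def first_divergent_step(log_text, threshold=1e6):
    """Return the first step whose loss is non-finite or above threshold.

    Returns None if no logged step diverges. The threshold is an arbitrary
    assumption; pick one that suits the expected loss range.
    """
    for match in LOG_LINE.finditer(log_text):
        step = int(match.group(1))
        loss = float(match.group(2))  # float() also accepts "inf" and "nan"
        if not math.isfinite(loss) or loss > threshold:
            return step
    return None


log = """step 0 loss = 1.664, (3.926 sec/step)
step 1 loss = 3494010119258112.000, (0.555 sec/step)
step 2 loss = 7812047707534524416.000, (0.256 sec/step)"""

print(first_divergent_step(log))
```

On the logs above, this reports that the loss has already diverged by step 1, that is, right after the first parameter update.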