DrSleep / tensorflow-deeplab-resnet

DeepLab-ResNet rebuilt in TensorFlow
MIT License

Loss skyrockets until NaN even with almost-zero learning rate. #180

Closed ghost closed 6 years ago

ghost commented 6 years ago

I am using TensorFlow 0.12 on the augmented dataset. The loss always skyrockets to Inf and then NaN, even when I set the learning rate to almost zero (1e-15). I even tested with the "debug.txt" file and the images in the misc folder, and got the following output for the first 5 steps of train.py, using deeplab_resnet.ckpt:

step 0 loss = 1.664, (3.926 sec/step)
step 1 loss = 3494010119258112.000, (0.555 sec/step)
step 2 loss = 7812047707534524416.000, (0.256 sec/step)
step 3 loss = 22405511500261228544.000, (0.255 sec/step)
step 4 loss = 35339726484766982144.000, (0.256 sec/step)

Running fine_tune.py returns similar results, and deeplab_resnet_init.ckpt behaves exactly the same:

step 0 loss = 4.739, (3.675 sec/step)
step 1 loss = 14367365719148986368.000, (0.560 sec/step)
step 2 loss = 43311187985166237696.000, (0.256 sec/step)
step 3 loss = 72587321148004368384.000, (0.260 sec/step)
step 4 loss = 94695676436719599616.000, (0.262 sec/step)

My batch size is 1 to avoid OOM errors.

Any ideas what is causing this? I am sure there is nothing wrong with the dataset, since the same thing happens with the debug images.
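Whatever the root cause, a small guard in the training loop makes this kind of divergence fail fast instead of silently printing Inf/NaN losses for thousands of steps. This is a generic sketch, not part of the repo's train.py; the function name is made up for illustration:

```python
import math

def check_finite(step, loss):
    """Raise as soon as the loss stops being a finite number,
    so a diverging run aborts instead of training to NaN."""
    if not math.isfinite(loss):
        raise ValueError("loss diverged at step %d: %r" % (step, loss))
    return loss

# check_finite(0, 1.664) passes; check_finite(1, float("inf")) raises.
```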

ghost commented 6 years ago

Solved this by setting `os.environ["CUDA_VISIBLE_DEVICES"] = "0"`.
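For reference, a minimal sketch of that fix. CUDA_VISIBLE_DEVICES is read when TensorFlow first initialises CUDA, so the assignment has to run before `import tensorflow`; placing it at the very top of train.py is an assumption here, not the reporter's exact patch:

```python
import os

# Pin the process to GPU 0. This must be set before the first
# `import tensorflow as tf`, because the CUDA runtime reads the
# variable once at initialisation.
os.environ["CUDA_VISIBLE_DEVICES"] = "0"

# import tensorflow as tf  # only import TensorFlow after this point
```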