aurora95 / Keras-FCN

Keras-tensorflow implementation of Fully Convolutional Networks for Semantic Segmentation (unfinished)

Issue with training FCN_Vgg16_32s #60

Closed: interactivetech closed this issue 6 years ago

interactivetech commented 6 years ago

Hi there.

Nice job on the repo! I am trying to train the FCN_Vgg16_32s model, and I am having an issue with the loss becoming NaN.

I followed the steps to set up the VOC dataset and transfer VGG weights for the FCN model.

The only code change I made was in train.py, setting model_name='FCN_Vgg16_32s'; after a few steps, the loss becomes NaN.

Here is a screenshot of my terminal output.

[screenshot: terminal output showing the loss diverging to NaN]

Please advise on how I should go about debugging the loss, and let me know if you need additional information.
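
(Editor's note, not part of the original thread: for anyone debugging a similar divergence, Keras ships a TerminateOnNaN callback that aborts training the moment the loss goes NaN, so a bad run fails fast. A minimal sketch; `model`, `X_train`, and `y_train` are placeholders for whatever train.py builds, not the repo's exact names:)

```python
from keras.callbacks import TerminateOnNaN

# TerminateOnNaN stops training as soon as the loss becomes NaN,
# instead of burning epochs on already-diverged weights.
# `model`, `X_train`, `y_train` are illustrative placeholders.
model.fit(X_train, y_train,
          batch_size=16,
          epochs=50,
          callbacks=[TerminateOnNaN()])
```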

interactivetech commented 6 years ago

I watched the loss during training: between steps 4 and 9 it grew rapidly to a very large value and then reached NaN. I suspected this was caused by too high a learning rate, so I checked what learning rate other repos use to train FCN_Vgg16_32s. It seems that people use 1e-10.

With that learning rate, the loss now behaves appropriately.

Here are some links with additional details where people set the learning rate to 1e-10 (a minimal Keras sketch follows the links):

pytorch-fcn (vendored fcn.berkeleyvision.org Caffe solver): https://github.com/wkentaro/pytorch-fcn/blob/cfad9e594e2bb5a327ba838a7103abeb74e190c8/torchfcn/ext/fcn.berkeleyvision.org/siftflow-fcn32s/solver.prototxt#L10

pytorch-semantic-segmentation: https://github.com/ZijunDeng/pytorch-semantic-segmentation/blob/4a1721f9a3284788336430efb140288096c6dd09/train/voc-fcn/train.py#L27
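
(Editor's note: a minimal sketch of plugging that rate into a Keras SGD optimizer. The momentum value of 0.99 mirrors the linked fcn.berkeleyvision.org solver, which pairs the tiny rate with an unnormalized, summed loss; the loss name below is an assumption about how the model is compiled:)

```python
from keras.optimizers import SGD

# lr=1e-10 and momentum=0.99 mirror the linked Caffe solver
# (fcn.berkeleyvision.org). The original FCN solvers use this tiny
# rate together with an unnormalized softmax loss.
sgd = SGD(lr=1e-10, momentum=0.99)
model.compile(optimizer=sgd, loss='categorical_crossentropy')
```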

interactivetech commented 6 years ago

Update: a learning rate of 1e-10 did not work as well as I had hoped. I am getting the best training results with base_lr=1e-5 and weight_decay=1e-5.
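
(Editor's note: a sketch of how those two values could be wired up in Keras. Keras's SGD optimizer has no weight_decay argument, so decay is usually expressed as an L2 kernel regularizer on each layer; the variable names here are illustrative, not the repo's exact ones:)

```python
from keras.optimizers import SGD
from keras.regularizers import l2
from keras.layers import Conv2D

base_lr = 1e-5
weight_decay = 1e-5

# The learning rate lives on the optimizer; weight decay in Keras is
# expressed as an L2 penalty attached to each layer's kernel.
optimizer = SGD(lr=base_lr, momentum=0.9)

# Example layer showing where the decay term attaches:
conv = Conv2D(64, (3, 3), padding='same',
              kernel_regularizer=l2(weight_decay))
```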