fyu / dilation

Dilated Convolution for Semantic Image Segmentation
https://www.vis.xyz/pub/dilation
MIT License

Training loss #12

Open maolin23 opened 8 years ago

maolin23 commented 8 years ago

Hi,

Could you tell me about the loss during training? When I use your code to train the front-end model, my loss is about 2–3 in the first 15 iterations. After that, the loss increases to 50–80 and stays there; after 20K iterations it is still about 60–80. I'm not sure whether this is correct... Could you tell me whether this behavior is normal? What loss should I expect? (My training/testing input images are all original and I didn't change anything in train.py.)

Thanks a lot, Mao

lhao0301 commented 8 years ago

@maolin23 Have you solved the problem? I have also run into it, and I have tried several different batch_size and iter_size values. Sometimes the loss changes as you describe and sometimes it behaves normally. More specifically, when iter_size is 1 the loss usually behaves normally, but when iter_size is larger the loss always diverges.

fyu commented 8 years ago

The loss in the initial stage should be around 3.0 for a 19-category classification problem. If you observe something much bigger than that, it probably indicates that the optimization is diverging. It is hard to diagnose the exact problem without more information, but if you are using the parameters and datasets described in the dilation paper, this is unlikely to happen.
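(As a rough sanity check of that number, my own illustration rather than anything from the repo: with softmax cross-entropy and a freshly initialized classifier, predictions are roughly uniform over the C classes, so the expected per-pixel loss is about ln(C).)

```python
# Sanity-check sketch (not part of the repo): expected initial cross-entropy
# loss for a C-way softmax classifier with near-uniform predictions is ln(C).
import math

num_classes = 19
expected_initial_loss = math.log(num_classes)
print(expected_initial_loss)  # ~2.94, i.e. "around 3.0"
```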

lhao0301 commented 8 years ago

@fyu I train the front-end net with vgg_conv.caffemodel as initialization and only changed batch_size to 8 because of my limited GPU memory. It still diverges sometimes.

jgong5 commented 7 years ago

I got the same problem with batch size 8, but it is better with batch size 7. Why is there such a big difference from a subtle change in batch size?

fyu commented 7 years ago

I just added an option to set iter_size in the training options: https://github.com/fyu/dilation/blob/master/train.py#L233. If your GPU doesn't have enough memory and you have to decrease the batch size, you can try to increase iter_size.
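(For anyone unsure what iter_size does, my own note rather than the author's: Caffe accumulates gradients over iter_size forward/backward passes before each weight update, so the effective batch size per update is batch_size * iter_size. A minimal illustration with hypothetical values:)

```python
# Illustration only: relationship between batch_size, iter_size, and the
# effective batch size Caffe uses for each weight update.
batch_size = 4   # per-pass batch that fits in GPU memory (hypothetical value)
iter_size = 2    # number of passes whose gradients are accumulated per update
effective_batch = batch_size * iter_size
print(effective_batch)  # 8 -> same update statistics as batch_size 8, less memory
```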

austingg commented 7 years ago

@maolin23 @jgong5 @fyu have you solved the problem? I also ran into it. After I changed the final layer to 'xavier' initialization, the loss looks better, but I have not finished training yet.

jgong5 commented 7 years ago

No. I gave up eventually and turned to Berkeley's FCN. You mean your change can get the network to converge eventually?


austingg commented 7 years ago

@jgong5 Unfortunately, the loss becomes bigger after 200 iterations with batch size 6.

austingg commented 7 years ago

@jgong5 @fyu After I initialized the net with vgg_conv and used xavier initialization for the new weights, the training loss looks better and goes down as the iterations increase; after 30K iterations the training loss is about 10^-6. However, the test accuracy is always -nan, and the test results are all black. I train on my custom dataset.

huangh12 commented 7 years ago

@maolin23 @TX2012LH @jgong5 @austingg Same problem as you guys. I just train the net on VOC07 (fewer than 500 images); it's quite weird that the net fails to converge since the dataset is so small... However, it seems that the author has stopped offering support now.

ice-pice commented 7 years ago

Hi @fyu I have run the training of the VGG front-end model based on the documentation you have provided. However, the loss appears to diverge very soon, as you can see in this log. I have cross-checked the hyper-parameters you mention in the paper against the ones written in the documentation, and they seem to match. The same divergence issue can be seen with joint training.

I am running your code using cuda-8.0 and cudnn-5. Can you kindly run your demo from scratch and tell us where the issue might be? A lot of people here seem to be facing the same issue.

Thanks!

xingbotao commented 7 years ago

Is the label one channel or three (RGB) channels?

fyu commented 7 years ago

one channel
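(For anyone whose ground-truth labels are stored as color/palette PNGs, here is a minimal sketch of my own for getting single-channel class-index maps, assuming paletted PNGs as in PASCAL VOC; the file name is hypothetical:)

```python
# Sketch (assumption: labels are paletted PNGs, as in PASCAL VOC).
# Loading without converting to RGB keeps the per-pixel class index,
# i.e. the single-channel label format referred to above.
import numpy as np
from PIL import Image

label = Image.open('example_label.png')   # hypothetical path
label_array = np.array(label)             # shape (H, W), values are class ids
print(label_array.shape, label_array.dtype)
```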

HXKwindwizard commented 7 years ago

Hi, @fyu. Thank you for your excellent code. I ran into a problem: when I use the trained models (loss near 2) and test_net.txt (frontend or joint) to predict on an image, the resulting image is always black, with nothing in it.
Is there anything I need to do before prediction? Thanks in advance.

fyu commented 7 years ago

@HXKwindwizard If the loss is 2, it is a bad sign: it means the model is not working properly. Probably your data is too different from what the model was trained on. Training the model on your own data may solve the problem.

HXKwindwizard commented 7 years ago

@fyu Thanks for the reminder. I use the PASCAL VOC dataset and fine-tune from the VGG weights you suggested, and I have run several trainings on this data. Even when the loss is around 10, the situation I mentioned above still occurs. So I wonder: even if the training is not good, how can the predicted image always be black? Yet when I use your trained model to do the prediction, the result is quite good. Is there any relationship with the network structure? (I use test_net.txt as the prototxt.)