chuanqi305 / MobileNetv2-SSDLite

Caffe implementation of SSD and SSDLite detection on MobileNetV2, converted from TensorFlow.
MIT License

training loss doesn't decrease after several hundred steps #31

Open Chanbluky opened 6 years ago

Chanbluky commented 6 years ago

Thanks for your great work! I am using your train.prototxt to train on my own dataset, fine-tuning from your deploy_voc.caffemodel. The loss decreases from 16 to 11 within several steps, but then it doesn't decrease anymore. I then used your train.prototxt and deploy_voc.caffemodel to continue training on the VOC2007 and VOC2012 datasets: the initial loss is 7.2, but it suddenly jumps to 13 after 10 steps and then stays around 11. So the problem I see on my own dataset also appears on your dataset.

Could you help explain why? Thanks! By the way, inference works well when using your deploy.prototxt and deploy_voc.caffemodel.

The following is the log from training on the VOC dataset using your train.prototxt and deploy_voc.caffemodel:

I0913 22:55:11.996824 72654 solver.cpp:259]     Train net output #0: mbox_loss = 7.28595 (* 1 = 7.28595 loss)
I0913 22:55:11.996837 72654 sgd_solver.cpp:138] Iteration 0, lr = 1e-05
I0913 22:55:47.544627 72654 solver.cpp:243] Iteration 10, loss = 13.3957
I0913 22:55:47.544867 72654 solver.cpp:259]     Train net output #0: mbox_loss = 13.1836 (* 1 = 13.1836 loss)
I0913 22:55:47.544908 72654 sgd_solver.cpp:138] Iteration 10, lr = 1e-05
I0913 22:56:21.804133 72654 solver.cpp:243] Iteration 20, loss = 13.1605
I0913 22:56:21.804299 72654 solver.cpp:259]     Train net output #0: mbox_loss = 13.5674 (* 1 = 13.5674 loss)
I0913 22:56:21.804311 72654 sgd_solver.cpp:138] Iteration 20, lr = 1e-05
I0913 22:56:55.918035 72654 solver.cpp:243] Iteration 30, loss = 13.1157
I0913 22:56:55.918321 72654 solver.cpp:259]     Train net output #0: mbox_loss = 13.2727 (* 1 = 13.2727 loss)
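For context, fine-tuning like this is usually launched with the Caffe command-line tool. A minimal sketch, assuming a solver.prototxt that references the repo's train.prototxt; the solver file name, GPU id, and log path are placeholders, not files from this repo:

    # Fine-tune from the released weights: layers whose names and shapes
    # match the .caffemodel are copied, all other layers are re-initialized.
    caffe train --solver=solver.prototxt \
                --weights=deploy_voc.caffemodel \
                --gpu=0 2>&1 | tee train.log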

Chanbluky commented 6 years ago

I also used the cudnn_conv_layers.cpp/cu files that you provided to accelerate the training.

Chanbluky commented 6 years ago

I replaced these two files with the original files in Caffe, and now the issue is gone. I am not familiar with CUDA. I am training on a Tesla M40 with 24 GB of memory; could you tell me why I can't use your files to accelerate the training?
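For reference, reverting the swap amounts to restoring the stock layer sources and rebuilding Caffe. A rough sketch, assuming a Makefile build of caffe-ssd and that backups of the original files were kept (the backup/ path is a placeholder):

    # restore the stock cuDNN convolution layer sources (paths are illustrative)
    cp backup/cudnn_conv_layer.cpp src/caffe/layers/
    cp backup/cudnn_conv_layer.cu  src/caffe/layers/
    # rebuild so the restored sources take effect
    make clean && make -j8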

zyc4me commented 6 years ago

@Chanbluky Hi, do you have the deploy_voc.caffemodel? Could you please share it with me?

Chanbluky commented 6 years ago

> @Chanbluky Hi, do you have the deploy_voc.caffemodel? Could you please share it with me?

Sure! Please provide your email.

zyc4me commented 6 years ago

@Chanbluky 252400108@qq.com, Thank you!

saxg commented 5 years ago

The loss in my experiments stays around 8, and I cannot get the pretrained mobilenetv2_ssdlite_coco model; could someone share it with me? I trained mobilenet_ssd on the same dataset and its loss converges well.

weilanShi commented 4 years ago

> I replaced these two files with the original files in Caffe, and now the issue is gone. I am not familiar with CUDA. I am training on a Tesla M40 with 24 GB of memory; could you tell me why I can't use your files to accelerate the training?

Did you take the cudnn_conv_layer.cpp/cu from the original Caffe to replace the ones in caffe-ssd? Is the training speed faster?
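One way to compare the two builds is Caffe's built-in benchmarking mode, which reports average forward/backward time per layer. A minimal sketch; the model path, iteration count, and GPU id are placeholders:

    # benchmark the training network with each build and compare the timings
    caffe time --model=train.prototxt --iterations=50 --gpu=0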

weilanShi commented 4 years ago

> Thanks for your great work! I am using your train.prototxt to train on my own dataset, fine-tuning from your deploy_voc.caffemodel. The loss decreases from 16 to 11 within several steps, but then it doesn't decrease anymore. I then used your train.prototxt and deploy_voc.caffemodel to continue training on the VOC2007 and VOC2012 datasets: the initial loss is 7.2, but it suddenly jumps to 13 after 10 steps and then stays around 11. So the problem I see on my own dataset also appears on your dataset.
>
> Could you help explain why? Thanks! By the way, inference works well when using your deploy.prototxt and deploy_voc.caffemodel.


The training loss decreases very quickly when the coco.caffemodel is converted into a training model for my own dataset.