SeougnSeon opened this issue 7 years ago
I added more training data by combining VOC2007 and VOC2012. The total loss still does not converge.
I used ssd_300_vgg as the pre-trained weights (fine-tuning an existing SSD checkpoint) instead of vgg_16 (fine-tuning a network trained on ImageNet). I will try training from vgg_16 next. I think your work is a very good way to learn TF-slim.
I also have this problem. The total loss stays above 6.0 for a long time.
I got the same problem with the vgg_16 setting (fine-tuning a network trained on ImageNet). The loss is even larger than with ssd_300_vgg.
I'm also having a problem converging with vgg_16. What mAP values did you achieve on evaluation?
My mAP is close to zero. What mAP values did you achieve?
I didn't run the evaluation set after training; it was only used temporarily to check that the code works.
The detected image is from the ipynb code.
If the detected image is correct, my next task is evaluating mAP.
I got the same convergence problem. To help a bit, I used a fixed learning rate, but the loss still does not converge.
I got an mAP of 0.27 after a couple of days on a 4-GPU machine. Note that the reported results are for training on 2007+2012 (which I'm running at the moment).
I got the same convergence problem: the global step is about 8000, but the loss is still around 6 and the mAP is 0.026. I have no idea what causes it.
I got the same problem, too. I really want to train from the ImageNet-pretrained model.
@edocoh87 Did you train the ssd model using 4 GPUs?
Yes, 4 Titan X
@edocoh87 could you share your train script?
@edocoh87
DATASET_DIR=/home/ywlee/data/VOC2007_TFRecords
TRAIN_DIR=./logs/vgg_300_0404
CHECKPOINT_PATH=./checkpoints/vgg_16.ckpt
python train_ssd_network.py \
--train_dir=${TRAIN_DIR} \
--dataset_dir=${DATASET_DIR} \
--dataset_name=pascalvoc_2007 \
--dataset_split_name=train \
--model_name=ssd_300_vgg \
--checkpoint_path=${CHECKPOINT_PATH} \
--checkpoint_exclude_scopes=ssd_300_vgg/conv6,ssd_300_vgg/conv7,ssd_300_vgg/block8,ssd_300_vgg/block9,ssd_300_vgg/block10,ssd_300_vgg/block11,ssd_300_vgg/block4_box,ssd_300_vgg/block7_box,ssd_300_vgg/block8_box,ssd_300_vgg/block9_box,ssd_300_vgg/block10_box,ssd_300_vgg/block11_box \
--checkpoint_model_scope=vgg_16 \
--save_summaries_secs=60 \
--save_interval_secs=600 \
--weight_decay=0.0005 \
--optimizer=rmsprop \
--learning_rate=0.005 \
--num_epochs_per_decay=10 \
--batch_size=32 \
--max_number_of_steps=200000 \
--num_clones=4
When I set the num_clones=4 argument in the command script, I got this error.
InvalidArgumentError (see above for traceback): Cannot assign a device to node 'clone_3/ssd_losses/block_5/cross_entropy_neg/SparseSoftmaxCrossEntropyWithLogits/Gather': Could not satisfy explicit device specification '/device:GPU:3' because no supported kernel for GPU devices is available.
[[Node: clone_3/ssd_losses/block_5/cross_entropy_neg/SparseSoftmaxCrossEntropyWithLogits/Gather = Gather[Tindices=DT_INT32, Tparams=DT_INT32, validate_indices=true, _device="/device:GPU:3"](clone_3/ssd_losses/block_5/cross_entropy_neg/SparseSoftmaxCrossEntropyWithLogits/Shape_1, clone_3/ssd_losses/block_5/cross_entropy_neg/SparseSoftmaxCrossEntropyWithLogits/sub)]]
Could you let me know how to set up multi-GPU training?
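One possible workaround (not confirmed anywhere in this thread, and assuming the TF 1.x session setup that train_ssd_network.py uses): allow soft device placement so that ops without a GPU kernel, such as this int32 Gather, fall back to the CPU instead of failing the explicit /device:GPU:3 assignment. A minimal sketch:

```python
# Minimal sketch, assuming a TF 1.x session config is passed to the training loop;
# this is not code from the repository.
import tensorflow as tf

config = tf.ConfigProto(
    allow_soft_placement=True,    # fall back to CPU for ops with no GPU kernel
    log_device_placement=False)
config.gpu_options.allow_growth = True

# e.g. slim.learning.train(..., session_config=config)
```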
I also got the same problem. Has anyone solved it?
I have the same problem. Does anyone have a solution?
I am currently experimenting with how to fix the training. I set up a special branch, fix_training.
A few things I have noticed so far:
- Use trainable_scopes to train only the new parts of the network first; then, in a second pass, fine-tune the full network (see the sketch below).
- I also changed the loss function to copy the SSD Caffe settings completely.
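For reference, a minimal sketch of what a --trainable_scopes flag typically does in a TF-slim setup; the helper name and exact behaviour of the repo's tf_utils version are assumptions:

```python
# Illustrative sketch (not necessarily identical to tf_utils in this repo):
# collect only the variables whose name falls under one of the listed scopes,
# so the optimizer updates just the new SSD layers during the first stage.
import tensorflow as tf

def get_variables_to_train(trainable_scopes):
    if trainable_scopes is None:
        return tf.trainable_variables()          # train everything
    scopes = [scope.strip() for scope in trainable_scopes.split(',')]
    variables_to_train = []
    for scope in scopes:
        variables_to_train.extend(
            tf.get_collection(tf.GraphKeys.TRAINABLE_VARIABLES, scope))
    return variables_to_train
```

In the second stage, simply omitting --trainable_scopes makes every variable trainable again for full fine-tuning.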
I kept training for 80,000 steps (fine-tuning from ssd_300_vgg.ckpt). I found that although the loss stays between 3.0 and 6.0 most of the time, the mAP keeps increasing. In the end, I achieved 70% mAP on VOC07 and 72% mAP on VOC12.
I wonder whether this training process is actually correct and simply converges very slowly.
@ithink2 Thanks for the testing.
I am working on fixing this training problem, aiming to get at least ~0.7 mAP starting from the VGG weights.
Things are getting a bit better (you can have a look at the fix_training branch). I implemented hard negative mining equivalent to SSD Caffe's, and I am looking at how to improve the data augmentation part.
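For readers following along, here is a minimal sketch of SSD-style hard negative mining (keep all positives and only the highest-loss negatives, at roughly a 3:1 negative-to-positive ratio as in SSD Caffe); this is illustrative and not the code on the fix_training branch:

```python
import tensorflow as tf

def hard_negative_mask(conf_loss, positive_mask, neg_ratio=3.0):
    """conf_loss and positive_mask are 1-D tensors over all anchors of a batch."""
    positive_mask = tf.cast(positive_mask, tf.bool)
    n_positives = tf.reduce_sum(tf.cast(positive_mask, tf.int32))
    n_negatives = tf.cast(neg_ratio * tf.cast(n_positives, tf.float32), tf.int32)
    n_negatives = tf.minimum(tf.maximum(n_negatives, 1), tf.size(conf_loss))

    # Ignore positives, then keep only the negatives with the largest loss.
    neg_loss = tf.where(positive_mask, tf.zeros_like(conf_loss), conf_loss)
    top_values, _ = tf.nn.top_k(neg_loss, k=n_negatives)
    threshold = top_values[-1]
    return tf.logical_and(tf.logical_not(positive_mask), neg_loss >= threshold)
```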
I got this training result with vgg_16 (fine-tuning a network trained on ImageNet) on VOC07 after training for 4 days on 1 GPU.
Training script:
DATASET_DIR=/home/ywlee/data/VOC2007_TFRecords
CHECKPOINT_PATH=./checkpoints/vgg_16.ckpt
python train_ssd_network.py \
--train_dir=${TRAIN_DIR} \
--dataset_dir=${DATASET_DIR} \
--dataset_name=pascalvoc_2007 \
--dataset_split_name=train \
--model_name=ssd_300_vgg \
--checkpoint_path=${CHECKPOINT_PATH} \
--checkpoint_exclude_scopes=ssd_300_vgg/conv6,ssd_300_vgg/conv7,ssd_300_vgg/block8,ssd_300_vgg/block9,ssd_300_vgg/block10,ssd_300_vgg/block11,ssd_300_vgg/block4_box,ssd_300_vgg/block7_box,ssd_300_vgg/block8_box,ssd_300_vgg/block9_box,ssd_300_vgg/block10_box,ssd_300_vgg/block11_box \
--checkpoint_model_scope=vgg_16 \
--save_summaries_secs=60 \
--save_interval_secs=600 \
--weight_decay=0.0005 \
--optimizer=rmsprop \
--learning_rate=0.001 \
--num_epochs_per_decay=200 \
--batch_size=32 \
--learning_rate_decay_factor=0.94
Evaluation script:
TRAIN_DIR=/home/ywlee/SSD-Tensorflow/logs/vgg_300_0405/model.ckpt-468031
DATASET_DIR=/home/ywlee/data/VOC2007_TFRecords
EVAL_DIR=${TRAIN_DIR}/eval
python eval_ssd_network.py \
--eval_dir=${EVAL_DIR} \
--dataset_dir=${DATASET_DIR} \
--dataset_name=pascalvoc_2007 \
--dataset_split_name=test \
--model_name=ssd_300_vgg \
--checkpoint_path=${TRAIN_DIR} \
--batch_size=1
But the mAPs are:
I tensorflow/core/kernels/logging_ops.cc:79] AP_VOC07/mAP[0.00016903662625678594]
I tensorflow/core/kernels/logging_ops.cc:79] AP_VOC12/mAP[1.9445653552220332e-05]
I couldn't figure out why the mAPs are so low.
Hello, @balancap! Would you mind telling me how the pretrained model in 'checkpoint_path' is restored, when 'checkpoint_path' in your train_ssd_network.py seems to be unused after its declaration?
@SunAriesCN see line 378 in train_ssd_network.py; in particular, see init_fn=tf_utils.get_init_fn(FLAGS). get_init_fn() at line 186 of tf_utils.py loads the latest checkpoint. There should also be an INFO TF logging/print statement to sanity-check that get_init_fn() loaded the correct checkpoint.
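For anyone who wants the gist without opening the files, a minimal sketch of what a slim-style get_init_fn() does (simplified; the repo's actual code at tf_utils.py line 186 may differ):

```python
# Simplified sketch of a slim-style checkpoint restore that skips the scopes
# listed in --checkpoint_exclude_scopes; not the repository's exact code.
import tensorflow as tf
slim = tf.contrib.slim

def get_init_fn(checkpoint_path, checkpoint_exclude_scopes):
    exclusions = [s.strip() for s in checkpoint_exclude_scopes.split(',')]
    variables_to_restore = [
        var for var in slim.get_model_variables()
        if not any(var.op.name.startswith(ex) for ex in exclusions)]
    # slim.learning.train calls the returned function once to load the weights.
    return slim.assign_from_checkpoint_fn(checkpoint_path,
                                          variables_to_restore,
                                          ignore_missing_vars=True)
```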
@balancap Can you share your learning_rate TensorBoard chart as well as your total_loss TensorBoard chart? I am experiencing similar behavior: no convergence on the training data even after 2+ days.
@villanuevab thank you, I found it. But I just found another problem with the SSD loss function: the divisor under the smooth L1 and softmax losses seems not to be the number of matched default boxes, as in the paper, but simply the batch size. Can someone tell me the reason for this?
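For comparison, the normalization described in the SSD paper divides both the localization (smooth L1) and confidence (softmax) terms by N, the number of matched default boxes, not by the batch size. A minimal sketch of that rule (illustrative, not the repo's loss code):

```python
import tensorflow as tf

def normalize_ssd_loss(loc_loss_sum, conf_loss_sum, positive_mask):
    """Divide the summed losses by N = number of matched (positive) anchors."""
    n_matched = tf.reduce_sum(tf.cast(positive_mask, tf.float32))
    # The paper sets the loss to 0 when there are no matched boxes; the max
    # below simply avoids a division by zero in that case.
    return (loc_loss_sum + conf_loss_sum) / tf.maximum(n_matched, 1.0)
```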
@SeougnSeon @edocoh87 @balancap just checking on the latest status of training with this codebase. I am going to try to go through the code and fix issues, but wanted to check here before I spend the time.
I potentially found one bug, though fixing it still does not help training. The matching of anchor boxes with ground-truth boxes has a bug: https://github.com/balancap/SSD-Tensorflow/blob/master/nets/ssd_common.py#L113-L114 @balancap why is it -0.5? Shouldn't line 113 be correct, yet you have commented it out? The matching strategy is also different from the paper: the paper ensures that each ground-truth box has at least one matched anchor, and I couldn't find this in your code, although I would still expect the loss to converge independently of this. Any thoughts welcome; in the meantime, I'll keep digging.
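For reference, a minimal NumPy sketch of the matching rule from the paper that the comment above points out is missing (each ground-truth box is force-matched to its best anchor, in addition to the usual overlap > 0.5 rule); this is illustrative and not the repo's TF implementation in ssd_common.py:

```python
import numpy as np

def match_anchors(iou, threshold=0.5):
    """iou: [num_gt, num_anchors] jaccard overlaps.
    Returns, for each anchor, the index of its matched ground truth (-1 = background)."""
    num_gt, num_anchors = iou.shape
    anchor_gt = np.full(num_anchors, -1, dtype=np.int64)

    # Rule 1: match every anchor whose best overlap exceeds the threshold.
    best_gt = iou.argmax(axis=0)
    best_iou = iou.max(axis=0)
    anchor_gt[best_iou > threshold] = best_gt[best_iou > threshold]

    # Rule 2: force-match each ground-truth box to its single best anchor,
    # so that no ground truth is left without at least one positive.
    best_anchor_per_gt = iou.argmax(axis=1)
    anchor_gt[best_anchor_per_gt] = np.arange(num_gt)
    return anchor_gt
```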
@siddharthm83, based on Paul's great code implementation, I made some changes and was able to make the training process work to some degree.
The SSD model is initialized with VGG-16 weights trained on ImageNet. The training data is VOC 2007 and 2012 trainval, the testing data is VOC 2007 test, and the final test accuracy is 0.65.
If you are interested, you can see here for more details.
@LevinJ Good job! Could you please list what changes you have made?
Sure, @Zehaos. I listed the major changes I made in the Experimentation section of that link.
@LevinJ Very clear! Thanks.
@LevinJ can you teach me how to train on my own data (thousands of pictures, but only one object class to detect)? I followed balancap's fine-tuning method, training from the pretrained vgg_16, but the loss cannot converge (it also stays near 4.0).
### What's the right way to train on my own data and use it to detect my own object?
"./tfrecords/voc2007" is the path I created with my own data (1920x1080). My training script is:
DATASET_DIR=./tfrecords/voc2007
TRAIN_DIR=./logs/my_chkp
CHECKPOINT_PATH=./checkpoints/vgg_16.ckpt
python3.4 train_ssd_network.py \
--train_dir=${TRAIN_DIR} \
--dataset_dir=${DATASET_DIR} \
--dataset_name=pascalvoc_2007 \
--dataset_split_name=train \
--model_name=ssd_300_vgg \
--checkpoint_path=${CHECKPOINT_PATH} \
--checkpoint_model_scope=vgg_16 \
--checkpoint_exclude_scopes=ssd_300_vgg/conv6,ssd_300_vgg/conv7,ssd_300_vgg/block8,ssd_300_vgg/block9,ssd_300_vgg/block10,ssd_300_vgg/block11,ssd_300_vgg/block4_box,ssd_300_vgg/block7_box,ssd_300_vgg/block8_box,ssd_300_vgg/block9_box,ssd_300_vgg/block10_box,ssd_300_vgg/block11_box \
--trainable_scopes=ssd_300_vgg/conv6,ssd_300_vgg/conv7,ssd_300_vgg/block8,ssd_300_vgg/block9,ssd_300_vgg/block10,ssd_300_vgg/block11,ssd_300_vgg/block4_box,ssd_300_vgg/block7_box,ssd_300_vgg/block8_box,ssd_300_vgg/block9_box,ssd_300_vgg/block10_box,ssd_300_vgg/block11_box \
--save_summaries_secs=60 \
--save_interval_secs=600 \
--weight_decay=0.0005 \
--optimizer=adam \
--learning_rate=0.001 \
--learning_rate_decay_factor=0.94 \
--batch_size=64
Have you got any good solutions to the problem? @youngwanLEE
@ithink2 which Python version did you use to run SSD-Tensorflow, Python 2 or Python 3?
@balancap which Python version did you use to run SSD-Tensorflow, Python 2 or Python 3?
@balancap which Python version did you use to run SSD-Tensorflow, Python 2 or Python 3? And which versions of TensorFlow, CUDA, and cuDNN did you use?
@ithink2 which Python version did you use to run SSD-Tensorflow, Python 2 or Python 3? And which versions of TensorFlow, CUDA, and cuDNN did you use?
@youngwanLEE Hello, I'm wondering how you achieved convergence of the model. And why is there a sudden decrease of the loss function at about 110k epochs?
Your work is very nice. I have a question about training. I trained on voc_2007_train and got the total loss shown below.
The total loss does not converge. When I use the Caffe version of SSD, the loss converges easily.
Did the loss converge for you on voc_2007_train? Detection result with the trained model: