balancap / SSD-Tensorflow

Single Shot MultiBox Detector in TensorFlow

training voc2007 #36

Open SeougnSeon opened 7 years ago

SeougnSeon commented 7 years ago

Your work is very nice. I have a question about training. I trained on voc_2007_train and got the total loss below: [total loss chart]

The total loss does not converge. When I use the Caffe version of SSD, the loss converges easily.

Did the loss converge for you on voc_2007_train? Detection result with the trained model: [detection screenshot]

SeougnSeon commented 7 years ago

I added more training data by combining VOC2007 and VOC2012. The total loss still does not converge. [total loss charts]

I used ssd_300_vgg as the pre-trained weights (fine-tuning an existing SSD checkpoint) instead of vgg_16 (fine-tuning a network trained on ImageNet). I will try training from vgg_16 next. I think your work is a very good way to learn TF-slim.

chenweiqian commented 7 years ago

I also have this problem. The total loss stays above 6.0 for a long time.

SeougnSeon commented 7 years ago

I got the same problem with the vgg_16 (fine-tuning a network trained on ImageNet) setting. [total loss chart] The loss is even larger than with ssd_300_vgg.

edocoh87 commented 7 years ago

I'm also having trouble converging with vgg_16. What mAP values did you achieve on evaluation?

chenweiqian commented 7 years ago

My mAP is close to zero. What mAP values did you achieve?

SeougnSeon commented 7 years ago

I didn't run the evaluation set after training; the training was only a temporary run to check that the code works.

The detection image is from the ipynb (notebook) code.

If the detections look correct, my next step is to evaluate the mAP.

christopher5106 commented 7 years ago

I ran into the same convergence problem. To help a bit, I used a fixed learning rate, but the loss still does not converge.

edocoh87 commented 7 years ago

I got an mAP of 0.27 after a couple of days on a 4-GPU machine. Note that the reported results are for training on 2007+2012 (which I'm running at the moment).

zhyhan commented 7 years ago

I got the same convergence problem: the global step is about 8000, but the loss is around 6 and the mAP is 0.026. I have no idea what is going wrong.

youngwanLEE commented 7 years ago

I got the same problem too. I really want to train from an ImageNet-pretrained model.

youngwanLEE commented 7 years ago

@edocoh87 Did you train the ssd model using 4 GPUs?

edocoh87 commented 7 years ago

Yes, 4 Titan X

youngwanLEE commented 7 years ago

@edocoh87 could you share your train script?

youngwanLEE commented 7 years ago

@edocoh87

DATASET_DIR=/home/ywlee/data/VOC2007_TFRecords
TRAIN_DIR=./logs/vgg_300_0404
CHECKPOINT_PATH=./checkpoints/vgg_16.ckpt
python train_ssd_network.py \
    --train_dir=${TRAIN_DIR} \
    --dataset_dir=${DATASET_DIR} \
    --dataset_name=pascalvoc_2007 \
    --dataset_split_name=train \
    --model_name=ssd_300_vgg \
    --checkpoint_path=${CHECKPOINT_PATH} \
    --checkpoint_exclude_scopes=ssd_300_vgg/conv6,ssd_300_vgg/conv7,ssd_300_vgg/block8,ssd_300_vgg/block9,ssd_300_vgg/block10,ssd_300_vgg/block11,ssd_300_vgg/block4_box,ssd_300_vgg/block7_box,ssd_300_vgg/block8_box,ssd_300_vgg/block9_box,ssd_300_vgg/block10_box,ssd_300_vgg/block11_box \
    --checkpoint_model_scope=vgg_16 \
    --save_summaries_secs=60 \
    --save_interval_secs=600 \
    --weight_decay=0.0005 \
    --optimizer=rmsprop \
    --learning_rate=0.005 \
    --num_epochs_per_decay=10 \
    --batch_size=32 \
    --max_number_of_steps=200000 \
    --num_clones=4

When I set the num_clones=4 argument in the command, I got this error:

InvalidArgumentError (see above for traceback): Cannot assign a device to node 'clone_3/ssd_losses/block_5/cross_entropy_neg/SparseSoftmaxCrossEntropyWithLogits/Gather': Could not satisfy explicit device specification '/device:GPU:3' because no supported kernel for GPU devices is available.
         [[Node: clone_3/ssd_losses/block_5/cross_entropy_neg/SparseSoftmaxCrossEntropyWithLogits/Gather = Gather[Tindices=DT_INT32, Tparams=DT_INT32, validate_indices=true, _device="/device:GPU:3"](clone_3/ssd_losses/block_5/cross_entropy_neg/SparseSoftmaxCrossEntropyWithLogits/Shape_1, clone_3/ssd_losses/block_5/cross_entropy_neg/SparseSoftmaxCrossEntropyWithLogits/sub)]]

Could you let me know how to set up multi-GPU training?
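For what it's worth, a common TensorFlow 1.x workaround for this kind of "no supported kernel for GPU devices" placement error is to enable soft device placement so that ops without a GPU kernel fall back to the CPU. The sketch below only shows the general mechanism, not a patch from this repo; wiring `session_config` into `slim.learning.train()` is an assumption about how the script is structured:

```python
# Hedged sketch (not from this repo): allow TensorFlow to place ops without a
# GPU kernel on the CPU instead of failing with an explicit-device error.
import tensorflow as tf

session_config = tf.ConfigProto(
    allow_soft_placement=True,    # fall back to CPU when no GPU kernel exists
    log_device_placement=False)   # set True to see where each op actually runs

# If the training loop uses slim.learning.train(), the config can be passed via
# its session_config argument (an assumption about the script's wiring):
# slim.learning.train(train_op, logdir, session_config=session_config, ...)
```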

paolutan commented 7 years ago

I also got the same problem. Has anyone solved it?

ggookey123 commented 7 years ago

I have the same problem. Does anyone have a solution?

balancap commented 7 years ago

I am currently experimenting with how to fix the training and have set up a special fix_training branch. Among other things I have noticed so far: the loss function needed attention, so I changed it to completely copy the SSD Caffe settings.

paolutan commented 7 years ago

I kept training for 80,000 steps (fine-tuning from ssd_300_vgg.ckpt). I found that although the loss stays between 3.0 and 6.0 most of the time, the mAP keeps increasing. In the end I reached 70% mAP on VOC07 and 72% mAP on VOC12.

I suspect the training process is correct but simply converges very slowly.

balancap commented 7 years ago

@ithink2 Thanks for the testing. I am working on fixing this training problem, aiming for at least ~0.7 mAP starting from the VGG weights. Things are getting a bit better (you can have a look at the fix_training branch). I implemented hard negative mining equivalent to SSD Caffe's, and I am looking at how to improve the data augmentation part.
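For anyone following along, SSD-Caffe-style hard negative mining keeps all positive anchors but only the highest-loss negatives, at roughly a 3:1 negative-to-positive ratio, and normalizes by the number of matched boxes. Below is a minimal, hedged sketch of that idea (illustrative only, not the code on the fix_training branch; the function name and shapes are assumptions):

```python
# Minimal sketch of SSD-Caffe-style hard negative mining (illustrative only).
import tensorflow as tf

def hard_mined_conf_loss(conf_loss, positive_mask, negative_ratio=3.0):
    """conf_loss: [num_anchors] per-anchor cross-entropy loss.
    positive_mask: [num_anchors] boolean mask of matched (positive) anchors."""
    num_pos = tf.reduce_sum(tf.cast(positive_mask, tf.int32))
    pos_loss = tf.boolean_mask(conf_loss, positive_mask)

    # Zero out positives, then keep only the k hardest negatives (largest loss),
    # with k = negative_ratio * num_positives as in the SSD paper / Caffe code.
    neg_loss = tf.where(positive_mask, tf.zeros_like(conf_loss), conf_loss)
    k = tf.cast(negative_ratio * tf.cast(num_pos, tf.float32), tf.int32)
    k = tf.maximum(tf.minimum(k, tf.size(neg_loss) - num_pos), 1)
    hard_neg_loss, _ = tf.nn.top_k(neg_loss, k=k)

    # Normalize by the number of matched default boxes, not the batch size.
    normalizer = tf.maximum(tf.cast(num_pos, tf.float32), 1.0)
    return (tf.reduce_sum(pos_loss) + tf.reduce_sum(hard_neg_loss)) / normalizer
```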

youngwanLEE commented 7 years ago

I got this training result with vgg_16 (fine-tuning a network trained on ImageNet) on VOC07, after 4 days on 1 GPU.

[training curves]

train script:

DATASET_DIR=/home/ywlee/data/VOC2007_TFRecords
CHECKPOINT_PATH=./checkpoints/vgg_16.ckpt
python train_ssd_network.py \
    --train_dir=${TRAIN_DIR} \
    --dataset_dir=${DATASET_DIR} \
    --dataset_name=pascalvoc_2007 \
    --dataset_split_name=train \
    --model_name=ssd_300_vgg \
    --checkpoint_path=${CHECKPOINT_PATH} \
    --checkpoint_exclude_scopes=ssd_300_vgg/conv6,ssd_300_vgg/conv7,ssd_300_vgg/block8,ssd_300_vgg/block9,ssd_300_vgg/block10,ssd_300_vgg/block11,ssd_300_vgg/block4_box,ssd_300_vgg/block7_box,ssd_300_vgg/block8_box,ssd_300_vgg/block9_box,ssd_300_vgg/block10_box,ssd_300_vgg/block11_box \
    --checkpoint_model_scope=vgg_16 \
    --save_summaries_secs=60 \
    --save_interval_secs=600 \
    --weight_decay=0.0005 \
    --optimizer=rmsprop \
    --learning_rate=0.001 \
    --num_epochs_per_decay=200 \
    --batch_size=32 \
    --learning_rate_decay_factor=0.94 

evaluation script:

TRAIN_DIR=/home/ywlee/SSD-Tensorflow/logs/vgg_300_0405/model.ckpt-468031
DATASET_DIR=/home/ywlee/data/VOC2007_TFRecords
EVAL_DIR=${TRAIN_DIR}/eval
python eval_ssd_network.py \
    --eval_dir=${EVAL_DIR} \
    --dataset_dir=${DATASET_DIR} \
    --dataset_name=pascalvoc_2007 \
    --dataset_split_name=test \
    --model_name=ssd_300_vgg \
    --checkpoint_path=${TRAIN_DIR} \
    --batch_size=1

But the mAPs are:

I tensorflow/core/kernels/logging_ops.cc:79] AP_VOC07/mAP[0.00016903662625678594]
I tensorflow/core/kernels/logging_ops.cc:79] AP_VOC12/mAP[1.9445653552220332e-05]

I couldn't figure out why the mAPs are so low.

SunAriesCN commented 7 years ago

Hello @balancap! Would you mind telling me how the pretrained model in 'checkpoint_path' gets restored? The 'checkpoint_path' flag in your train_ssd_network.py seems to be unused after its declaration.

villanuevab commented 7 years ago

@SunAriesCN see line 378 in train_ssd_network.py; in particular, see init_fn=tf_utils.get_init_fn(FLAGS). get_init_fn() in line 186 of tf_utils.py loads the latest checkpoint. There should also be an INFO TF logging/print statement to sanity check that get_init_fn() loaded the correct checkpoint.
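For context, a TF-slim `get_init_fn` of this kind typically builds a restore function from the checkpoint path and the `checkpoint_exclude_scopes` / `checkpoint_model_scope` flags; that function is only invoked when the training session starts, which is why the flag can look unused right after its declaration. A hedged sketch of the usual pattern (illustrative names, not the repo's exact code):

```python
# Hedged sketch of a typical TF-slim init_fn (illustrative, not the repo's code).
import tensorflow as tf
slim = tf.contrib.slim

def get_init_fn(checkpoint_path, exclude_scopes='', model_scope=None):
    exclusions = [s.strip() for s in exclude_scopes.split(',') if s.strip()]
    variables_to_restore = slim.get_variables_to_restore(exclude=exclusions)
    if model_scope:
        # Remap variable names when the checkpoint uses a different scope,
        # e.g. restoring vgg_16/* weights into ssd_300_vgg/* variables.
        variables_to_restore = {
            var.op.name.replace('ssd_300_vgg', model_scope): var
            for var in variables_to_restore}
    # Returns a function(sess) that performs the restore when training starts.
    return slim.assign_from_checkpoint_fn(
        checkpoint_path, variables_to_restore, ignore_missing_vars=True)
```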

@balancap Can you share your learning_rate and total_loss TensorBoard charts? I am experiencing similar behavior: no convergence on the training data even after 2+ days.

SunAriesCN commented 7 years ago

@villanuevab thank you, I found it. But I noticed another problem with the SSD loss function: the divisor under the smooth L1 and softmax losses does not seem to be the number of matched default boxes, as in the paper; it is just the batch size. Can someone tell me the reason for this?
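For reference, the SSD paper defines the overall objective as the confidence (softmax) and localization (smooth L1) losses normalized by N, the number of matched default boxes, with the loss set to 0 when N = 0:

$$L(x, c, l, g) = \frac{1}{N}\left(L_{conf}(x, c) + \alpha\, L_{loc}(x, l, g)\right)$$

So dividing by the batch size instead of N would indeed differ from the paper's formulation.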

siddharthm83 commented 7 years ago

@SeougnSeon @edocoh87 @balancap just checking on the latest status of training with this codebase. I am going to try going through the code and fixing issues, but wanted to check here before I spend the time.

siddharthm83 commented 7 years ago

I potentially found one bug, though fixing it has not helped training yet. The matching of anchor boxes with ground-truth boxes looks wrong: https://github.com/balancap/SSD-Tensorflow/blob/master/nets/ssd_common.py#L113-L114 @balancap why is it -0.5? Shouldn't line 113 be correct? Yet it is commented out. The matching strategy also differs from the paper: the paper ensures that each ground-truth box has at least one matched anchor, and I couldn't find this in your code, although I would still expect the loss to converge without it. Any thoughts are welcome; in the meantime, I'll keep digging.
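For reference, the paper's matching strategy can be summarized in two steps: force-match every ground-truth box to its best-overlapping anchor, then additionally match any anchor whose best IoU with some ground truth exceeds 0.5. A minimal NumPy sketch of that idea (illustrative only, not this repo's code; the function name and shapes are assumptions):

```python
# Minimal sketch of the SSD paper's anchor/ground-truth matching (illustrative).
import numpy as np

def match_anchors(iou, threshold=0.5):
    """iou: [num_anchors, num_gt] Jaccard overlap matrix.
    Returns, per anchor, the index of the matched gt box (-1 means background)."""
    num_anchors, num_gt = iou.shape
    matches = np.full(num_anchors, -1, dtype=np.int64)

    # Step 2: match each anchor to its best gt box if the overlap exceeds 0.5.
    best_gt = iou.argmax(axis=1)
    best_gt_iou = iou.max(axis=1)
    matches[best_gt_iou > threshold] = best_gt[best_gt_iou > threshold]

    # Step 1 (applied last so it wins): force-match each gt box to its single
    # best anchor, guaranteeing every ground truth has at least one positive.
    best_anchor = iou.argmax(axis=0)            # shape [num_gt]
    matches[best_anchor] = np.arange(num_gt)
    return matches
```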

LevinJ commented 7 years ago

@siddharthm83, based on Paul's great code implementation, I made some changes and was able to get the training process working to some degree.

[train_eval chart]

[total_loss chart]

The SSD model is initialized with VGG-16 weights trained on ImageNet. The training data is VOC 2007 + 2012 trainval, the test data is VOC 2007 test, and the final test accuracy is 0.65.

If you are interested, you can see here for more details.

Zehaos commented 7 years ago

@LevinJ Good job! Could you please list what changes you made?

LevinJ commented 7 years ago

Sure, @Zehaos. I listed the major changes I made in the Experimentation section of this link.

Zehaos commented 7 years ago

@LevinJ Very clear! Thanks.

seasonyang commented 7 years ago

@LevinJ can you teach me how to train on my own data (thousands of pictures, with only one object class to detect)? I followed balancap's fine-tuning method, training from the pretrained vgg_16, but the loss does not converge (it also stays near 4.0).

### What's the right way to train on my own data and use the model to detect my own object?

"./tfrecords/voc2007" is the path to the TFRecords I created from my own data (1920x1080 images). My training script is:

DATASET_DIR=./tfrecords/voc2007
TRAIN_DIR=./logs/my_chkp
CHECKPOINT_PATH=./checkpoints/vgg_16.ckpt
python3.4 train_ssd_network.py \
    --train_dir=${TRAIN_DIR} \
    --dataset_dir=${DATASET_DIR} \
    --dataset_name=pascalvoc_2007 \
    --dataset_split_name=train \
    --model_name=ssd_300_vgg \
    --checkpoint_path=${CHECKPOINT_PATH} \
    --checkpoint_model_scope=vgg_16 \
    --checkpoint_exclude_scopes=ssd_300_vgg/conv6,ssd_300_vgg/conv7,ssd_300_vgg/block8,ssd_300_vgg/block9,ssd_300_vgg/block10,ssd_300_vgg/block11,ssd_300_vgg/block4_box,ssd_300_vgg/block7_box,ssd_300_vgg/block8_box,ssd_300_vgg/block9_box,ssd_300_vgg/block10_box,ssd_300_vgg/block11_box \
    --trainable_scopes=ssd_300_vgg/conv6,ssd_300_vgg/conv7,ssd_300_vgg/block8,ssd_300_vgg/block9,ssd_300_vgg/block10,ssd_300_vgg/block11,ssd_300_vgg/block4_box,ssd_300_vgg/block7_box,ssd_300_vgg/block8_box,ssd_300_vgg/block9_box,ssd_300_vgg/block10_box,ssd_300_vgg/block11_box \
    --save_summaries_secs=60 \
    --save_interval_secs=600 \
    --weight_decay=0.0005 \
    --optimizer=adam \
    --learning_rate=0.001 \
    --learning_rate_decay_factor=0.94 \
    --batch_size=64

ccyyy commented 6 years ago

Have you got any good solutions to the problem? @youngwanLEE

shiyuangogogo commented 6 years ago

@ithink2 which Python version did you use to run SSD-Tensorflow, Python 2 or Python 3?

shiyuangogogo commented 6 years ago

@balancap which Python version did you use to run SSD-Tensorflow, Python 2 or Python 3?

shiyuangogogo commented 6 years ago

@balancap which Python version did you use to run SSD-Tensorflow, Python 2 or Python 3? And which versions of TensorFlow, CUDA, and cuDNN did you use?

shiyuangogogo commented 6 years ago

@ithink2 which Python version did you use to run SSD-Tensorflow, Python 2 or Python 3? And which versions of TensorFlow, CUDA, and cuDNN did you use?

qianweilzh commented 3 years ago

> (quoting @youngwanLEE's earlier comment with the training/evaluation scripts and near-zero mAP results)

@youngwanLEE Hello, I'm wondering how you achieved convergence of the model. And why is there a sudden decrease in the loss function at about 110k epochs?