Open derekhh opened 7 years ago
I am getting similar results, but with a much higher training loss that never drops below 40 in some cases (even after running ~100k steps).
I am using all default scripts, running with:
python train_ssd_network.py \
    --train_dir=/summary/ssd_300_pascal \
    --dataset_dir=/home/blanca/my_data/voc2007_tfrecords/ \
    --checkpoint_path=./checkpoints/ssd_300_vgg.ckpt \
    --checkpoint_exclude_scopes=ssd_300_vgg/block4_box,ssd_300_vgg/block7_box,ssd_300_vgg/block8_box,ssd_300_vgg/block9_box,ssd_300_vgg/block10_box,ssd_300_vgg/block11_box \
    --dataset_name=pascalvoc_2007 \
    --dataset_split_name=train \
    --model_name=ssd_300_vgg \
    --save_summaries_secs=60 \
    --save_interval_secs=60 \
    --weight_decay=0.0005 \
    --learning_rate=0.001 \
    --learning_rate_decay_factor=0.96 \
    --batch_size=32 \
    --gpu_memory_fraction=0.8 \
    --num_classes=21
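For what it's worth, `--checkpoint_exclude_scopes` just filters which checkpoint variables get restored by name prefix, so excluding all the `*_box` scopes means the box-prediction layers start from fresh initialization. A minimal sketch of that filtering (the variable names and the helper below are illustrative, not the repo's exact code):

```python
# Hedged sketch of prefix-based variable filtering, as done for
# --checkpoint_exclude_scopes. Names are made up for illustration.

def variables_to_restore(all_var_names, exclude_scopes):
    """Keep every variable whose name does not start with an excluded scope."""
    exclusions = [s.strip() for s in exclude_scopes.split(",")]
    return [v for v in all_var_names
            if not any(v.startswith(e) for e in exclusions)]

all_vars = [
    "ssd_300_vgg/conv1/conv1_1/weights",
    "ssd_300_vgg/block4_box/conv_loc/weights",
    "ssd_300_vgg/block7_box/conv_cls/weights",
]
kept = variables_to_restore(
    all_vars,
    "ssd_300_vgg/block4_box,ssd_300_vgg/block7_box")
print(kept)  # only the conv1 weight survives the filter
```

So even when restoring from a fully trained checkpoint, the excluded box layers train from scratch, which alone can keep the early loss high.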
And yet I still achieve a high mAP on the training set (this is for sanity check purposes. I have not tested on Pascal VOC test data yet):
INFO:tensorflow:Evaluation [5011/5011]
I tensorflow/core/kernels/logging_ops.cc:79] AP_VOC07/mAP[0.82002322122807825]
I tensorflow/core/kernels/logging_ops.cc:79] AP_VOC12/mAP[0.855447542894266]
INFO:tensorflow:Finished evaluation at 2017-05-03-16:47:24
Time spent : 430.635 seconds.
Time spent per BATCH: 0.086 seconds.
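(The per-batch figure is just the total time divided by the 5011 evaluation batches, i.e. one image per batch here; a quick check, assuming that's how the script computes it:)

```python
# Sanity check of the reported eval timing: total time / number of batches.
total_seconds = 430.635
num_batches = 5011  # evaluation appears to run one image per batch
print(round(total_seconds / num_batches, 3))  # 0.086
```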
Has anyone solved this issue?
In general, regularization losses are not included in model evaluation, so the loss reported during training highly overestimates the loss you would infer from the evaluation metrics.
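To make that concrete, here's a rough numerical sketch (all numbers are illustrative assumptions, not measured values): with `weight_decay=0.0005` summed over the millions of VGG weights, the L2 penalty alone can dwarf the localization + classification loss, so the printed "total loss" stays large even for a well-converged model.

```python
import numpy as np

# Illustrative only: a made-up data loss plus an L2 penalty over a
# VGG-sized weight vector, to show how the regularization term can
# dominate the total training loss.

rng = np.random.default_rng(0)
weight_decay = 0.0005
weights = rng.normal(scale=0.05, size=20_000_000)  # ~20M weights, assumed scale

data_loss = 2.5                                 # assumed converged data loss
reg_loss = weight_decay * np.sum(weights ** 2)  # L2 penalty (illustrative form)
total_loss = data_loss + reg_loss

print(f"data loss  : {data_loss:.2f}")
print(f"reg  loss  : {reg_loss:.2f}")
print(f"total loss : {total_loss:.2f}")
```

Under these assumptions the regularization term is roughly ten times the data loss, which is consistent with a training loss that "never drops" despite high mAP.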
Description
tl;dr: The pre-trained SSD checkpoint has a huge loss while training on the VOC2007 train + val dataset and it doesn't seem to be anywhere near convergence.
I was trying to use the pre-trained SSD checkpoint here to fine-tune on the VOC 2007 train+val dataset. I understand this doesn't make real sense, as the SSD model is already pre-trained on the VOC dataset, but I wanted to try it as a sanity check after cloning the repo. Initially I followed this training script. The loss seems pretty huge even at the initial step, which seems very weird. Then I noticed the learning rate was probably too high for fine-tuning and changed it to something like 1e-5 and 1e-6. I'm still observing a huge training loss and it doesn't seem to be converging.

DATASET_DIR=./tfrecords
TRAIN_DIR=./logs/
CHECKPOINT_PATH=./checkpoints/ssd_300_vgg.ckpt
python train_ssd_network.py \
    --train_dir=${TRAIN_DIR} \
    --dataset_dir=${DATASET_DIR} \
    --dataset_name=pascalvoc_2012 \
    --dataset_split_name=train \
    --model_name=ssd_300_vgg \
    --checkpoint_path=${CHECKPOINT_PATH} \
    --save_summaries_secs=60 \
    --save_interval_secs=600 \
    --weight_decay=0.0005 \
    --optimizer=adam \
    --learning_rate=0.001 \
    --batch_size=32
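As a side note on the learning-rate flags: `--learning_rate` and `--learning_rate_decay_factor` typically feed an exponential-decay schedule, so the effective rate only falls gradually from the base value. A minimal sketch (the `decay_steps` value below is an assumption for illustration, not the script's actual default):

```python
# Hedged sketch of an exponential learning-rate decay schedule, as used by
# TF training scripts with --learning_rate / --learning_rate_decay_factor.

def exponential_decay(lr0, decay_factor, step, decay_steps):
    """Learning rate after `step` steps, decaying every `decay_steps` steps."""
    return lr0 * decay_factor ** (step / decay_steps)

lr0, factor, decay_steps = 0.001, 0.96, 2000  # decay_steps is an assumption
for step in (0, 10_000, 50_000, 100_000):
    print(step, exponential_decay(lr0, factor, step, decay_steps))
```

With Adam, a base rate of 1e-3 is fairly aggressive for fine-tuning a converged checkpoint, which is presumably why dropping to 1e-5 or 1e-6 was worth trying; but as noted above, a high reported loss by itself may just reflect the regularization term.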
I've then evaluated the SSD checkpoint quality on both the VOC07 train+val dataset and the VOC07 test dataset to make sure the quality of the checkpoint is OK. I'm getting decent mAP values on both, which confuses me a lot, as I feel this should mean the loss is low while training on VOC07 train+val.

I tensorflow/core/kernels/logging_ops.cc:79] AP_VOC07/mAP[0.81998744666155454]
I tensorflow/core/kernels/logging_ops.cc:79] AP_VOC12/mAP[0.85548588208867282]
INFO:tensorflow:Finished evaluation at 2017-04-19-18:02:18
Time spent : 301.644 seconds.
Time spent per BATCH: 0.481 seconds.
Can @balancap help take a look? And thanks again for sharing this repo with us!
@derekhh Can you share your eval.sh script? I'm wondering if you evaluated the ckpt before training?