balancap / SSD-Tensorflow

Single Shot MultiBox Detector in TensorFlow

about eval map. Please help me. Thank you!! #333

Open JiangniHIT opened 5 years ago

JiangniHIT commented 5 years ago

WARNING:tensorflow:From /usr/local/lib/python3.5/dist-packages/tensorflow/python/training/input.py:187: QueueRunner.__init__ (from tensorflow.python.training.queue_runner_impl) is deprecated and will be removed in a future version. Instructions for updating: To construct input pipelines, use the tf.data module.
WARNING:tensorflow:From /usr/local/lib/python3.5/dist-packages/tensorflow/python/training/input.py:187: add_queue_runner (from tensorflow.python.training.queue_runner_impl) is deprecated and will be removed in a future version. Instructions for updating: To construct input pipelines, use the tf.data module.
WARNING:tensorflow:From eval_ssd_network.py:231: streaming_mean (from tensorflow.contrib.metrics.python.ops.metric_ops) is deprecated and will be removed in a future version. Instructions for updating: Please switch to tf.metrics.mean
INFO:tensorflow:Evaluating ./aug_ckout/model.ckpt-8415
INFO:tensorflow:Starting evaluation at 2019-03-19-06:36:41
INFO:tensorflow:Graph was finalized.
2019-03-19 14:36:41.644905: I tensorflow/core/platform/cpu_feature_guard.cc:141] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX2 FMA
2019-03-19 14:36:41.722446: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:964] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2019-03-19 14:36:41.722789: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1411] Found device 0 with properties: name: GeForce GTX 1060 major: 6 minor: 1 memoryClockRate(GHz): 1.6705 pciBusID: 0000:01:00.0 totalMemory: 5.93GiB freeMemory: 5.35GiB
2019-03-19 14:36:41.722803: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1490] Adding visible gpu devices: 0
2019-03-19 14:36:41.911495: I tensorflow/core/common_runtime/gpu/gpu_device.cc:971] Device interconnect StreamExecutor with strength 1 edge matrix:
2019-03-19 14:36:41.911527: I tensorflow/core/common_runtime/gpu/gpu_device.cc:977] 0
2019-03-19 14:36:41.911533: I tensorflow/core/common_runtime/gpu/gpu_device.cc:990] 0: N
2019-03-19 14:36:41.911693: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1103] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 607 MB memory) -> physical GPU (device: 0, name: GeForce GTX 1060, pci bus id: 0000:01:00.0, compute capability: 6.1)
INFO:tensorflow:Restoring parameters from ./aug_ckout/model.ckpt-8415
INFO:tensorflow:Running local_init_op.
INFO:tensorflow:Done running local_init_op.
WARNING:tensorflow:From /usr/local/lib/python3.5/dist-packages/tensorflow/python/training/monitored_session.py:804: start_queue_runners (from tensorflow.python.training.queue_runner_impl) is deprecated and will be removed in a future version. Instructions for updating: To construct input pipelines, use the tf.data module.
2019-03-19 14:36:45.470999: W tensorflow/core/common_runtime/bfc_allocator.cc:215] Allocator (GPU_0_bfc) ran out of memory trying to allocate 828.12MiB. The caller indicates that this is not a failure, but may mean that there could be performance gains if more memory were available.
2019-03-19 14:36:45.513889: W tensorflow/core/common_runtime/bfc_allocator.cc:215] Allocator (GPU_0_bfc) ran out of memory trying to allocate 1.54GiB. The caller indicates that this is not a failure, but may mean that there could be performance gains if more memory were available.
2019-03-19 14:36:45.533980: W tensorflow/core/common_runtime/bfc_allocator.cc:215] Allocator (GPU_0_bfc) ran out of memory trying to allocate 610.31MiB. The caller indicates that this is not a failure, but may mean that there could be performance gains if more memory were available.
2019-03-19 14:36:45.569713: W tensorflow/core/common_runtime/bfc_allocator.cc:215] Allocator (GPU_0_bfc) ran out of memory trying to allocate 814.50MiB. The caller indicates that this is not a failure, but may mean that there could be performance gains if more memory were available.
2019-03-19 14:36:45.665355: W tensorflow/core/common_runtime/bfc_allocator.cc:215] Allocator (GPU_0_bfc) ran out of memory trying to allocate 550.42MiB. The caller indicates that this is not a failure, but may mean that there could be performance gains if more memory were available.
2019-03-19 14:36:45.684629: W tensorflow/core/common_runtime/bfc_allocator.cc:215] Allocator (GPU_0_bfc) ran out of memory trying to allocate 1.07GiB. The caller indicates that this is not a failure, but may mean that there could be performance gains if more memory were available.
2019-03-19 14:36:45.707972: W tensorflow/core/common_runtime/bfc_allocator.cc:215] Allocator (GPU_0_bfc) ran out of memory trying to allocate 2.14GiB. The caller indicates that this is not a failure, but may mean that there could be performance gains if more memory were available.
2019-03-19 14:36:45.708826: W tensorflow/core/common_runtime/bfc_allocator.cc:215] Allocator (GPU_0_bfc) ran out of memory trying to allocate 1.07GiB. The caller indicates that this is not a failure, but may mean that there could be performance gains if more memory were available.
2019-03-19 14:36:45.771061: W tensorflow/core/common_runtime/bfc_allocator.cc:215] Allocator (GPU_0_bfc) ran out of memory trying to allocate 1.18GiB. The caller indicates that this is not a failure, but may mean that there could be performance gains if more memory were available.
2019-03-19 14:36:45.775322: W tensorflow/core/common_runtime/bfc_allocator.cc:215] Allocator (GPU_0_bfc) ran out of memory trying to allocate 2.35GiB. The caller indicates that this is not a failure, but may mean that there could be performance gains if more memory were available.
INFO:tensorflow:Evaluation [97/976]
INFO:tensorflow:Evaluation [194/976]
INFO:tensorflow:Evaluation [291/976]
INFO:tensorflow:Evaluation [388/976]
INFO:tensorflow:Evaluation [485/976]
INFO:tensorflow:Evaluation [582/976]
INFO:tensorflow:Evaluation [679/976]
INFO:tensorflow:Evaluation [776/976]
INFO:tensorflow:Evaluation [873/976]
INFO:tensorflow:Evaluation [970/976]
INFO:tensorflow:Evaluation [976/976]
2019-03-19 14:37:47.028473: W ./tensorflow/core/grappler/optimizers/graph_optimizer_stage.h:237] Failed to run optimizer ArithmeticOptimizer, stage HoistCommonFactor. Error: Node average_precision_voc07/ArithmeticOptimizer/HoistCommonFactor_Add_AddN is missing output properties at position :0 (num_outputs=0)
AP_VOC07/mAP[0.031098607904279731]
AP_VOC12/mAP[0.0036915150366450228]
INFO:tensorflow:Finished evaluation at 2019-03-19-06:37:51
Time spent : 69.925 seconds.
Time spent per BATCH: 0.072 seconds.
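
For reference, the log above comes from eval_ssd_network.py; the invocation in the repo README looks roughly like the following sketch (the paths are placeholders):

python3 eval_ssd_network.py \
    --eval_dir=${EVAL_DIR} \
    --dataset_dir=${DATASET_DIR} \
    --dataset_name=pascalvoc_2007 \
    --dataset_split_name=test \
    --model_name=ssd_300_vgg \
    --checkpoint_path=${CHECKPOINT_PATH} \
    --batch_size=1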

cjnjuwhy commented 5 years ago

Maybe something is wrong with your model; according to other issues, it is hard to get a high result when training from scratch.

ylqi007 commented 5 years ago

@JiangniHIT @cjnjuwhy I trained from scratch on PASCAL VOC 2007 and also on PASCAL VOC 2012 for about 20000 steps, but the test results look almost the same as yours, @JiangniHIT.

CHECKPOINT_PATH=./checkpoints/ssd_300_vgg.ckpt
DATASET_DIR=/tmp/tfrecords_pascal_2007
TRAIN_DIR=./logs/PASCAL_VOC_2007/Original_Without_ckpt/

python3 train_ssd_network.py \
    --train_dir=${TRAIN_DIR} \
    --dataset_dir=${DATASET_DIR} \
    --dataset_name=pascalvoc_2007 \
    --num_classes=21 \
    --dataset_split_name=train \
    --model_name=ssd_300_vgg \
    --save_summaries_secs=60 \
    --save_interval_secs=600 \
    --optimizer=adam \
    --max_number_of_steps=20000 \
    --learning_rate=0.0001 \
    --batch_size=32 \
    --match_threshold=0.5

Evaluation result on Pascal VOC 2007 after 15000 steps: [screenshot: Original_Without_ckpt_15000_0 001]

Evaluation result on Pascal VOC 2012 after 7000 steps: [screenshot: Original_Without_ckpt_7000]

There are two possible reasons, I guess:

  1. The training steps are not enough;
  2. The code does not work.

For now, I lean toward the second reason: the code has a problem.

cjnjuwhy commented 5 years ago

@ylqi007 Actually, the code doesn't have a problem, nor are the iterations really enough to get a result above 0.10. I figured out that training from scratch is different from fine-tuning: you need to increase the learning_rate and reduce the decay so that the learner can learn fast in the beginning. I tried:

With lr=0.01 and decay=0.0001, after 50k steps I get mAP 0.338.
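
In flag form, that experiment would look something like the sketch below (assuming the quoted decay maps to the script's --weight_decay flag; adjust if the learning-rate decay factor was meant instead):

python3 train_ssd_network.py \
    --train_dir=${TRAIN_DIR} \
    --dataset_dir=${DATASET_DIR} \
    --dataset_name=pascalvoc_2007 \
    --dataset_split_name=train \
    --model_name=ssd_300_vgg \
    --optimizer=adam \
    --learning_rate=0.01 \
    --weight_decay=0.0001 \
    --batch_size=32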

[screenshot: 2019-04-04]

ylqi007 commented 5 years ago

@cjnjuwhy I really appreciate your patience. I changed the learning rate from 0.001 to 0.00001 during training and trained for 60k steps, but the result is still very poor. How did you change the learning rate and weight decay? From your result, the mAP does increase, but can you reproduce a result like 0.743?

Sincerely

cjnjuwhy commented 5 years ago

@ylqi007 Sorry to reply to your question so late; I have actually been training the model these days. The best performance I have got is mAP 0.46, and I'm still trying to improve it, but the training process is really time-consuming. I set lr=0.01, decay_factor=0.97, and a mini-batch of 16, so after 50k steps (around 160 epochs) the learning rate drops from 0.01 to about 0.001, which gives the mAP of 0.46. After these trials, I find that lr=0.01 is a proper starting point for training from scratch, and you should keep lr > 0.001 for a long period (for example, 100k steps). I'm still wondering how to get a higher mAP and whether training from scratch can reach a high mAP at all. 😐
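
As a rough sanity check on that schedule: with exponential decay, the learning rate after N steps is lr0 * decay_factor^(N / decay_steps). A minimal sketch, assuming the script's default num_epochs_per_decay=2 and a VOC07 trainval size of 5011 images (both are assumptions here):

python3 -c "decay_steps = 2 * 5011 / 16; print(0.01 * 0.97 ** (50000 / decay_steps))"
# prints roughly 0.0009, consistent with the drop from 0.01 to about 0.001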

ylqi007 commented 5 years ago

@cjnjuwhy I have been training the model recently, and I have found that if you start training from scratch, it is difficult to get a high mAP, let alone reproduce the published results. If you fine-tune from ssd_300_vgg.ckpt, the performance actually decreases: evaluating ssd_300_vgg.ckpt directly gives 0.815, while the fine-tuned model only gets around 0.74.
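
For reference, a fine-tuning run from ssd_300_vgg.ckpt along the lines of the repo README would look roughly like this sketch (paths and flag values are placeholders taken from the README, not verified optima):

python3 train_ssd_network.py \
    --train_dir=${TRAIN_DIR} \
    --dataset_dir=${DATASET_DIR} \
    --dataset_name=pascalvoc_2007 \
    --dataset_split_name=train \
    --model_name=ssd_300_vgg \
    --checkpoint_path=./checkpoints/ssd_300_vgg.ckpt \
    --save_summaries_secs=60 \
    --save_interval_secs=600 \
    --weight_decay=0.0005 \
    --optimizer=adam \
    --learning_rate=0.001 \
    --batch_size=32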

cjnjuwhy commented 5 years ago

@ylqi007 The learning rate really matters; for a stable model, it should be small enough to ensure stability. I have tried many times and still can't get an mAP above 0.5, sadly.

Pabalachina commented 5 years ago

> @ylqi007 The learning rate really matters; for a stable model, it should be small enough to ensure stability. I have tried many times and still can't get an mAP above 0.5, sadly.

Hello, I also encountered the low mAP problem, but I trained the ssd_512_vgg model and fine-tuned from a pretrained vgg_16.ckpt. I trained on VOC2007 and reached a best mAP of 0.556. After that, I tried reducing the learning rate, but the mAP curve just doesn't rise. I also expanded trainval by combining VOC07 and VOC12, but the mAP declines. Have you solved your problem?

yuchanWang commented 5 years ago

My object detection task only contains the person class, and I want to detect pedestrians from an overhead view, so I changed num_class and the voc_label. When I used pre-trained checkpoints like ssd_300_vgg or vgg_16, I ran into a mismatch between my graph and the checkpoint graph, which I couldn't solve. So I trained the network from scratch without any checkpoint. I set lr=0.01, decay=0.97, and batch=32. After 60000 steps, I got a VOC07 mAP of 0.901 and an even higher VOC 2012 mAP of 0.985. It looks a little strange, but my test dataset is different from my training dataset. Maybe there is some overlap between the two datasets causing overfitting? Or maybe the result is fine, simply because I have a single class and the training steps are enough?
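
On the graph/checkpoint mismatch: when num_classes changes, the box-prediction layers no longer match the checkpoint tensor shapes, and the repo's --checkpoint_exclude_scopes flag is the usual way to skip restoring them. A sketch only; the scope list and --num_classes=2 (person plus background) are assumptions to adapt:

python3 train_ssd_network.py \
    --train_dir=${TRAIN_DIR} \
    --dataset_dir=${DATASET_DIR} \
    --dataset_name=pascalvoc_2007 \
    --dataset_split_name=train \
    --model_name=ssd_300_vgg \
    --num_classes=2 \
    --checkpoint_path=./checkpoints/ssd_300_vgg.ckpt \
    --checkpoint_exclude_scopes=ssd_300_vgg/block4_box,ssd_300_vgg/block7_box,ssd_300_vgg/block8_box,ssd_300_vgg/block9_box,ssd_300_vgg/block10_box,ssd_300_vgg/block11_box \
    --learning_rate=0.001 \
    --batch_size=32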

chengcchn commented 5 years ago

@yupeihua Is it possible for you to share the code you have run? I have spent a lot of time on this code, but still cannot get a good result.

cnuzh commented 5 years ago

> [quotes the full evaluation log from the original post above]

Hi, when I run eval_ssd_network.py, the prompt line "To construct input pipelines, use the tf.data module." does not appear. I would like to ask whether you know where my problem is; in other words, what should I do to make this prompt line appear? Thanks!
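
Those lines are deprecation warnings, so whether they appear depends mainly on the installed TensorFlow version (the tf.data deprecation notices only exist in later 1.x releases), not on this repo's code. A quick way to check the version:

python3 -c "import tensorflow as tf; print(tf.__version__)"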