xpandi-top opened this issue 6 years ago
+1
I started getting this message after upgrading to tensorflow 1.10.0 (from 1.8.0). However, my custom tensorflow code still runs.
same problem here
same here at tensorflow 1.8.0
So, how can we solve this problem?
Me too. Have you solved this problem?
Same issue when running the evaluation script, and the mAP is extremely small
Can anyone help us, please? Has anybody solved this?
I got the same problem, with results just like yours. Have you solved it yet? @xpandi-top @foamliu @prachiAeromana @HongyiDuanmu26 @kemangjaka Has anyone solved it? I need your help. Thanks a lot!
Hi, I'm using Ubuntu 16.04 and tensorflow-gpu 1.10.0 now, and I couldn't reproduce the error. The evaluation worked fine. When I got the error, I used Windows.
What is your environment?
Thanks for the reply! My environment: Ubuntu 18.04 + tensorflow-gpu 1.12.0 + python3.6. I changed my eval_ssd_network.py file and metrics.py file following #321, and ran eval_ssd_network.py successfully, but the result looks like this:
2019-03-07 09:54:28.724070: W ./tensorflow/core/grappler/optimizers/graph_optimizer_stage.h:237] Failed to run optimizer ArithmeticOptimizer, stage HoistCommonFactor. Error: Node average_precision_voc07/ArithmeticOptimizer/HoistCommonFactor_Add_AddN is missing output properties at position :0 (num_outputs=0) AP_VOC07/mAP[3.1303382901083712e-05] AP_VOC12/mAP[1.4904059586232956e-05]
Help me please! Maybe you can give me your eval_ssd_network.py file and metrics.py file if you don't mind. My email: 15215420373@163.com Thanks a lot!! @kemangjaka
OK, let me correct myself. I also got the same warning as you, but I didn't get such a low mAP. What kind of dataset do you use for the evaluation? I don't think the optimizer warning is the problem.
The dataset I used is VOCtest_06-Nov-2007, and the model is VGG_VOC0712_SSD_300x300_iter_120000.ckpt. What is your mAP? Are your dataset and model the same as mine? @kemangjaka
I downloaded VOCtest_06-Nov-2007 dataset, and evaluated with the VGG_VOC0712_SSD_300x300_iter_120000.ckpt model.
So, the command I typed is the following.
python eval_ssd_network.py --eval_dir=./log_2007/ --dataset_dir=./data/ --dataset_name=pascalvoc_2007 --dataset_split_name=test --model_name=ssd_300_vgg --checkpoint_path=./checkpoints/VGG_VOC0712_SSD_300x300_iter_120000.ckpt --batch_size=1
And I got the mAP below.
AP_VOC07/mAP[0.59928033284390148]
AP_VOC12/mAP[0.60921384902021813]
Still quite low but not too low I think.
BTW, I didn't do any modifications to metrics.py
The command I use is the same, as are the dataset and the model. And the environment is not a problem. Could you please send your eval_ssd_network.py file to my email so that I can have a try? @kemangjaka
Well, I only changed the flatten part; nothing else is changed from the original file. https://github.com/balancap/SSD-Tensorflow/issues/321#issuecomment-469188867
Could you try with tensorflow-gpu 1.10.0?
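For reference, here is a minimal sketch of the kind of flatten helper the linked comment describes (the exact change lives in #321; the variable names below are only illustrative). In newer TensorFlow versions, list(names_to_updates.values()) can contain nested lists of update ops, while the evaluation loop expects a flat list:

def flatten(ops):
    # Recursively flatten nested lists/tuples of metric update ops into one flat list.
    flat = []
    for op in ops:
        if isinstance(op, (list, tuple)):
            flat.extend(flatten(op))
        else:
            flat.append(op)
    return flat

# In eval_ssd_network.py the update ops would then be passed as, e.g.:
# eval_op = flatten(list(names_to_updates.values()))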
@kemangjaka @Sulince Hi, have you solved the problem? I got that problem too, and I haven't been able to figure it out for a long time.
2019-03-08 22:41:45.604947: W ./tensorflow/core/grappler/optimizers/graph_optimizer_stage.h:237] Failed to run optimizer ArithmeticOptimizer, stage HoistCommonFactor. Error: Node average_precision_voc07/ArithmeticOptimizer/HoistCommonFactor_Add_AddN is missing output properties at position :0 (num_outputs=0)
AP_VOC07/mAP[0.00010226225830017356] AP_VOC12/mAP[2.127145489434078e-05]
Hi, could you tell me the version of python, tensorflow, OS, and the command you typed? And also, did you modify any code from the original one?
@kemangjaka Just like you said, I only added the flatten function, and my environment is: tf 1.10-gpu, python3.6, redhat4.8.5. I think my environment is OK, because I can run the tutorial example.
And this is my command:
DATASET_DIR=/export/userhome/liqiang/liqiang/Deeplearning/SSD/SSD-datasets/VOC2007/VOCtest_06-Nov-2007/VOCdevkit/VOC2007_tfrecord/
EVAL_DIR=/export/userhome/liqiang/liqiang/Deeplearning/SSD/log_files/log_VOC2007/log_eval/
CHECKPOINT_PATH=/export/userhome/liqiang/liqiang/Deeplearning/SSD/ckpt/SSD_ckpt/VGG_VOC0712_SSD_300x300_iter_120000.ckpt/
CUDA_VISIBLE_DEVICES=3 python /export/userhome/liqiang/liqiang/Deeplearning/SSD/SSD-Tensorflow/eval_ssd_network.py \
    --eval_dir=${EVAL_DIR} \
    --dataset_dir=${DATASET_DIR} \
    --dataset_name=pascalvoc_2007 \
    --dataset_split_name=test \
    --model_name=ssd_300_vgg \
    --checkpoint_path=${CHECKPOINT_PATH} \
    --batch_size=1
@petit-ami Could you post all of your output?
The command is exactly the same as mine. I don't know how to reproduce your results...
@kemangjaka Here it is, please:
WARNING:tensorflow:From /export/userhome/liqiang/liqiang/Deeplearning/SSD/SSD-Tensorflow/eval_ssd_network.py:113: get_or_create_global_step (from tensorflow.contrib.framework.python.ops.variables) is deprecated and will be removed in a future version. Instructions for updating: Please switch to tf.train.get_or_create_global_step
=========================================================================== #
{'anchor_offset': 0.5, 'anchor_ratios': [[2, 0.5], [2, 0.5, 3, 0.3333333333333333], [2, 0.5, 3, 0.3333333333333333], [2, 0.5, 3, 0.3333333333333333], [2, 0.5], [2, 0.5]], 'anchor_size_bounds': [0.15, 0.9], 'anchor_sizes': [(21.0, 45.0), (45.0, 99.0), (99.0, 153.0), (153.0, 207.0), (207.0, 261.0), (261.0, 315.0)], 'anchor_steps': [8, 16, 32, 64, 100, 300], 'feat_layers': ['block4', 'block7', 'block8', 'block9', 'block10', 'block11'], 'feat_shapes': [(38, 38), (19, 19), (10, 10), (5, 5), (3, 3), (1, 1)], 'img_shape': (300, 300), 'no_annotation_label': 21, 'normalizations': [20, -1, -1, -1, -1, -1], 'num_classes': 21, 'prior_scaling': [0.1, 0.1, 0.2, 0.2]}
['/export/userhome/liqiang/liqiang/Deeplearning/SSD/SSD-datasets/VOC2007/VOCtest_06-Nov-2007/VOCdevkit/VOC2007_tfrecord/voc_2007_test_000.tfrecord', '/export/userhome/liqiang/liqiang/Deeplearning/SSD/SSD-datasets/VOC2007/VOCtest_06-Nov-2007/VOCdevkit/VOC2007_tfrecord/voc_2007_test_001.tfrecord', '/export/userhome/liqiang/liqiang/Deeplearning/SSD/SSD-datasets/VOC2007/VOCtest_06-Nov-2007/VOCdevkit/VOC2007_tfrecord/voc_2007_test_002.tfrecord', '/export/userhome/liqiang/liqiang/Deeplearning/SSD/SSD-datasets/VOC2007/VOCtest_06-Nov-2007/VOCdevkit/VOC2007_tfrecord/voc_2007_test_003.tfrecord', '/export/userhome/liqiang/liqiang/Deeplearning/SSD/SSD-datasets/VOC2007/VOCtest_06-Nov-2007/VOCdevkit/VOC2007_tfrecord/voc_2007_test_004.tfrecord', '/export/userhome/liqiang/liqiang/Deeplearning/SSD/SSD-datasets/VOC2007/VOCtest_06-Nov-2007/VOCdevkit/VOC2007_tfrecord/voc_2007_test_005.tfrecord', '/export/userhome/liqiang/liqiang/Deeplearning/SSD/SSD-datasets/VOC2007/VOCtest_06-Nov-2007/VOCdevkit/VOC2007_tfrecord/voc_2007_test_006.tfrecord', '/export/userhome/liqiang/liqiang/Deeplearning/SSD/SSD-datasets/VOC2007/VOCtest_06-Nov-2007/VOCdevkit/VOC2007_tfrecord/voc_2007_test_007.tfrecord', '/export/userhome/liqiang/liqiang/Deeplearning/SSD/SSD-datasets/VOC2007/VOCtest_06-Nov-2007/VOCdevkit/VOC2007_tfrecord/voc_2007_test_008.tfrecord', '/export/userhome/liqiang/liqiang/Deeplearning/SSD/SSD-datasets/VOC2007/VOCtest_06-Nov-2007/VOCdevkit/VOC2007_tfrecord/voc_2007_test_009.tfrecord', '/export/userhome/liqiang/liqiang/Deeplearning/SSD/SSD-datasets/VOC2007/VOCtest_06-Nov-2007/VOCdevkit/VOC2007_tfrecord/voc_2007_test_010.tfrecord', '/export/userhome/liqiang/liqiang/Deeplearning/SSD/SSD-datasets/VOC2007/VOCtest_06-Nov-2007/VOCdevkit/VOC2007_tfrecord/voc_2007_test_011.tfrecord', '/export/userhome/liqiang/liqiang/Deeplearning/SSD/SSD-datasets/VOC2007/VOCtest_06-Nov-2007/VOCdevkit/VOC2007_tfrecord/voc_2007_test_012.tfrecord', '/export/userhome/liqiang/liqiang/Deeplearning/SSD/SSD-datasets/VOC2007/VOCtest_06-Nov-2007/VOCdevkit/VOC2007_tfrecord/voc_2007_test_013.tfrecord', '/export/userhome/liqiang/liqiang/Deeplearning/SSD/SSD-datasets/VOC2007/VOCtest_06-Nov-2007/VOCdevkit/VOC2007_tfrecord/voc_2007_test_014.tfrecord', '/export/userhome/liqiang/liqiang/Deeplearning/SSD/SSD-datasets/VOC2007/VOCtest_06-Nov-2007/VOCdevkit/VOC2007_tfrecord/voc_2007_test_015.tfrecord', '/export/userhome/liqiang/liqiang/Deeplearning/SSD/SSD-datasets/VOC2007/VOCtest_06-Nov-2007/VOCdevkit/VOC2007_tfrecord/voc_2007_test_016.tfrecord', '/export/userhome/liqiang/liqiang/Deeplearning/SSD/SSD-datasets/VOC2007/VOCtest_06-Nov-2007/VOCdevkit/VOC2007_tfrecord/voc_2007_test_017.tfrecord', '/export/userhome/liqiang/liqiang/Deeplearning/SSD/SSD-datasets/VOC2007/VOCtest_06-Nov-2007/VOCdevkit/VOC2007_tfrecord/voc_2007_test_018.tfrecord', '/export/userhome/liqiang/liqiang/Deeplearning/SSD/SSD-datasets/VOC2007/VOCtest_06-Nov-2007/VOCdevkit/VOC2007_tfrecord/voc_2007_test_019.tfrecord', '/export/userhome/liqiang/liqiang/Deeplearning/SSD/SSD-datasets/VOC2007/VOCtest_06-Nov-2007/VOCdevkit/VOC2007_tfrecord/voc_2007_test_020.tfrecord', '/export/userhome/liqiang/liqiang/Deeplearning/SSD/SSD-datasets/VOC2007/VOCtest_06-Nov-2007/VOCdevkit/VOC2007_tfrecord/voc_2007_test_021.tfrecord', '/export/userhome/liqiang/liqiang/Deeplearning/SSD/SSD-datasets/VOC2007/VOCtest_06-Nov-2007/VOCdevkit/VOC2007_tfrecord/voc_2007_test_022.tfrecord', '/export/userhome/liqiang/liqiang/Deeplearning/SSD/SSD-datasets/VOC2007/VOCtest_06-Nov-2007/VOCdevkit/VOC2007_tfrecord/voc_2007_test_023.tfrecord', 
'/export/userhome/liqiang/liqiang/Deeplearning/SSD/SSD-datasets/VOC2007/VOCtest_06-Nov-2007/VOCdevkit/VOC2007_tfrecord/voc_2007_test_024.tfrecord']
WARNING:tensorflow:From /export/userhome/liqiang/liqiang/Deeplearning/SSD/SSD-Tensorflow/eval_ssd_network.py:226: streaming_mean (from tensorflow.contrib.metrics.python.ops.metric_ops) is deprecated and will be removed in a future version. Instructions for updating: Please switch to tf.metrics.mean
INFO:tensorflow:Evaluating None
INFO:tensorflow:Starting evaluation at 2019-03-08-14:29:14
INFO:tensorflow:Graph was finalized.
2019-03-08 22:29:14.520042: I tensorflow/core/platform/cpu_feature_guard.cc:141] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX2 FMA
2019-03-08 22:29:14.971772: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1405] Found device 0 with properties: name: TITAN X (Pascal) major: 6 minor: 1 memoryClockRate(GHz): 1.531 pciBusID: 0000:84:00.0 totalMemory: 11.90GiB freeMemory: 4.26GiB
2019-03-08 22:29:14.971931: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1484] Adding visible gpu devices: 0
2019-03-08 22:29:21.058607: I tensorflow/core/common_runtime/gpu/gpu_device.cc:965] Device interconnect StreamExecutor with strength 1 edge matrix:
2019-03-08 22:29:21.058689: I tensorflow/core/common_runtime/gpu/gpu_device.cc:971] 0
2019-03-08 22:29:21.058710: I tensorflow/core/common_runtime/gpu/gpu_device.cc:984] 0: N
2019-03-08 22:29:21.072137: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1097] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 1218 MB memory) -> physical GPU (device: 0, name: TITAN X (Pascal), pci bus id: 0000:84:00.0, compute capability: 6.1)
INFO:tensorflow:Running local_init_op.
INFO:tensorflow:Done running local_init_op.
2019-03-08 22:29:45.913166: W tensorflow/core/common_runtime/bfc_allocator.cc:219] Allocator (GPU_0_bfc) ran out of memory trying to allocate 1.54GiB. The caller indicates that this is not a failure, but may mean that there could be performance gains if more memory were available.
2019-03-08 22:29:46.126467: W tensorflow/core/common_runtime/bfc_allocator.cc:219] Allocator (GPU_0_bfc) ran out of memory trying to allocate 1.07GiB. The caller indicates that this is not a failure, but may mean that there could be performance gains if more memory were available.
2019-03-08 22:29:46.147991: W tensorflow/core/common_runtime/bfc_allocator.cc:219] Allocator (GPU_0_bfc) ran out of memory trying to allocate 2.14GiB. The caller indicates that this is not a failure, but may mean that there could be performance gains if more memory were available.
2019-03-08 22:29:46.211770: W tensorflow/core/common_runtime/bfc_allocator.cc:219] Allocator (GPU_0_bfc) ran out of memory trying to allocate 1.37GiB. The caller indicates that this is not a failure, but may mean that there could be performance gains if more memory were available.
2019-03-08 22:29:46.239826: W tensorflow/core/common_runtime/bfc_allocator.cc:219] Allocator (GPU_0_bfc) ran out of memory trying to allocate 1.18GiB. The caller indicates that this is not a failure, but may mean that there could be performance gains if more memory were available.
2019-03-08 22:29:46.243550: W tensorflow/core/common_runtime/bfc_allocator.cc:219] Allocator (GPU_0_bfc) ran out of memory trying to allocate 2.35GiB. The caller indicates that this is not a failure, but may mean that there could be performance gains if more memory were available.
2019-03-08 22:29:46.333402: W tensorflow/core/common_runtime/bfc_allocator.cc:219] Allocator (GPU_0_bfc) ran out of memory trying to allocate 2.13GiB. The caller indicates that this is not a failure, but may mean that there could be performance gains if more memory were available.
INFO:tensorflow:Evaluation [495/4952]
INFO:tensorflow:Evaluation [990/4952]
INFO:tensorflow:Evaluation [1485/4952]
INFO:tensorflow:Evaluation [1980/4952]
INFO:tensorflow:Evaluation [2475/4952]
INFO:tensorflow:Evaluation [2970/4952]
INFO:tensorflow:Evaluation [3465/4952]
INFO:tensorflow:Evaluation [3960/4952]
INFO:tensorflow:Evaluation [4455/4952]
INFO:tensorflow:Evaluation [4950/4952]
INFO:tensorflow:Evaluation [4952/4952]
2019-03-08 22:41:45.604947: W ./tensorflow/core/grappler/optimizers/graph_optimizer_stage.h:237] Failed to run optimizer ArithmeticOptimizer, stage HoistCommonFactor. Error: Node average_precision_voc07/ArithmeticOptimizer/HoistCommonFactor_Add_AddN is missing output properties at position :0 (num_outputs=0)
AP_VOC07/mAP[0.00010226225830017356]
AP_VOC12/mAP[2.127145489434078e-05]
INFO:tensorflow:Finished evaluation at 2019-03-08-14:43:15
Time spent : 841.545 seconds.
Time spent per BATCH: 0.170 seconds.
I found it. In your log, it says,
INFO:tensorflow:Evaluating None
That means the trained checkpoint was not loaded properly, so your evaluation ran with a randomly initialized network. Is the checkpoint path correct?
@Sulince maybe your problem is exactly the same as this one.
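If in doubt, one quick sanity check (a minimal sketch, assuming TensorFlow 1.x; the path is just an example of whatever you pass to --checkpoint_path) is to open the checkpoint directly. A wrong path fails immediately instead of silently evaluating random weights:

import tensorflow as tf

# Example prefix; use the value you pass to --checkpoint_path.
ckpt_prefix = "./checkpoints/VGG_VOC0712_SSD_300x300_iter_120000.ckpt"

# Raises an error if the prefix does not resolve to checkpoint files.
reader = tf.train.NewCheckpointReader(ckpt_prefix)
# Print a few variable names as confirmation that the weights can be read.
print(sorted(reader.get_variable_to_shape_map().keys())[:5])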
WARNING:tensorflow:From /export/userhome/liqiang/liqiang/Deeplearning/SSD/SSD-Tensorflow/eval_ssd_network.py:226: streaming_mean (from tensorflow.contrib.metrics.python.ops.metric_ops) is deprecated and will be removed in a future version. Instructions for updating: Please switch to tf.metrics.mean
INFO:tensorflow:Evaluating /export/userhome/liqiang/liqiang/Deeplearning/SSD/ckpt/SSD_ckpt/VGG_VOC0712_SSD_300x300_iter_120000.ckpt/VGG_VOC0712_SSD_300x300_iter_120000.ckpt
INFO:tensorflow:Starting evaluation at 2019-03-08-15:25:58
INFO:tensorflow:Graph was finalized.
2019-03-08 23:25:58.450231: I tensorflow/core/platform/cpu_feature_guard.cc:141] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX2 FMA
2019-03-08 23:25:58.889497: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1405] Found device 0 with properties: name: TITAN X (Pascal) major: 6 minor: 1 memoryClockRate(GHz): 1.531 pciBusID: 0000:84:00.0 totalMemory: 11.90GiB freeMemory: 4.26GiB
2019-03-08 23:25:58.889608: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1484] Adding visible gpu devices: 0
2019-03-08 23:26:13.981284: I tensorflow/core/common_runtime/gpu/gpu_device.cc:965] Device interconnect StreamExecutor with strength 1 edge matrix:
2019-03-08 23:26:13.981353: I tensorflow/core/common_runtime/gpu/gpu_device.cc:971] 0
2019-03-08 23:26:13.981373: I tensorflow/core/common_runtime/gpu/gpu_device.cc:984] 0: N
2019-03-08 23:26:14.010918: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1097] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 1218 MB memory) -> physical GPU (device: 0, name: TITAN X (Pascal), pci bus id: 0000:84:00.0, compute capability: 6.1)
INFO:tensorflow:Restoring parameters from /export/userhome/liqiang/liqiang/Deeplearning/SSD/ckpt/SSD_ckpt/VGG_VOC0712_SSD_300x300_iter_120000.ckpt/VGG_VOC0712_SSD_300x300_iter_120000.ckpt
INFO:tensorflow:Running local_init_op.
INFO:tensorflow:Done running local_init_op.
2019-03-08 23:26:38.264869: W tensorflow/core/common_runtime/bfc_allocator.cc:219] Allocator (GPU_0_bfc) ran out of memory trying to allocate 1.54GiB. The caller indicates that this is not a failure, but may mean that there could be performance gains if more memory were available.
2019-03-08 23:26:38.448399: W tensorflow/core/common_runtime/bfc_allocator.cc:219] Allocator (GPU_0_bfc) ran out of memory trying to allocate 1.07GiB. The caller indicates that this is not a failure, but may mean that there could be performance gains if more memory were available.
2019-03-08 23:26:38.469665: W tensorflow/core/common_runtime/bfc_allocator.cc:219] Allocator (GPU_0_bfc) ran out of memory trying to allocate 2.14GiB. The caller indicates that this is not a failure, but may mean that there could be performance gains if more memory were available.
2019-03-08 23:26:38.472249: W tensorflow/core/common_runtime/bfc_allocator.cc:219] Allocator (GPU_0_bfc) ran out of memory trying to allocate 1.07GiB. The caller indicates that this is not a failure, but may mean that there could be performance gains if more memory were available.
2019-03-08 23:26:38.519175: W tensorflow/core/common_runtime/bfc_allocator.cc:219] Allocator (GPU_0_bfc) ran out of memory trying to allocate 1.37GiB. The caller indicates that this is not a failure, but may mean that there could be performance gains if more memory were available.
2019-03-08 23:26:38.572346: W tensorflow/core/common_runtime/bfc_allocator.cc:219] Allocator (GPU_0_bfc) ran out of memory trying to allocate 1.18GiB. The caller indicates that this is not a failure, but may mean that there could be performance gains if more memory were available.
2019-03-08 23:26:38.575961: W tensorflow/core/common_runtime/bfc_allocator.cc:219] Allocator (GPU_0_bfc) ran out of memory trying to allocate 2.35GiB. The caller indicates that this is not a failure, but may mean that there could be performance gains if more memory were available.
2019-03-08 23:26:38.630726: W tensorflow/core/common_runtime/bfc_allocator.cc:219] Allocator (GPU_0_bfc) ran out of memory trying to allocate 1.06GiB. The caller indicates that this is not a failure, but may mean that there could be performance gains if more memory were available.
2019-03-08 23:26:38.652513: W tensorflow/core/common_runtime/bfc_allocator.cc:219] Allocator (GPU_0_bfc) ran out of memory trying to allocate 2.13GiB. The caller indicates that this is not a failure, but may mean that there could be performance gains if more memory were available.
INFO:tensorflow:Evaluation [495/4952]
INFO:tensorflow:Evaluation [990/4952]
INFO:tensorflow:Evaluation [1485/4952]
INFO:tensorflow:Evaluation [1980/4952]
INFO:tensorflow:Evaluation [2475/4952]
INFO:tensorflow:Evaluation [2970/4952]
INFO:tensorflow:Evaluation [3465/4952]
INFO:tensorflow:Evaluation [3960/4952]
INFO:tensorflow:Evaluation [4455/4952]
INFO:tensorflow:Evaluation [4950/4952]
INFO:tensorflow:Evaluation [4952/4952]
2019-03-08 23:34:59.828743: W ./tensorflow/core/grappler/optimizers/graph_optimizer_stage.h:237] Failed to run optimizer ArithmeticOptimizer, stage HoistCommonFactor. Error: Node average_precision_voc07/ArithmeticOptimizer/HoistCommonFactor_Add_AddN is missing output properties at position :0 (num_outputs=0)
AP_VOC07/mAP[0.59928033284390148]
AP_VOC12/mAP[0.60921384904878606]
INFO:tensorflow:Finished evaluation at 2019-03-08-15:35:12
Time spent : 554.773 seconds.
Time spent per BATCH: 0.112 seconds.
Thank you!!!! That's it, I need to append another .ckpt (the checkpoint prefix) to the path. Thanks so much.
I have solved this problem. The solution is just as kemangjaka said: "INFO:tensorflow:Evaluating None". Just switch to another model file and it will work. I think the model named VGG_VOC0712_SSD_300x300_iter_120000.ckpt in the repository has something wrong with it, so do not use it and find another one. @kemangjaka @petit-ami
By the way, have you trained the model successfully? When I train the model on the VOC07+12 dataset, my loss is high and oscillates, as follows:
INFO:tensorflow:Recording summary at step 62230.
INFO:tensorflow:global step 62240: loss = 40.2912 (0.496 sec/step)
INFO:tensorflow:global step 62250: loss = 40.6664 (0.493 sec/step)
INFO:tensorflow:global step 62260: loss = 40.5154 (0.502 sec/step)
INFO:tensorflow:global step 62270: loss = 23.9944 (0.487 sec/step)
INFO:tensorflow:global step 62280: loss = 21.0998 (0.501 sec/step)
INFO:tensorflow:global step 62290: loss = 39.5273 (0.505 sec/step)
INFO:tensorflow:global step 62300: loss = 28.9741 (0.522 sec/step)
INFO:tensorflow:global step 62310: loss = 33.9893 (0.504 sec/step)
INFO:tensorflow:global step 62320: loss = 31.2430 (0.517 sec/step)
INFO:tensorflow:global step 62330: loss = 50.1789 (0.500 sec/step)
INFO:tensorflow:global step 62340: loss = 16.4918 (0.493 sec/step)
Here are my parameters:
DATASET_DIR=/home/sulince/SSD_tensorflow/VOC0713/tfrecords/
TRAIN_DIR=/home/sulince/SSD_tensorflow/train_model/
CHECKPOINT_PATH=/home/sulince/SSD_tensorflow/checkpoints/vgg_16.ckpt
python3 /home/sulince/SSD_tensorflow/train_ssd_network.py \
    --train_dir=${TRAIN_DIR} \
    --dataset_dir=${DATASET_DIR} \
    --dataset_name=pascalvoc_2007 \
    --dataset_split_name=train \
    --model_name=ssd_300_vgg \
    --checkpoint_path=${CHECKPOINT_PATH} \
    --checkpoint_model_scope=vgg_16 \
    --checkpoint_exclude_scopes=ssd_300_vgg/conv6,ssd_300_vgg/conv7,ssd_300_vgg/block8,ssd_300_vgg/block9,ssd_300_vgg/block10,ssd_300_vgg/block11,ssd_300_vgg/block4_box,ssd_300_vgg/block7_box,ssd_300_vgg/block8_box,ssd_300_vgg/block9_box,ssd_300_vgg/block10_box,ssd_300_vgg/block11_box \
    --trainable_scopes=ssd_300_vgg/conv6,ssd_300_vgg/conv7,ssd_300_vgg/block8,ssd_300_vgg/block9,ssd_300_vgg/block10,ssd_300_vgg/block11,ssd_300_vgg/block4_box,ssd_300_vgg/block7_box,ssd_300_vgg/block8_box,ssd_300_vgg/block9_box,ssd_300_vgg/block10_box,ssd_300_vgg/block11_box \
    --save_summaries_secs=60 \
    --save_interval_secs=600 \
    --weight_decay=0.0005 \
    --optimizer=adam \
    --learning_rate=0.001 \
    --learning_rate_decay_factor=0.94 \
    --batch_size=16 \
    --gpu_memory_fraction=0.9
What is your loss? @kemangjaka @petit-ami
@Sulince I was using VGG_VOC0712_SSD_300x300_iter_120000.ckpt yesterday, and it worked.
AP_VOC07/mAP[0.59928033284390148] AP_VOC12/mAP[0.60921384904878606]
And today I ran another one, named VGG_VOC0712_SSD_300x300_ft_iter_120000.ckpt; it also worked, but with a higher mAP. @kemangjaka
AP_VOC07/mAP[0.74313215403145927] AP_VOC12/mAP[0.76659716498723329]
I am fine-tuning the existing SSD checkpoint VGG_VOC0712_SSD_300x300_ft_iter_120000.ckpt, but the loss cannot converge; it oscillates around 100.
DATASET_DIR=/export/userhome/liqiang/liqiang/Deeplearning/SSD/SSD-datasets/VOC2012/VOCtrainval_11-May-2012/VOCdevkit/VOC2012_tfrecord/
TRAIN_DIR=/export/userhome/liqiang/liqiang/Deeplearning/SSD/log_files/log_finetune_2012/
CHECKPOINT_PATH=/export/userhome/liqiang/liqiang/Deeplearning/SSD/log_files/log_finetune_2012/model.ckpt-40000
CUDA_VISIBLE_DEVICES=2 python /export/userhome/liqiang/liqiang/Deeplearning/SSD/SSD-Tensorflow/train_ssd_network.py \
    --train_dir=${TRAIN_DIR} \
    --dataset_dir=${DATASET_DIR} \
    --dataset_name=pascalvoc_2012 \
    --dataset_split_name=train \
    --model_name=ssd_300_vgg \
    --CHECKPOINT_PATH=${CHECKPOINT_PATH} \
    --save_summaries_secs=60 \
    --save_interval_secs=600 \
    --weight_deacy=0.05 \
    --optimizer=adam \
    --learning_rate=0.00000005 \
    --batch_size=32
@Sulince And I remember that the last time I trained VGG16, I also got a similar result to yours. I am working on solving it.
WARNING:tensorflow:From /usr/local/lib/python3.5/dist-packages/tensorflow/python/training/input.py:187: QueueRunner.init (from tensorflow.python.training.queue_runner_impl) is deprecated and will be removed in a future version.
Instructions for updating:
To construct input pipelines, use the tf.data
module.
WARNING:tensorflow:From /usr/local/lib/python3.5/dist-packages/tensorflow/python/training/input.py:187: add_queue_runner (from tensorflow.python.training.queue_runner_impl) is deprecated and will be removed in a future version.
Instructions for updating:
To construct input pipelines, use the tf.data
module.
WARNING:tensorflow:From eval_ssd_network.py:231: streaming_mean (from tensorflow.contrib.metrics.python.ops.metric_ops) is deprecated and will be removed in a future version.
Instructions for updating:
Please switch to tf.metrics.mean
INFO:tensorflow:Evaluating ./aug_ckout/model.ckpt-8415
INFO:tensorflow:Starting evaluation at 2019-03-19-05:37:03
INFO:tensorflow:Graph was finalized.
2019-03-19 13:37:04.015866: I tensorflow/core/platform/cpu_feature_guard.cc:141] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX2 FMA
2019-03-19 13:37:04.093802: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:964] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2019-03-19 13:37:04.094138: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1411] Found device 0 with properties:
name: GeForce GTX 1060 major: 6 minor: 1 memoryClockRate(GHz): 1.6705
pciBusID: 0000:01:00.0
totalMemory: 5.93GiB freeMemory: 5.35GiB
2019-03-19 13:37:04.094152: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1490] Adding visible gpu devices: 0
2019-03-19 13:37:04.282793: I tensorflow/core/common_runtime/gpu/gpu_device.cc:971] Device interconnect StreamExecutor with strength 1 edge matrix:
2019-03-19 13:37:04.282824: I tensorflow/core/common_runtime/gpu/gpu_device.cc:977] 0
2019-03-19 13:37:04.282830: I tensorflow/core/common_runtime/gpu/gpu_device.cc:990] 0: N
2019-03-19 13:37:04.282990: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1103] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 607 MB memory) -> physical GPU (device: 0, name: GeForce GTX 1060, pci bus id: 0000:01:00.0, compute capability: 6.1)
INFO:tensorflow:Restoring parameters from ./aug_ckout/model.ckpt-8415
INFO:tensorflow:Running local_init_op.
INFO:tensorflow:Done running local_init_op.
WARNING:tensorflow:From /usr/local/lib/python3.5/dist-packages/tensorflow/python/training/monitored_session.py:804: start_queue_runners (from tensorflow.python.training.queue_runner_impl) is deprecated and will be removed in a future version.
Instructions for updating:
To construct input pipelines, use the tf.data
module.
2019-03-19 13:37:07.813959: W tensorflow/core/common_runtime/bfc_allocator.cc:215] Allocator (GPU_0_bfc) ran out of memory trying to allocate 828.12MiB. The caller indicates that this is not a failure, but may mean that there could be performance gains if more memory were available.
2019-03-19 13:37:07.857173: W tensorflow/core/common_runtime/bfc_allocator.cc:215] Allocator (GPU_0_bfc) ran out of memory trying to allocate 1.54GiB. The caller indicates that this is not a failure, but may mean that there could be performance gains if more memory were available.
2019-03-19 13:37:07.876847: W tensorflow/core/common_runtime/bfc_allocator.cc:215] Allocator (GPU_0_bfc) ran out of memory trying to allocate 610.31MiB. The caller indicates that this is not a failure, but may mean that there could be performance gains if more memory were available.
2019-03-19 13:37:07.909925: W tensorflow/core/common_runtime/bfc_allocator.cc:215] Allocator (GPU_0_bfc) ran out of memory trying to allocate 814.50MiB. The caller indicates that this is not a failure, but may mean that there could be performance gains if more memory were available.
2019-03-19 13:37:07.998780: W tensorflow/core/common_runtime/bfc_allocator.cc:215] Allocator (GPU_0_bfc) ran out of memory trying to allocate 550.42MiB. The caller indicates that this is not a failure, but may mean that there could be performance gains if more memory were available.
2019-03-19 13:37:08.018300: W tensorflow/core/common_runtime/bfc_allocator.cc:215] Allocator (GPU_0_bfc) ran out of memory trying to allocate 1.07GiB. The caller indicates that this is not a failure, but may mean that there could be performance gains if more memory were available.
2019-03-19 13:37:08.042239: W tensorflow/core/common_runtime/bfc_allocator.cc:215] Allocator (GPU_0_bfc) ran out of memory trying to allocate 2.14GiB. The caller indicates that this is not a failure, but may mean that there could be performance gains if more memory were available.
2019-03-19 13:37:08.043062: W tensorflow/core/common_runtime/bfc_allocator.cc:215] Allocator (GPU_0_bfc) ran out of memory trying to allocate 1.07GiB. The caller indicates that this is not a failure, but may mean that there could be performance gains if more memory were available.
2019-03-19 13:37:08.102594: W tensorflow/core/common_runtime/bfc_allocator.cc:215] Allocator (GPU_0_bfc) ran out of memory trying to allocate 1.18GiB. The caller indicates that this is not a failure, but may mean that there could be performance gains if more memory were available.
2019-03-19 13:37:08.106842: W tensorflow/core/common_runtime/bfc_allocator.cc:215] Allocator (GPU_0_bfc) ran out of memory trying to allocate 2.35GiB. The caller indicates that this is not a failure, but may mean that there could be performance gains if more memory were available.
INFO:tensorflow:Evaluation [195/1952]
INFO:tensorflow:Evaluation [390/1952]
INFO:tensorflow:Evaluation [585/1952]
INFO:tensorflow:Evaluation [780/1952]
INFO:tensorflow:Evaluation [975/1952]
INFO:tensorflow:Evaluation [1170/1952]
INFO:tensorflow:Evaluation [1365/1952]
INFO:tensorflow:Evaluation [1560/1952]
Traceback (most recent call last):
File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/client/session.py", line 1292, in _do_call
return fn(*args)
File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/client/session.py", line 1277, in _run_fn
options, feed_dict, fetch_list, target_list, run_metadata)
File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/client/session.py", line 1367, in _call_tf_sessionrun
run_metadata)
tensorflow.python.framework.errors_impl.InvalidArgumentError: Reduction axis 0 is empty in shape [0]
[[{{node bboxes_matching_batch_dict/bboxes_matching_batch_4/map/while/bboxes_matching_single/while/ArgMax}} = ArgMax[T=DT_FLOAT, Tidx=DT_INT32, output_type=DT_INT64, _device="/job:localhost/replica:0/task:0/device:CPU:0"](bboxes_matching_batch_dict/bboxes_matching_batch_4/map/while/bboxes_matching_single/while/mul, bboxes_matching_batch_dict/bboxes_matching_batch_4/map/while/bboxes_matching_single/while/bboxes_jaccard/transpose_1/Range/start)]]
[[{{node ssd_losses/cross_entropy_pos/value/_524}} = _SendT=DT_FLOAT, client_terminated=false, recv_device="/job:localhost/replica:0/task:0/device:CPU:0", send_device="/job:localhost/replica:0/task:0/device:GPU:0", send_device_incarnation=1, tensor_name="edge_3865_ssd_losses/cross_entropy_pos/value", _device="/job:localhost/replica:0/task:0/device:GPU:0"]]
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "eval_ssd_network.py", line 361, in
Caused by op 'bboxes_matching_batch_dict/bboxes_matching_batch_4/map/while/bboxes_matching_single/while/ArgMax', defined at:
File "eval_ssd_network.py", line 361, in
InvalidArgumentError (see above for traceback): Reduction axis 0 is empty in shape [0] [[{{node bboxes_matching_batch_dict/bboxes_matching_batch_4/map/while/bboxes_matching_single/while/ArgMax}} = ArgMax[T=DT_FLOAT, Tidx=DT_INT32, output_type=DT_INT64, _device="/job:localhost/replica:0/task:0/device:CPU:0"](bboxes_matching_batch_dict/bboxes_matching_batch_4/map/while/bboxes_matching_single/while/mul, bboxes_matching_batch_dict/bboxes_matching_batch_4/map/while/bboxes_matching_single/while/bboxes_jaccard/transpose_1/Range/start)]] [[{{node ssd_losses/cross_entropy_pos/value/_524}} = _SendT=DT_FLOAT, client_terminated=false, recv_device="/job:localhost/replica:0/task:0/device:CPU:0", send_device="/job:localhost/replica:0/task:0/device:GPU:0", send_device_incarnation=1, tensor_name="edge_3865_ssd_losses/cross_entropy_pos/value", _device="/job:localhost/replica:0/task:0/device:GPU:0"]]
The mAP is so low? Can anyone help me? Thank you!!
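The InvalidArgumentError above ("Reduction axis 0 is empty in shape [0]") comes from tf.argmax being run on an empty tensor inside bboxes_matching, which typically happens when there is nothing to match for a class in an image. Purely as an illustration of the kind of guard that avoids this (a sketch, not the repository's actual fix):

import tensorflow as tf

def safe_argmax(scores):
    # Return argmax over axis 0, or a sentinel 0 when `scores` is empty,
    # so "Reduction axis 0 is empty in shape [0]" cannot be raised.
    return tf.cond(tf.size(scores) > 0,
                   lambda: tf.argmax(scores, axis=0),
                   lambda: tf.constant(0, dtype=tf.int64))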
@petit-ami I saw that you were training and testing on PASCAL VOC 2012. I am training and evaluating on PASCAL VOC 2012 right now.
I trained on the trainval of 2012 (17125 items) and tested on the test of 2012 (5138 items), but the mAP is about 0.038 whether or not the training was based on ssd_300_vgg.ckpt. Did you get a decent mAP? Could you give me some help, please?
I also trained on 07+12 (trainval of 2007 + trainval of 2012) and 07++12 (trainval & test of 2007 and trainval of 2012); the results are almost the same.
Sincerely
I found it. In your log, it says,
INFO:tensorflow:Evaluating None
That means the trained checkpoint was not loaded properly, so your evaluation ran with a randomly initialized network. Is the checkpoint path correct?
@Sulince maybe your problem is exactly the same as this one.
Thanks, I solved my problem your way.
@SunNYNO1 Did you do the evaluation on Pascal VOC 2012?
I evaluated using ssd_300_vgg.ckpt
and VGG_VOC0712_SSD_300x300_iter_120000.ckpt
on the Pascal VOC 2012 dataset, but I got the results below:
Could you give me some help, please?
Sincerely
zhengjixing@amax1:~/SSD-Tensorflow-master$ python eval_ssd_network.py --eval_dir=./logs --dataset_dir=./test --datset_name=pascalvoc_2007 --dataset_split_name=test --model_name=ssd_300_vgg --checkpoint_path=./checkpoints/ssd_300_vgg.ckpt --batch_size=1
/home/fancy/program/anaconda2/lib/python2.7/site-packages/h5py/__init__.py:36: FutureWarning: Conversion of the second argument of issubdtype from `float` to `np.floating` is deprecated. In future, it will be treated as `np.float64 == np.dtype(float).type`.
from ._conv import register_converters as _register_converters
WARNING:tensorflow:From eval_ssd_network.py:113: get_or_create_global_step (from tensorflow.contrib.framework.python.ops.variables) is deprecated and will be removed in a future version.
Instructions for updating:
Please switch to tf.train.get_or_create_global_step
Traceback (most recent call last):
File "eval_ssd_network.py", line 346, in
How can I solve this problem? @petit-ami
When I put the two files of VGG_VOC0712_SSD_300x300_iter_120000.ckpt into the directory './checkpoints/VGG_VOC0712_SSD_300x300_iter_120000.ckpt', the result is: AP_VOC07/mAP[0.67390402123709192] AP_VOC12/mAP[0.69139019683779168]. When I do not put the two files into that directory, the result is: AP_VOC07/mAP[0.00010226225830017356] AP_VOC12/mAP[2.127145489434078e-05].
So you should check how you unzipped VGG_VOC0712_SSD_300x300_iter_120000.ckpt, but I do not know the reason.
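In other words, it seems the weights are only restored when the directory ./checkpoints/VGG_VOC0712_SSD_300x300_iter_120000.ckpt actually contains the two extracted checkpoint files (typically a .index and a .data-00000-of-00001 file) and --checkpoint_path resolves to the checkpoint prefix. A small sketch of a pre-flight check (paths as used earlier in this thread; names only illustrative):

import os

ckpt_dir = "./checkpoints/VGG_VOC0712_SSD_300x300_iter_120000.ckpt"
ckpt_prefix = os.path.join(ckpt_dir, "VGG_VOC0712_SSD_300x300_iter_120000.ckpt")

# Weights can only be restored when the prefix resolves to real checkpoint files.
if not os.path.exists(ckpt_prefix + ".index"):
    raise RuntimeError("Checkpoint files missing; re-extract the archive into " + ckpt_dir)
print("--checkpoint_path=" + ckpt_prefix)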
I found it. In your log, it says,
INFO:tensorflow:Evaluating None
That means the trained checkpoint was not loaded properly, so your evaluation ran with a randomly initialized network. Is the checkpoint path correct?
@Sulince maybe your problem is exactly the same as this one.
Thank you!
@jixingzheng, follow the steps below.
Thanks for the reply! My environment: Ubuntu 18.04 + tensorflow-gpu 1.12.0 + python3.6. I changed my eval_ssd_network.py file and metrics.py file following #321, and ran eval_ssd_network.py successfully, but the result looks like this:
2019-03-07 09:54:28.724070: W ./tensorflow/core/grappler/optimizers/graph_optimizer_stage.h:237] Failed to run optimizer ArithmeticOptimizer, stage HoistCommonFactor. Error: Node average_precision_voc07/ArithmeticOptimizer/HoistCommonFactor_Add_AddN is missing output properties at position :0 (num_outputs=0) AP_VOC07/mAP[3.1303382901083712e-05] AP_VOC12/mAP[1.4904059586232956e-05]
Help me please! Maybe you can give me your eval_ssd_network.py file and metrics.py file if you don't mind. My email: 15215420373@163.com Thanks a lot!! @kemangjaka
Can you tell me your environment: CUDA + TensorFlow + Python versions?
When running eval_ssd_network.py, I get this:
2018-07-24 18:48:02.126672: W ./tensorflow/core/grappler/optimizers/graph_optimizer_stage.h:233] Failed to run optimizer ArithmeticOptimizer, stage HoistCommonFactor. Error: Node average_precision_voc07/ArithmeticOptimizer/HoistCommonFactor_Add_AddN is missing output properties at position :0 (num_outputs=0) AP_VOC07/mAP[0.20843321784099034] AP_VOC12/mAP[0.20235189944609927]