endernewton / tf-faster-rcnn

Tensorflow Faster RCNN for Object Detection
https://arxiv.org/pdf/1702.02138.pdf
MIT License

InvalidArgumentError (see above for traceback): Nan in summary histogram for: TRAIN/vgg16_default/conv3_1/weight #8

Open monajalal opened 7 years ago

monajalal commented 7 years ago

I followed your instructions and got this error. Can you suggest a solution?

mona@pascal:~/computer_vision/tf-faster-rcnn$ GPU_ID=0
mona@pascal:~/computer_vision/tf-faster-rcnn$ ./experiments/scripts/vgg16.sh $GPU_ID pascal_voc
+ set -e
+ export PYTHONUNBUFFERED=True
+ PYTHONUNBUFFERED=True
+ GPU_ID=0
+ DATASET=pascal_voc
+ array=($@)
+ len=2
+ EXTRA_ARGS=
+ EXTRA_ARGS_SLUG=
+ case ${DATASET} in
+ TRAIN_IMDB=voc_2007_trainval
+ TEST_IMDB=voc_2007_test
+ STEPSIZE=50000
+ ITERS=70000
++ date +%Y-%m-%d_%H-%M-%S
+ LOG=experiments/logs/vgg16_voc_2007_trainval__vgg16.txt.2017-02-14_22-08-43
+ exec
++ tee -a experiments/logs/vgg16_voc_2007_trainval__vgg16.txt.2017-02-14_22-08-43
tee: experiments/logs/vgg16_voc_2007_trainval__vgg16.txt.2017-02-14_22-08-43: No such file or directory
+ echo Logging output to experiments/logs/vgg16_voc_2007_trainval__vgg16.txt.2017-02-14_22-08-43
Logging output to experiments/logs/vgg16_voc_2007_trainval__vgg16.txt.2017-02-14_22-08-43
+ set +x
+ '[' '!' -f output/vgg16/voc_2007_trainval/default/vgg16_faster_rcnn_iter_70000.ckpt.index ']'
+ [[ ! -z '' ]]
+ CUDA_VISIBLE_DEVICES=0
+ time python ./tools/trainval_vgg16_net.py --weight data/imagenet_weights/vgg16.weights --imdb voc_2007_trainval --imdbval voc_2007_test --iters 70000 --cfg experiments/cfgs/vgg16.yml --set TRAIN.STEPSIZE 50000
I tensorflow/stream_executor/dso_loader.cc:128] successfully opened CUDA library libcublas.so locally
I tensorflow/stream_executor/dso_loader.cc:128] successfully opened CUDA library libcudnn.so locally
I tensorflow/stream_executor/dso_loader.cc:128] successfully opened CUDA library libcufft.so locally
I tensorflow/stream_executor/dso_loader.cc:128] successfully opened CUDA library libcuda.so.1 locally
I tensorflow/stream_executor/dso_loader.cc:128] successfully opened CUDA library libcurand.so locally
Called with args:
Namespace(cfg_file='experiments/cfgs/vgg16.yml', imdb_name='voc_2007_trainval', imdbval_name='voc_2007_test', max_iters=70000, set_cfgs=['TRAIN.STEPSIZE', '50000'], tag=None, weight='data/imagenet_weights/vgg16.weights')
Using config:
{'DATA_DIR': '/home/mona/computer_vision/tf-faster-rcnn/data',
 'DEDUP_BOXES': 0.0625,
 'EPS': 1e-14,
 'EXP_DIR': 'vgg16',
 'GPU_ID': 0,
 'MATLAB': 'matlab',
 'PIXEL_MEANS': array([[[ 102.9801,  115.9465,  122.7717]]]),
 'POOLING_MODE': 'crop',
 'RNG_SEED': 3,
 'ROOT_DIR': '/home/mona/computer_vision/tf-faster-rcnn',
 'TEST': {'BBOX_REG': True,
          'HAS_RPN': True,
          'MAX_SIZE': 1000,
          'MODE': 'nms',
          'NMS': 0.3,
          'PROPOSAL_METHOD': 'selective_search',
          'RPN_NMS_THRESH': 0.7,
          'RPN_POST_NMS_TOP_N': 300,
          'RPN_PRE_NMS_TOP_N': 6000,
          'RPN_TOP_N': 5000,
          'SCALES': [600],
          'SVM': False},
 'TRAIN': {'ASPECT_GROUPING': False,
           'BATCH_SIZE': 256,
           'BBOX_INSIDE_WEIGHTS': [1.0, 1.0, 1.0, 1.0],
           'BBOX_NORMALIZE_MEANS': [0.0, 0.0, 0.0, 0.0],
           'BBOX_NORMALIZE_STDS': [0.1, 0.1, 0.2, 0.2],
           'BBOX_NORMALIZE_TARGETS': True,
           'BBOX_NORMALIZE_TARGETS_PRECOMPUTED': True,
           'BBOX_REG': True,
           'BBOX_THRESH': 0.5,
           'BG_THRESH_HI': 0.5,
           'BG_THRESH_LO': 0.0,
           'BIAS_DECAY': False,
           'DISPLAY': 20,
           'DOUBLE_BIAS': True,
           'FG_FRACTION': 0.25,
           'FG_THRESH': 0.5,
           'GAMMA': 0.1,
           'HAS_RPN': True,
           'IMS_PER_BATCH': 1,
           'LEARNING_RATE': 0.001,
           'MAX_SIZE': 1000,
           'MOMENTUM': 0.9,
           'PROPOSAL_METHOD': 'gt',
           'RPN_BATCHSIZE': 256,
           'RPN_BBOX_INSIDE_WEIGHTS': [1.0, 1.0, 1.0, 1.0],
           'RPN_CLOBBER_POSITIVES': False,
           'RPN_FG_FRACTION': 0.5,
           'RPN_NEGATIVE_OVERLAP': 0.3,
           'RPN_NMS_THRESH': 0.7,
           'RPN_POSITIVE_OVERLAP': 0.7,
           'RPN_POSITIVE_WEIGHT': -1.0,
           'RPN_POST_NMS_TOP_N': 2000,
           'RPN_PRE_NMS_TOP_N': 12000,
           'SCALES': [600],
           'SNAPSHOT_ITERS': 5000,
           'SNAPSHOT_KEPT': 3,
           'SNAPSHOT_PREFIX': 'vgg16_faster_rcnn',
           'STEPSIZE': 50000,
           'SUMMARY_INTERVAL': 180,
           'TRUNCATED': False,
           'USE_FLIPPED': True,
           'USE_GT': False,
           'WEIGHT_DECAY': 0.0005},
 'USE_GPU_NMS': True}
Loaded dataset `voc_2007_trainval` for training
Set proposal method: gt
Appending horizontally-flipped training examples...
voc_2007_trainval gt roidb loaded from /home/mona/computer_vision/tf-faster-rcnn/data/cache/voc_2007_trainval_gt_roidb.pkl
done
Preparing training data...
done
10022 roidb entries
Output will be saved to `/home/mona/computer_vision/tf-faster-rcnn/output/vgg16/voc_2007_trainval/default`
TensorFlow summaries will be saved to `/home/mona/computer_vision/tf-faster-rcnn/tensorboard/vgg16/voc_2007_trainval/default`
Loaded dataset `voc_2007_test` for training
Set proposal method: gt
Preparing training data...
voc_2007_test gt roidb loaded from /home/mona/computer_vision/tf-faster-rcnn/data/cache/voc_2007_test_gt_roidb.pkl
done
4952 validation roidb entries
Filtered 0 roidb entries: 10022 -> 10022
Filtered 0 roidb entries: 4952 -> 4952
I tensorflow/core/common_runtime/gpu/gpu_device.cc:885] Found device 0 with properties: 
name: Tesla K40c
major: 3 minor: 5 memoryClockRate (GHz) 0.745
pciBusID 0000:03:00.0
Total memory: 11.92GiB
Free memory: 11.85GiB
I tensorflow/core/common_runtime/gpu/gpu_device.cc:906] DMA: 0 
I tensorflow/core/common_runtime/gpu/gpu_device.cc:916] 0:   Y 
I tensorflow/core/common_runtime/gpu/gpu_device.cc:975] Creating TensorFlow device (/gpu:0) -> (device: 0, name: Tesla K40c, pci bus id: 0000:03:00.0)
Solving...
Loading caffe weights...
Done!
/usr/local/lib/python2.7/dist-packages/tensorflow/python/ops/gradients_impl.py:91: UserWarning: Converting sparse IndexedSlices to a dense Tensor of unknown shape. This may consume a large amount of memory.
  "Converting sparse IndexedSlices to a dense Tensor of unknown shape. "
Loading initial model weights from data/imagenet_weights/vgg16.weights
Loaded.
iter: 20 / 70000, total loss: 0.443026
 >>> rpn_loss_cls: 0.345992
 >>> rpn_loss_box: 0.097034
 >>> loss_cls: 0.000000
 >>> loss_box: 0.000000
 >>> lr: 0.001000
speed: 1.749s / iter
iter: 40 / 70000, total loss: 0.516920
 >>> rpn_loss_cls: 0.399234
 >>> rpn_loss_box: 0.117686
 >>> loss_cls: 0.000000
 >>> loss_box: 0.000000
 >>> lr: 0.001000
speed: 1.760s / iter
iter: 60 / 70000, total loss: 0.393830
 >>> rpn_loss_cls: 0.353334
 >>> rpn_loss_box: 0.040496
 >>> loss_cls: 0.000000
 >>> loss_box: 0.000000
 >>> lr: 0.001000
speed: 1.668s / iter
iter: 80 / 70000, total loss: 0.217178
 >>> rpn_loss_cls: 0.146591
 >>> rpn_loss_box: 0.070533
 >>> loss_cls: 0.000053
 >>> loss_box: 0.000000
 >>> lr: 0.001000
speed: 1.536s / iter
iter: 100 / 70000, total loss: 0.390607
 >>> rpn_loss_cls: 0.277706
 >>> rpn_loss_box: 0.030601
 >>> loss_cls: 0.075361
 >>> loss_box: 0.006940
 >>> lr: 0.001000
speed: 1.495s / iter
iter: 120 / 70000, total loss: 0.882707
 >>> rpn_loss_cls: 0.566185
 >>> rpn_loss_box: 0.227990
 >>> loss_cls: 0.083081
 >>> loss_box: 0.005452
 >>> lr: 0.001000
speed: 1.570s / iter
iter: 140 / 70000, total loss: 0.223789
 >>> rpn_loss_cls: 0.113045
 >>> rpn_loss_box: 0.049687
 >>> loss_cls: 0.052417
 >>> loss_box: 0.008640
 >>> lr: 0.001000
speed: 1.510s / iter
iter: 160 / 70000, total loss: 0.219555
 >>> rpn_loss_cls: 0.187197
 >>> rpn_loss_box: 0.032358
 >>> loss_cls: 0.000000
 >>> loss_box: 0.000000
 >>> lr: 0.001000
speed: 1.494s / iter
iter: 180 / 70000, total loss: 2.256282
 >>> rpn_loss_cls: 1.965876
 >>> rpn_loss_box: 0.290406
 >>> loss_cls: 0.000000
 >>> loss_box: 0.000000
 >>> lr: 0.001000
speed: 1.475s / iter
iter: 200 / 70000, total loss: 1.727870
 >>> rpn_loss_cls: 1.226427
 >>> rpn_loss_box: 0.501443
 >>> loss_cls: 0.000000
 >>> loss_box: 0.000000
 >>> lr: 0.001000
speed: 1.463s / iter
iter: 220 / 70000, total loss: 0.353863
 >>> rpn_loss_cls: 0.298823
 >>> rpn_loss_box: 0.055040
 >>> loss_cls: 0.000000
 >>> loss_box: 0.000000
 >>> lr: 0.001000
speed: 1.461s / iter
iter: 240 / 70000, total loss: 0.147688
 >>> rpn_loss_cls: 0.039554
 >>> rpn_loss_box: 0.108122
 >>> loss_cls: 0.000012
 >>> loss_box: 0.000000
 >>> lr: 0.001000
speed: 1.450s / iter
iter: 260 / 70000, total loss: 0.485889
 >>> rpn_loss_cls: 0.416970
 >>> rpn_loss_box: 0.068911
 >>> loss_cls: 0.000009
 >>> loss_box: 0.000000
 >>> lr: 0.001000
speed: 1.428s / iter
iter: 280 / 70000, total loss: 0.153297
 >>> rpn_loss_cls: 0.108915
 >>> rpn_loss_box: 0.044243
 >>> loss_cls: 0.000139
 >>> loss_box: 0.000000
 >>> lr: 0.001000
speed: 1.440s / iter
iter: 300 / 70000, total loss: 0.374053
 >>> rpn_loss_cls: 0.310106
 >>> rpn_loss_box: 0.063945
 >>> loss_cls: 0.000001
 >>> loss_box: 0.000000
 >>> lr: 0.001000
speed: 1.397s / iter
iter: 320 / 70000, total loss: 1.169239
 >>> rpn_loss_cls: 1.099040
 >>> rpn_loss_box: 0.070199
 >>> loss_cls: 0.000000
 >>> loss_box: 0.000000
 >>> lr: 0.001000
speed: 1.385s / iter
iter: 340 / 70000, total loss: 0.243177
 >>> rpn_loss_cls: 0.193078
 >>> rpn_loss_box: 0.049057
 >>> loss_cls: 0.001042
 >>> loss_box: 0.000000
 >>> lr: 0.001000
speed: 1.370s / iter
iter: 360 / 70000, total loss: 0.387752
 >>> rpn_loss_cls: 0.375503
 >>> rpn_loss_box: 0.012084
 >>> loss_cls: 0.000166
 >>> loss_box: 0.000000
 >>> lr: 0.001000
speed: 1.353s / iter
iter: 380 / 70000, total loss: 0.494936
 >>> rpn_loss_cls: 0.312221
 >>> rpn_loss_box: 0.045870
 >>> loss_cls: 0.136845
 >>> loss_box: 0.000000
 >>> lr: 0.001000
speed: 1.336s / iter
/home/mona/computer_vision/tf-faster-rcnn/tools/../lib/model/bbox_transform.py:48: RuntimeWarning: overflow encountered in exp
  pred_w = np.exp(dw) * widths[:, np.newaxis]
/home/mona/computer_vision/tf-faster-rcnn/tools/../lib/model/bbox_transform.py:48: RuntimeWarning: overflow encountered in multiply
  pred_w = np.exp(dw) * widths[:, np.newaxis]
/home/mona/computer_vision/tf-faster-rcnn/tools/../lib/model/bbox_transform.py:49: RuntimeWarning: overflow encountered in exp
  pred_h = np.exp(dh) * heights[:, np.newaxis]
/home/mona/computer_vision/tf-faster-rcnn/tools/../lib/model/bbox_transform.py:49: RuntimeWarning: overflow encountered in multiply
  pred_h = np.exp(dh) * heights[:, np.newaxis]
/home/mona/computer_vision/tf-faster-rcnn/tools/../lib/model/bbox_transform.py:55: RuntimeWarning: invalid value encountered in subtract
  pred_boxes[:, 1::4] = pred_ctr_y - 0.5 * pred_h
iter: 400 / 70000, total loss: nan
 >>> rpn_loss_cls: nan
 >>> rpn_loss_box: nan
 >>> loss_cls: 3.037189
 >>> loss_box: 0.000000
 >>> lr: 0.001000
speed: 1.321s / iter
W tensorflow/core/framework/op_kernel.cc:975] Invalid argument: Nan in summary histogram for: TRAIN/vgg16_default/conv3_1/weight
     [[Node: TRAIN/vgg16_default/conv3_1/weight = HistogramSummary[T=DT_FLOAT, _device="/job:localhost/replica:0/task:0/cpu:0"](TRAIN/vgg16_default/conv3_1/weight/tag, vgg16_default/conv3_1/weight/read/_269)]]
Traceback (most recent call last):
  File "./tools/trainval_vgg16_net.py", line 117, in <module>
    max_iters=args.max_iters)
  File "/home/mona/computer_vision/tf-faster-rcnn/tools/../lib/model/train_val.py", line 304, in train_net
    sw.train_model(sess, max_iters)
  File "/home/mona/computer_vision/tf-faster-rcnn/tools/../lib/model/train_val.py", line 197, in train_model
    self.net.train_step_with_summary(sess, blobs, train_op)
  File "/home/mona/computer_vision/tf-faster-rcnn/tools/../lib/nets/vgg16.py", line 561, in train_step_with_summary
    feed_dict=feed_dict)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/client/session.py", line 766, in run
    run_metadata_ptr)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/client/session.py", line 964, in _run
    feed_dict_string, options, run_metadata)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/client/session.py", line 1014, in _do_run
    target_list, options, run_metadata)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/client/session.py", line 1034, in _do_call
    raise type(e)(node_def, op, message)
tensorflow.python.framework.errors_impl.InvalidArgumentError: Nan in summary histogram for: TRAIN/vgg16_default/conv3_1/weight
     [[Node: TRAIN/vgg16_default/conv3_1/weight = HistogramSummary[T=DT_FLOAT, _device="/job:localhost/replica:0/task:0/cpu:0"](TRAIN/vgg16_default/conv3_1/weight/tag, vgg16_default/conv3_1/weight/read/_269)]]

Caused by op u'TRAIN/vgg16_default/conv3_1/weight', defined at:
  File "./tools/trainval_vgg16_net.py", line 117, in <module>
    max_iters=args.max_iters)
  File "/home/mona/computer_vision/tf-faster-rcnn/tools/../lib/model/train_val.py", line 304, in train_net
    sw.train_model(sess, max_iters)
  File "/home/mona/computer_vision/tf-faster-rcnn/tools/../lib/model/train_val.py", line 91, in train_model
    tag='default', anchor_scales=anchors)
  File "/home/mona/computer_vision/tf-faster-rcnn/tools/../lib/nets/vgg16.py", line 507, in create_architecture
    self._add_train_summary(var)
  File "/home/mona/computer_vision/tf-faster-rcnn/tools/../lib/nets/vgg16.py", line 48, in _add_train_summary
    tf.summary.histogram('TRAIN/' + var.op.name, var)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/summary/summary.py", line 205, in histogram
    tag=scope.rstrip('/'), values=values, name=scope)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/ops/gen_logging_ops.py", line 139, in _histogram_summary
    name=name)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/framework/op_def_library.py", line 759, in apply_op
    op_def=op_def)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/framework/ops.py", line 2240, in create_op
    original_op=self._default_original_op, op_def=op_def)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/framework/ops.py", line 1128, in __init__
    self._traceback = _extract_stack()

InvalidArgumentError (see above for traceback): Nan in summary histogram for: TRAIN/vgg16_default/conv3_1/weight
     [[Node: TRAIN/vgg16_default/conv3_1/weight = HistogramSummary[T=DT_FLOAT, _device="/job:localhost/replica:0/task:0/cpu:0"](TRAIN/vgg16_default/conv3_1/weight/tag, vgg16_default/conv3_1/weight/read/_269)]]

E tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:652] Deallocating stream with pending work
Command exited with non-zero status 1
435.97user 110.56system 9:22.01elapsed 97%CPU (0avgtext+0avgdata 2976644maxresident)k
60224inputs+2752outputs (4major+2126190minor)pagefaults 0swaps
mona@pascal:~/computer_vision/tf-faster-rcnn$

endernewton commented 7 years ago

The zero losses look suspicious, and the loss should not be this low in the first iterations. Could you check whether the compilation for the K40 is the issue? By the way, the code has been updated a bit since then, so you may want to re-fork it.
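
For reference, the RuntimeWarning lines from bbox_transform.py just before iteration 400 hint at how the NaN arises: once a predicted dw or dh becomes very large, np.exp overflows to inf, and a later subtraction of two infinite terms yields NaN. Below is a minimal NumPy sketch of that mechanism, with a clamp as one possible guard. The function name decode_sizes and the bound MAX_DELTA are assumptions for illustration and are not taken from this repository; the clamp only hides the symptom and does not explain why the network diverges.

```python
import math
import numpy as np

# Hypothetical upper bound on log-space size deltas (an assumption,
# not a value defined in tf-faster-rcnn).
MAX_DELTA = math.log(1000.0 / 16.0)

def decode_sizes(deltas, sizes, clip=None):
    """Decode box widths or heights as exp(delta) * size, optionally clamped."""
    deltas = np.asarray(deltas, dtype=np.float32)
    sizes = np.asarray(sizes, dtype=np.float32)
    if clip is not None:
        # Cap the deltas so exp() stays within float32 range.
        deltas = np.minimum(deltas, clip)
    with np.errstate(over="ignore", invalid="ignore"):
        return np.exp(deltas) * sizes

# Without a clamp, exp(200) overflows float32 and becomes inf.
raw = decode_sizes([0.1, 200.0], [50.0, 50.0])

# With the clamp, every decoded size stays finite.
safe = decode_sizes([0.1, 200.0], [50.0, 50.0], clip=MAX_DELTA)

# If the center term has overflowed as well, center - 0.5 * size is
# inf - inf, which is NaN ("invalid value encountered in subtract").
with np.errstate(invalid="ignore"):
    center = np.float32(np.inf)
    print(np.isnan(center - 0.5 * raw[1]))
```

If the clamped version trains without NaN, that would suggest the regression outputs are blowing up rather than the input data being malformed, but this is only a diagnostic aid.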

monajalal commented 7 years ago

This is what happened after I did git pull and ran the training:

mona@pascal:~/computer_vision/tf-faster-rcnn$ git pull
remote: Counting objects: 3, done.
remote: Compressing objects: 100% (3/3), done.
remote: Total 3 (delta 0), reused 0 (delta 0), pack-reused 0
Unpacking objects: 100% (3/3), done.
From https://github.com/endernewton/tf-faster-rcnn
   9731cc0..83bc041  master     -> origin/master
Updating 9731cc0..83bc041
Fast-forward
 README.md | 18 +++++++++++++++++-
 1 file changed, 17 insertions(+), 1 deletion(-)

mona@pascal:~/computer_vision/tf-faster-rcnn$ GPU_ID=0
mona@pascal:~/computer_vision/tf-faster-rcnn$ ./experiments/scripts/test_vgg16.sh $GPU_ID pascal_voc
+ set -e
+ export PYTHONUNBUFFERED=True
+ PYTHONUNBUFFERED=True
+ GPU_ID=0
+ DATASET=pascal_voc
+ array=($@)
+ len=2
+ EXTRA_ARGS=
+ EXTRA_ARGS_SLUG=
+ case ${DATASET} in
+ TRAIN_IMDB=voc_2007_trainval
+ TEST_IMDB=voc_2007_test
+ ITERS=70000
++ date +%Y-%m-%d_%H-%M-%S
+ LOG=experiments/logs/test_vgg16_voc_2007_trainval_.txt.2017-02-15_15-23-01
+ exec
++ tee -a experiments/logs/test_vgg16_voc_2007_trainval_.txt.2017-02-15_15-23-01
tee: experiments/logs/test_vgg16_voc_2007_trainval_.txt.2017-02-15_15-23-01: No such file or directory
+ echo Logging output to experiments/logs/test_vgg16_voc_2007_trainval_.txt.2017-02-15_15-23-01
Logging output to experiments/logs/test_vgg16_voc_2007_trainval_.txt.2017-02-15_15-23-01
+ set +x
+ [[ ! -z '' ]]
+ CUDA_VISIBLE_DEVICES=0
+ time python ./tools/test_vgg16_net.py --imdb voc_2007_test --weight data/imagenet_weights/vgg16.weights --model output/vgg16/voc_2007_trainval/default/vgg16_faster_rcnn_iter_70000.ckpt --cfg experiments/cfgs/vgg16.yml --set
I tensorflow/stream_executor/dso_loader.cc:128] successfully opened CUDA library libcublas.so locally
I tensorflow/stream_executor/dso_loader.cc:128] successfully opened CUDA library libcudnn.so locally
I tensorflow/stream_executor/dso_loader.cc:128] successfully opened CUDA library libcufft.so locally
I tensorflow/stream_executor/dso_loader.cc:128] successfully opened CUDA library libcuda.so.1 locally
I tensorflow/stream_executor/dso_loader.cc:128] successfully opened CUDA library libcurand.so locally
Called with args:
Namespace(cfg_file='experiments/cfgs/vgg16.yml', comp_mode=False, imdb_name='voc_2007_test', max_per_image=100, model='output/vgg16/voc_2007_trainval/default/vgg16_faster_rcnn_iter_70000.ckpt', set_cfgs=[], tag='', weight='data/imagenet_weights/vgg16.weights')
Using config:
{'DATA_DIR': '/home/mona/computer_vision/tf-faster-rcnn/data',
 'DEDUP_BOXES': 0.0625,
 'EPS': 1e-14,
 'EXP_DIR': 'vgg16',
 'GPU_ID': 0,
 'MATLAB': 'matlab',
 'PIXEL_MEANS': array([[[ 102.9801,  115.9465,  122.7717]]]),
 'POOLING_MODE': 'crop',
 'RNG_SEED': 3,
 'ROOT_DIR': '/home/mona/computer_vision/tf-faster-rcnn',
 'TEST': {'BBOX_REG': True,
          'HAS_RPN': True,
          'MAX_SIZE': 1000,
          'MODE': 'nms',
          'NMS': 0.3,
          'PROPOSAL_METHOD': 'selective_search',
          'RPN_NMS_THRESH': 0.7,
          'RPN_POST_NMS_TOP_N': 300,
          'RPN_PRE_NMS_TOP_N': 6000,
          'RPN_TOP_N': 5000,
          'SCALES': [600],
          'SVM': False},
 'TRAIN': {'ASPECT_GROUPING': False,
           'BATCH_SIZE': 256,
           'BBOX_INSIDE_WEIGHTS': [1.0, 1.0, 1.0, 1.0],
           'BBOX_NORMALIZE_MEANS': [0.0, 0.0, 0.0, 0.0],
           'BBOX_NORMALIZE_STDS': [0.1, 0.1, 0.2, 0.2],
           'BBOX_NORMALIZE_TARGETS': True,
           'BBOX_NORMALIZE_TARGETS_PRECOMPUTED': True,
           'BBOX_REG': True,
           'BBOX_THRESH': 0.5,
           'BG_THRESH_HI': 0.5,
           'BG_THRESH_LO': 0.0,
           'BIAS_DECAY': False,
           'DISPLAY': 20,
           'DOUBLE_BIAS': True,
           'FG_FRACTION': 0.25,
           'FG_THRESH': 0.5,
           'GAMMA': 0.1,
           'HAS_RPN': True,
           'IMS_PER_BATCH': 1,
           'LEARNING_RATE': 0.001,
           'MAX_SIZE': 1000,
           'MOMENTUM': 0.9,
           'PROPOSAL_METHOD': 'gt',
           'RPN_BATCHSIZE': 256,
           'RPN_BBOX_INSIDE_WEIGHTS': [1.0, 1.0, 1.0, 1.0],
           'RPN_CLOBBER_POSITIVES': False,
           'RPN_FG_FRACTION': 0.5,
           'RPN_NEGATIVE_OVERLAP': 0.3,
           'RPN_NMS_THRESH': 0.7,
           'RPN_POSITIVE_OVERLAP': 0.7,
           'RPN_POSITIVE_WEIGHT': -1.0,
           'RPN_POST_NMS_TOP_N': 2000,
           'RPN_PRE_NMS_TOP_N': 12000,
           'SCALES': [600],
           'SNAPSHOT_ITERS': 5000,
           'SNAPSHOT_KEPT': 3,
           'SNAPSHOT_PREFIX': 'vgg16_faster_rcnn',
           'STEPSIZE': 30000,
           'SUMMARY_INTERVAL': 180,
           'TRUNCATED': False,
           'USE_FLIPPED': True,
           'USE_GT': False,
           'WEIGHT_DECAY': 0.0005},
 'USE_GPU_NMS': True}
I tensorflow/core/common_runtime/gpu/gpu_device.cc:885] Found device 0 with properties: 
name: Tesla K40c
major: 3 minor: 5 memoryClockRate (GHz) 0.745
pciBusID 0000:03:00.0
Total memory: 11.92GiB
Free memory: 11.85GiB
I tensorflow/core/common_runtime/gpu/gpu_device.cc:906] DMA: 0 
I tensorflow/core/common_runtime/gpu/gpu_device.cc:916] 0:   Y 
I tensorflow/core/common_runtime/gpu/gpu_device.cc:975] Creating TensorFlow device (/gpu:0) -> (device: 0, name: Tesla K40c, pci bus id: 0000:03:00.0)
Loading caffe weights...
Done!
Loading model check point from output/vgg16/voc_2007_trainval/default/vgg16_faster_rcnn_iter_70000.ckpt
W tensorflow/core/framework/op_kernel.cc:975] Not found: Unsuccessful TensorSliceReader constructor: Failed to find any matching files for output/vgg16/voc_2007_trainval/default/vgg16_faster_rcnn_iter_70000.ckpt
Traceback (most recent call last):
  File "./tools/test_vgg16_net.py", line 94, in <module>
    saver.restore(sess, args.model)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/saver.py", line 1388, in restore
    {self.saver_def.filename_tensor_name: save_path})
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/client/session.py", line 766, in run
    run_metadata_ptr)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/client/session.py", line 964, in _run
    feed_dict_string, options, run_metadata)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/client/session.py", line 1014, in _do_run
    target_list, options, run_metadata)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/client/session.py", line 1034, in _do_call
    raise type(e)(node_def, op, message)
tensorflow.python.framework.errors_impl.NotFoundError: Unsuccessful TensorSliceReader constructor: Failed to find any matching files for output/vgg16/voc_2007_trainval/default/vgg16_faster_rcnn_iter_70000.ckpt
     [[Node: save/RestoreV2_30 = RestoreV2[dtypes=[DT_FLOAT], _device="/job:localhost/replica:0/task:0/cpu:0"](_recv_save/Const_0, save/RestoreV2_30/tensor_names, save/RestoreV2_30/shape_and_slices)]]
     [[Node: save/RestoreV2_7/_135 = _Recv[client_terminated=false, recv_device="/job:localhost/replica:0/task:0/gpu:0", send_device="/job:localhost/replica:0/task:0/cpu:0", send_device_incarnation=1, tensor_name="edge_51_save/RestoreV2_7", tensor_type=DT_FLOAT, _device="/job:localhost/replica:0/task:0/gpu:0"]()]]

Caused by op u'save/RestoreV2_30', defined at:
  File "./tools/test_vgg16_net.py", line 93, in <module>
    saver = tf.train.Saver()
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/saver.py", line 1000, in __init__
    self.build()
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/saver.py", line 1030, in build
    restore_sequentially=self._restore_sequentially)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/saver.py", line 624, in build
    restore_sequentially, reshape)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/saver.py", line 361, in _AddRestoreOps
    tensors = self.restore_op(filename_tensor, saveable, preferred_shard)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/saver.py", line 200, in restore_op
    [spec.tensor.dtype])[0])
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/ops/gen_io_ops.py", line 441, in restore_v2
    dtypes=dtypes, name=name)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/framework/op_def_library.py", line 759, in apply_op
    op_def=op_def)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/framework/ops.py", line 2240, in create_op
    original_op=self._default_original_op, op_def=op_def)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/framework/ops.py", line 1128, in __init__
    self._traceback = _extract_stack()

NotFoundError (see above for traceback): Unsuccessful TensorSliceReader constructor: Failed to find any matching files for output/vgg16/voc_2007_trainval/default/vgg16_faster_rcnn_iter_70000.ckpt
     [[Node: save/RestoreV2_30 = RestoreV2[dtypes=[DT_FLOAT], _device="/job:localhost/replica:0/task:0/cpu:0"](_recv_save/Const_0, save/RestoreV2_30/tensor_names, save/RestoreV2_30/shape_and_slices)]]
     [[Node: save/RestoreV2_7/_135 = _Recv[client_terminated=false, recv_device="/job:localhost/replica:0/task:0/gpu:0", send_device="/job:localhost/replica:0/task:0/cpu:0", send_device_incarnation=1, tensor_name="edge_51_save/RestoreV2_7", tensor_type=DT_FLOAT, _device="/job:localhost/replica:0/task:0/gpu:0"]()]]

Command exited with non-zero status 1
6.03user 4.28system 0:07.47elapsed 138%CPU (0avgtext+0avgdata 2083556maxresident)k
0inputs+32outputs (0major+219829minor)pagefaults 0swaps
endernewton commented 7 years ago

You may need to download the trained model and create the symbolic links. It seems like the test script cannot find the model.
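
One quick way to confirm this before running the test script is to check for the checkpoint files on disk. The sketch below assumes the TensorFlow V2 checkpoint layout, in which the `.ckpt` prefix passed to `saver.restore` is not itself a file but is accompanied by a `.ckpt.index` file (the script in the log tests for that same `.ckpt.index` path). The helper name `checkpoint_exists` is hypothetical and not part of this repository.

```python
import os


def checkpoint_exists(prefix):
    """Return True if a checkpoint with this prefix appears to be on disk.

    Assumption: V2-format checkpoints, where `prefix` itself is not a file
    and the saved variables are indexed by `prefix + '.index'`.
    """
    return os.path.isfile(prefix + '.index')


if __name__ == '__main__':
    # Path copied from the log above.
    model = ('output/vgg16/voc_2007_trainval/default/'
             'vgg16_faster_rcnn_iter_70000.ckpt')
    if not checkpoint_exists(model):
        print('Missing checkpoint: finish training first, or download the '
              'trained model and link it into place.')
```

If this reports a missing checkpoint, the `NotFoundError` above is expected: the restore step has nothing to read.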

amirhfarzaneh commented 7 years ago

@monajalal I have the same problem: NaNs are appearing during my training. Have you fixed it?

yidan216home commented 7 years ago

@monajalal @amirhfarzaneh I now have the same problem, but it happens during iteration. Have you fixed it? I think I should check my training data; maybe there is a null column in it.
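
A check of the training annotations along these lines could look like the minimal sketch below. It assumes each box is an `(x1, y1, x2, y2)` tuple in 0-based, inclusive pixel coordinates; `find_bad_boxes` is a hypothetical helper, not something this repository provides.

```python
import math


def find_bad_boxes(boxes, width, height):
    """Return (index, reason) pairs for boxes that look invalid.

    Assumes boxes are (x1, y1, x2, y2) in 0-based, inclusive pixel
    coordinates for an image of the given width and height.
    """
    bad = []
    for i, (x1, y1, x2, y2) in enumerate(boxes):
        if not all(math.isfinite(v) for v in (x1, y1, x2, y2)):
            bad.append((i, 'non-finite coordinate'))
        elif x1 < 0 or y1 < 0:
            bad.append((i, 'negative coordinate'))
        elif x2 >= width or y2 >= height:
            bad.append((i, 'outside image'))
        elif x2 < x1 or y2 < y1:
            bad.append((i, 'inverted box'))
    return bad
```

Running this over every image's boxes before training gives a list of annotations worth inspecting by hand; an empty list does not prove the data is clean, only that these particular checks pass.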

amirhfarzaneh commented 7 years ago

@yidan216home I still have the problem on my GTX 980 Ti GPU, but I have tested on a Quadro M4000 and a GTX 1080 and there is no problem there. What is your GPU?

tituslungu commented 7 years ago

@dandelionmane this seems to be a long-standing problem, occurring for both NaNs and Infs. Can it be fixed?

zdm123 commented 7 years ago

@monajalal how did you figure out the problem?

lonlonago commented 7 years ago

Has anybody fixed the problem?

lonlonago commented 7 years ago

@monajalal, @zdm123, @amirhfarzaneh, @yidan216home, I got the same problem when training on my own data: the rpn_box_loss is NaN. After some research, I found the cause in the file 'pascal_voc.py'. The function '_load_pascal_annotation' makes pixel indexes 0-based with this code: `x1 = float(bbox.find('xmin').text) - 1`, `y1 = float(bbox.find('ymin').text) - 1`, `x2 = float(bbox.find('xmax').text) - 1`, `y2 = float(bbox.find('ymax').text) - 1`. If your data is not 1-based (mine is 0-based), a coordinate of 0 becomes -1. You can try deleting the `- 1` operation. Hope this helps!
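
The change described above can be sketched as a standalone parser. This is only an illustration under stated assumptions, not the repository's `_load_pascal_annotation`: the function name `load_boxes` and the `one_based` switch are hypothetical, and the XML is assumed to follow the VOC `<bndbox>` layout with `xmin`, `ymin`, `xmax`, and `ymax` children.

```python
import xml.etree.ElementTree as ET


def load_boxes(xml_text, one_based=True):
    """Parse VOC-style <bndbox> entries into 0-based (x1, y1, x2, y2) tuples.

    `one_based` is a hypothetical switch: leave it True for standard PASCAL
    VOC annotations, set it to False if your coordinates already start at 0.
    """
    offset = 1 if one_based else 0
    boxes = []
    for bbox in ET.fromstring(xml_text).iter('bndbox'):
        box = tuple(float(bbox.find(tag).text) - offset
                    for tag in ('xmin', 'ymin', 'xmax', 'ymax'))
        # A negative value here usually means the offset does not match
        # the convention used by the annotation tool.
        if min(box) < 0:
            raise ValueError('negative coordinate in %r; wrong offset?' % (box,))
        boxes.append(box)
    return boxes
```

Raising on a negative coordinate makes the mismatch visible when the annotations are loaded, rather than later as a NaN loss.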

endernewton commented 7 years ago

You may need to adjust the hyperparameters (e.g. the learning rate) if you are running on another dataset.

summerrr commented 6 years ago

My loss is very low at the beginning too. Do you know what might cause this problem?

lander1003 commented 3 years ago

> @monajalal, @zdm123, @amirhfarzaneh, @yidan216home, I got the same problem when training on my own data: the rpn_box_loss is NaN. After some research, I found the cause in the file 'pascal_voc.py'. The function '_load_pascal_annotation' makes pixel indexes 0-based with this code: `x1 = float(bbox.find('xmin').text) - 1`, `y1 = float(bbox.find('ymin').text) - 1`, `x2 = float(bbox.find('xmax').text) - 1`, `y2 = float(bbox.find('ymax').text) - 1`. If your data is not 1-based (mine is 0-based), a coordinate of 0 becomes -1. You can try deleting the `- 1` operation. Hope this helps!

@lonlonago Great, that solves my problem. Thank you very much!