Open monajalal opened 7 years ago
The zero losses look suspicious, and the loss should not be this low in the first iterations. Could you check whether the compilation on the K40 is the issue? BTW, the code has been updated a bit since then, so you may want to re-fork it.
This is what happened after I did git pull and ran the training:
mona@pascal:~/computer_vision/tf-faster-rcnn$ git pull
remote: Counting objects: 3, done.
remote: Compressing objects: 100% (3/3), done.
remote: Total 3 (delta 0), reused 0 (delta 0), pack-reused 0
Unpacking objects: 100% (3/3), done.
From https://github.com/endernewton/tf-faster-rcnn
9731cc0..83bc041 master -> origin/master
Updating 9731cc0..83bc041
Fast-forward
README.md | 18 +++++++++++++++++-
1 file changed, 17 insertions(+), 1 deletion(-)
mona@pascal:~/computer_vision/tf-faster-rcnn$ GPU_ID=0
mona@pascal:~/computer_vision/tf-faster-rcnn$ ./experiments/scripts/test_vgg16.sh $GPU_ID pascal_voc
+ set -e
+ export PYTHONUNBUFFERED=True
+ PYTHONUNBUFFERED=True
+ GPU_ID=0
+ DATASET=pascal_voc
+ array=($@)
+ len=2
+ EXTRA_ARGS=
+ EXTRA_ARGS_SLUG=
+ case ${DATASET} in
+ TRAIN_IMDB=voc_2007_trainval
+ TEST_IMDB=voc_2007_test
+ ITERS=70000
++ date +%Y-%m-%d_%H-%M-%S
+ LOG=experiments/logs/test_vgg16_voc_2007_trainval_.txt.2017-02-15_15-23-01
+ exec
++ tee -a experiments/logs/test_vgg16_voc_2007_trainval_.txt.2017-02-15_15-23-01
tee: experiments/logs/test_vgg16_voc_2007_trainval_.txt.2017-02-15_15-23-01: No such file or directory
+ echo Logging output to experiments/logs/test_vgg16_voc_2007_trainval_.txt.2017-02-15_15-23-01
Logging output to experiments/logs/test_vgg16_voc_2007_trainval_.txt.2017-02-15_15-23-01
+ set +x
+ [[ ! -z '' ]]
+ CUDA_VISIBLE_DEVICES=0
+ time python ./tools/test_vgg16_net.py --imdb voc_2007_test --weight data/imagenet_weights/vgg16.weights --model output/vgg16/voc_2007_trainval/default/vgg16_faster_rcnn_iter_70000.ckpt --cfg experiments/cfgs/vgg16.yml --set
I tensorflow/stream_executor/dso_loader.cc:128] successfully opened CUDA library libcublas.so locally
I tensorflow/stream_executor/dso_loader.cc:128] successfully opened CUDA library libcudnn.so locally
I tensorflow/stream_executor/dso_loader.cc:128] successfully opened CUDA library libcufft.so locally
I tensorflow/stream_executor/dso_loader.cc:128] successfully opened CUDA library libcuda.so.1 locally
I tensorflow/stream_executor/dso_loader.cc:128] successfully opened CUDA library libcurand.so locally
Called with args:
Namespace(cfg_file='experiments/cfgs/vgg16.yml', comp_mode=False, imdb_name='voc_2007_test', max_per_image=100, model='output/vgg16/voc_2007_trainval/default/vgg16_faster_rcnn_iter_70000.ckpt', set_cfgs=[], tag='', weight='data/imagenet_weights/vgg16.weights')
Using config:
{'DATA_DIR': '/home/mona/computer_vision/tf-faster-rcnn/data',
'DEDUP_BOXES': 0.0625,
'EPS': 1e-14,
'EXP_DIR': 'vgg16',
'GPU_ID': 0,
'MATLAB': 'matlab',
'PIXEL_MEANS': array([[[ 102.9801, 115.9465, 122.7717]]]),
'POOLING_MODE': 'crop',
'RNG_SEED': 3,
'ROOT_DIR': '/home/mona/computer_vision/tf-faster-rcnn',
'TEST': {'BBOX_REG': True,
'HAS_RPN': True,
'MAX_SIZE': 1000,
'MODE': 'nms',
'NMS': 0.3,
'PROPOSAL_METHOD': 'selective_search',
'RPN_NMS_THRESH': 0.7,
'RPN_POST_NMS_TOP_N': 300,
'RPN_PRE_NMS_TOP_N': 6000,
'RPN_TOP_N': 5000,
'SCALES': [600],
'SVM': False},
'TRAIN': {'ASPECT_GROUPING': False,
'BATCH_SIZE': 256,
'BBOX_INSIDE_WEIGHTS': [1.0, 1.0, 1.0, 1.0],
'BBOX_NORMALIZE_MEANS': [0.0, 0.0, 0.0, 0.0],
'BBOX_NORMALIZE_STDS': [0.1, 0.1, 0.2, 0.2],
'BBOX_NORMALIZE_TARGETS': True,
'BBOX_NORMALIZE_TARGETS_PRECOMPUTED': True,
'BBOX_REG': True,
'BBOX_THRESH': 0.5,
'BG_THRESH_HI': 0.5,
'BG_THRESH_LO': 0.0,
'BIAS_DECAY': False,
'DISPLAY': 20,
'DOUBLE_BIAS': True,
'FG_FRACTION': 0.25,
'FG_THRESH': 0.5,
'GAMMA': 0.1,
'HAS_RPN': True,
'IMS_PER_BATCH': 1,
'LEARNING_RATE': 0.001,
'MAX_SIZE': 1000,
'MOMENTUM': 0.9,
'PROPOSAL_METHOD': 'gt',
'RPN_BATCHSIZE': 256,
'RPN_BBOX_INSIDE_WEIGHTS': [1.0, 1.0, 1.0, 1.0],
'RPN_CLOBBER_POSITIVES': False,
'RPN_FG_FRACTION': 0.5,
'RPN_NEGATIVE_OVERLAP': 0.3,
'RPN_NMS_THRESH': 0.7,
'RPN_POSITIVE_OVERLAP': 0.7,
'RPN_POSITIVE_WEIGHT': -1.0,
'RPN_POST_NMS_TOP_N': 2000,
'RPN_PRE_NMS_TOP_N': 12000,
'SCALES': [600],
'SNAPSHOT_ITERS': 5000,
'SNAPSHOT_KEPT': 3,
'SNAPSHOT_PREFIX': 'vgg16_faster_rcnn',
'STEPSIZE': 30000,
'SUMMARY_INTERVAL': 180,
'TRUNCATED': False,
'USE_FLIPPED': True,
'USE_GT': False,
'WEIGHT_DECAY': 0.0005},
'USE_GPU_NMS': True}
I tensorflow/core/common_runtime/gpu/gpu_device.cc:885] Found device 0 with properties:
name: Tesla K40c
major: 3 minor: 5 memoryClockRate (GHz) 0.745
pciBusID 0000:03:00.0
Total memory: 11.92GiB
Free memory: 11.85GiB
I tensorflow/core/common_runtime/gpu/gpu_device.cc:906] DMA: 0
I tensorflow/core/common_runtime/gpu/gpu_device.cc:916] 0: Y
I tensorflow/core/common_runtime/gpu/gpu_device.cc:975] Creating TensorFlow device (/gpu:0) -> (device: 0, name: Tesla K40c, pci bus id: 0000:03:00.0)
Loading caffe weights...
Done!
Loading model check point from output/vgg16/voc_2007_trainval/default/vgg16_faster_rcnn_iter_70000.ckpt
W tensorflow/core/framework/op_kernel.cc:975] Not found: Unsuccessful TensorSliceReader constructor: Failed to find any matching files for output/vgg16/voc_2007_trainval/default/vgg16_faster_rcnn_iter_70000.ckpt
Traceback (most recent call last):
File "./tools/test_vgg16_net.py", line 94, in <module>
saver.restore(sess, args.model)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/saver.py", line 1388, in restore
{self.saver_def.filename_tensor_name: save_path})
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/client/session.py", line 766, in run
run_metadata_ptr)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/client/session.py", line 964, in _run
feed_dict_string, options, run_metadata)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/client/session.py", line 1014, in _do_run
target_list, options, run_metadata)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/client/session.py", line 1034, in _do_call
raise type(e)(node_def, op, message)
tensorflow.python.framework.errors_impl.NotFoundError: Unsuccessful TensorSliceReader constructor: Failed to find any matching files for output/vgg16/voc_2007_trainval/default/vgg16_faster_rcnn_iter_70000.ckpt
[[Node: save/RestoreV2_30 = RestoreV2[dtypes=[DT_FLOAT], _device="/job:localhost/replica:0/task:0/cpu:0"](_recv_save/Const_0, save/RestoreV2_30/tensor_names, save/RestoreV2_30/shape_and_slices)]]
[[Node: save/RestoreV2_7/_135 = _Recv[client_terminated=false, recv_device="/job:localhost/replica:0/task:0/gpu:0", send_device="/job:localhost/replica:0/task:0/cpu:0", send_device_incarnation=1, tensor_name="edge_51_save/RestoreV2_7", tensor_type=DT_FLOAT, _device="/job:localhost/replica:0/task:0/gpu:0"]()]]
Caused by op u'save/RestoreV2_30', defined at:
File "./tools/test_vgg16_net.py", line 93, in <module>
saver = tf.train.Saver()
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/saver.py", line 1000, in __init__
self.build()
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/saver.py", line 1030, in build
restore_sequentially=self._restore_sequentially)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/saver.py", line 624, in build
restore_sequentially, reshape)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/saver.py", line 361, in _AddRestoreOps
tensors = self.restore_op(filename_tensor, saveable, preferred_shard)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/saver.py", line 200, in restore_op
[spec.tensor.dtype])[0])
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/ops/gen_io_ops.py", line 441, in restore_v2
dtypes=dtypes, name=name)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/framework/op_def_library.py", line 759, in apply_op
op_def=op_def)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/framework/ops.py", line 2240, in create_op
original_op=self._default_original_op, op_def=op_def)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/framework/ops.py", line 1128, in __init__
self._traceback = _extract_stack()
NotFoundError (see above for traceback): Unsuccessful TensorSliceReader constructor: Failed to find any matching files for output/vgg16/voc_2007_trainval/default/vgg16_faster_rcnn_iter_70000.ckpt
[[Node: save/RestoreV2_30 = RestoreV2[dtypes=[DT_FLOAT], _device="/job:localhost/replica:0/task:0/cpu:0"](_recv_save/Const_0, save/RestoreV2_30/tensor_names, save/RestoreV2_30/shape_and_slices)]]
[[Node: save/RestoreV2_7/_135 = _Recv[client_terminated=false, recv_device="/job:localhost/replica:0/task:0/gpu:0", send_device="/job:localhost/replica:0/task:0/cpu:0", send_device_incarnation=1, tensor_name="edge_51_save/RestoreV2_7", tensor_type=DT_FLOAT, _device="/job:localhost/replica:0/task:0/gpu:0"]()]]
Command exited with non-zero status 1
6.03user 4.28system 0:07.47elapsed 138%CPU (0avgtext+0avgdata 2083556maxresident)k
0inputs+32outputs (0major+219829minor)pagefaults 0swaps
You need to download the trained model and create the symbolic links. It seems the script cannot find the model checkpoint.
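The NotFoundError in the log says that no files match the checkpoint prefix. Below is a minimal sketch, not part of the repository, for checking this before running the test script. It assumes the checkpoint was written in TensorFlow's V2 format, where the prefix passed to saver.restore corresponds to files such as <prefix>.index and <prefix>.data-*. The helper name checkpoint_files is hypothetical.

```python
import os


def checkpoint_files(prefix):
    """Return the sorted paths of all files whose names start with the prefix."""
    directory, base = os.path.split(prefix)
    directory = directory or "."
    if not os.path.isdir(directory):
        # The output directory itself is missing (e.g. no symlink was created).
        return []
    return sorted(os.path.join(directory, name)
                  for name in os.listdir(directory)
                  if name.startswith(base))


if __name__ == "__main__":
    # Prefix copied from the failing command in the log.
    prefix = ("output/vgg16/voc_2007_trainval/default/"
              "vgg16_faster_rcnn_iter_70000.ckpt")
    found = checkpoint_files(prefix)
    if found:
        print("Found checkpoint files:")
        for path in found:
            print("  " + path)
    else:
        print("No files match " + prefix)
```

If the list is empty, the model has either not been downloaded or trained yet, or the output directory is not linked where the script expects it.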
@monajalal I have the same problem. NaNs are appearing in my training. Have you fixed it?
@monajalal @amirhfarzaneh I now have the same problem, but it happened partway through the iterations. Have you fixed it? I think I should check my training data; maybe there is a null column in it.
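One way to check the training data is to scan the ground-truth boxes for values that cannot produce a valid regression target. This is only a sketch under stated assumptions: boxes are (x1, y1, x2, y2) tuples in 0-based pixel coordinates, and the function name find_bad_boxes is made up for illustration rather than taken from the repository.

```python
import math


def find_bad_boxes(boxes, width, height):
    """Return (index, reason) pairs for boxes that look unusable for training.

    boxes: sequence of (x1, y1, x2, y2) in 0-based pixel coordinates.
    width, height: image size in pixels.
    """
    bad = []
    for i, (x1, y1, x2, y2) in enumerate(boxes):
        coords = (x1, y1, x2, y2)
        if not all(math.isfinite(c) for c in coords):
            bad.append((i, "non-finite coordinate"))
        elif min(coords) < 0:
            bad.append((i, "negative coordinate"))
        elif x2 >= width or y2 >= height:
            bad.append((i, "outside image"))
        elif x2 < x1 or y2 < y1:
            bad.append((i, "inverted box"))
    return bad


if __name__ == "__main__":
    boxes = [(10, 20, 50, 80), (-1, 5, 30, 40), (60, 10, 40, 90)]
    for index, reason in find_bad_boxes(boxes, width=100, height=100):
        print("box %d: %s" % (index, reason))
```

Running a check like this over every image before training makes it easier to tell a data problem apart from a GPU or build problem.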
@yidan216home I still have the problem on my GTX 980 Ti GPU, but I have tested on a Quadro M4000 and a GTX 1080 and there was no problem. What is your GPU?
@dandelionmane this seems to be a long-standing problem, occurring for both NaNs and Infs. Can it be fixed?
@monajalal how did you figure out the problem?
Has anybody fixed the problem?
@monajalal, @zdm123, @amirhfarzaneh, @yidan216home: I got the same problem when training on my own data: the rpn_box_loss is NaN. After some research, I found the cause in the file 'pascal_voc.py'. The function '_load_pascal_annotation' makes the pixel indexes 0-based with this code:
x1 = float(bbox.find('xmin').text) - 1
y1 = float(bbox.find('ymin').text) - 1
x2 = float(bbox.find('xmax').text) - 1
y2 = float(bbox.find('ymax').text) - 1
If your data is not 1-based (mine is 0-based), this produces -1 in the coordinates. You can try deleting the "- 1" operation. Hope this helps!
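As an alternative to deleting the subtraction outright, the offset can be made explicit. The following is a rough sketch, not the repository's _load_pascal_annotation: the function load_boxes and its offset parameter are hypothetical, and it assumes PASCAL VOC style XML with <object>/<bndbox> elements.

```python
import xml.etree.ElementTree as ET


def load_boxes(xml_text, offset=1):
    """Parse <bndbox> entries and subtract `offset` from every coordinate.

    Use offset=1 for 1-based annotations (the PASCAL VOC convention) and
    offset=0 for annotations that are already 0-based.
    """
    root = ET.fromstring(xml_text)
    boxes = []
    for obj in root.findall("object"):
        bbox = obj.find("bndbox")
        box = tuple(float(bbox.find(tag).text) - offset
                    for tag in ("xmin", "ymin", "xmax", "ymax"))
        if min(box) < 0:
            # A negative value usually means the data was already 0-based.
            raise ValueError("negative coordinate %r: wrong offset?" % (box,))
        boxes.append(box)
    return boxes


if __name__ == "__main__":
    annotation = """
    <annotation>
      <object>
        <bndbox><xmin>0</xmin><ymin>4</ymin><xmax>30</xmax><ymax>40</ymax></bndbox>
      </object>
    </annotation>
    """
    print(load_boxes(annotation, offset=0))
```

With offset=1 the same annotation raises a ValueError, which surfaces the indexing mismatch at load time instead of as a NaN loss later.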
You may need to adjust the hyperparameters (e.g. the learning rate) if you are running on another dataset.
My loss is very low at the beginning too. Do you know what might cause this problem?
@lonlonago great, that solves my problem. Thank you very much!
I followed your instruction and got this error. Can you please suggest solutions?