balancap / SSD-Tensorflow

Single Shot MultiBox Detector in TensorFlow

UnknownError (see above for traceback): Failed to get convolution algorithm #348

Open CWF-999 opened 5 years ago

CWF-999 commented 5 years ago

Traceback (most recent call last):
  File "D:\Python3.5\lib\site-packages\tensorflow\python\training\supervisor.py", line 994, in managed_session
    yield sess
  File "D:\Python3.5\lib\site-packages\tensorflow\contrib\slim\python\slim\learning.py", line 770, in train
    sess, train_op, global_step, train_step_kwargs)
  File "D:\Python3.5\lib\site-packages\tensorflow\contrib\slim\python\slim\learning.py", line 487, in train_step
    run_metadata=run_metadata)
  File "D:\Python3.5\lib\site-packages\tensorflow\python\client\session.py", line 929, in run
    run_metadata_ptr)
  File "D:\Python3.5\lib\site-packages\tensorflow\python\client\session.py", line 1152, in _run
    feed_dict_tensor, options, run_metadata)
  File "D:\Python3.5\lib\site-packages\tensorflow\python\client\session.py", line 1328, in _do_run
    run_metadata)
  File "D:\Python3.5\lib\site-packages\tensorflow\python\client\session.py", line 1348, in _do_call
    raise type(e)(node_def, op, message)
tensorflow.python.framework.errors_impl.UnknownError: Failed to get convolution algorithm. This is probably because cuDNN failed to initialize, so try looking to see if a warning log message was printed above.
  [[node ssd_300_vgg/conv1/conv1_1/Conv2D (defined at D:\pythoncode\SSD-Tensorflow-master\nets\ssd_vgg_300.py:463) ]]
  [[node gradients/AddN_42 (defined at D:\pythoncode\SSD-Tensorflow-master\deployment\model_deploy.py:265) ]]

Caused by op 'ssd_300_vgg/conv1/conv1_1/Conv2D', defined at:
  File "train_ssd_network.py", line 390, in <module>
    tf.app.run()
  File "D:\Python3.5\lib\site-packages\tensorflow\python\platform\app.py", line 125, in run
    _sys.exit(main(argv))
  File "train_ssd_network.py", line 291, in main
    clones = model_deploy.create_clones(deploy_config, clone_fn, [batch_queue])
  File "D:\pythoncode\SSD-Tensorflow-master\deployment\model_deploy.py", line 196, in create_clones
    outputs = model_fn(*args, **kwargs)
  File "train_ssd_network.py", line 275, in clone_fn
    ssd_net.net(b_image, is_training=True)
  File "D:\pythoncode\SSD-Tensorflow-master\nets\ssd_vgg_300.py", line 158, in net
    scope=scope)
  File "D:\pythoncode\SSD-Tensorflow-master\nets\ssd_vgg_300.py", line 463, in ssd_net
    net = slim.repeat(inputs, 2, slim.conv2d, 64, [3, 3], scope='conv1')  # first conv block of the VGG16 network: the convolution is repeated 2 times, with 3x3 kernels and 64 feature maps
  File "D:\Python3.5\lib\site-packages\tensorflow\contrib\layers\python\layers\layers.py", line 2613, in repeat
    outputs = layer(outputs, *args, **kwargs)
  File "D:\Python3.5\lib\site-packages\tensorflow\contrib\framework\python\ops\arg_scope.py", line 182, in func_with_args
    return func(*args, **current_args)
  File "D:\Python3.5\lib\site-packages\tensorflow\contrib\layers\python\layers\layers.py", line 1155, in convolution2d
    conv_dims=2)
  File "D:\Python3.5\lib\site-packages\tensorflow\contrib\framework\python\ops\arg_scope.py", line 182, in func_with_args
    return func(*args, **current_args)
  File "D:\Python3.5\lib\site-packages\tensorflow\contrib\layers\python\layers\layers.py", line 1058, in convolution
    outputs = layer.apply(inputs)
  File "D:\Python3.5\lib\site-packages\tensorflow\python\keras\engine\base_layer.py", line 1227, in apply
    return self.__call__(inputs, *args, **kwargs)
  File "D:\Python3.5\lib\site-packages\tensorflow\python\layers\base.py", line 530, in __call__
    outputs = super(Layer, self).__call__(inputs, *args, **kwargs)
  File "D:\Python3.5\lib\site-packages\tensorflow\python\keras\engine\base_layer.py", line 554, in __call__
    outputs = self.call(inputs, *args, **kwargs)
  File "D:\Python3.5\lib\site-packages\tensorflow\python\keras\layers\convolutional.py", line 194, in call
    outputs = self._convolution_op(inputs, self.kernel)
  File "D:\Python3.5\lib\site-packages\tensorflow\python\ops\nn_ops.py", line 966, in __call__
    return self.conv_op(inp, filter)
  File "D:\Python3.5\lib\site-packages\tensorflow\python\ops\nn_ops.py", line 591, in __call__
    return self.call(inp, filter)
  File "D:\Python3.5\lib\site-packages\tensorflow\python\ops\nn_ops.py", line 208, in __call__
    name=self.name)
  File "D:\Python3.5\lib\site-packages\tensorflow\python\ops\gen_nn_ops.py", line 1113, in conv2d
    data_format=data_format, dilations=dilations, name=name)
  File "D:\Python3.5\lib\site-packages\tensorflow\python\framework\op_def_library.py", line 788, in _apply_op_helper
    op_def=op_def)
  File "D:\Python3.5\lib\site-packages\tensorflow\python\util\deprecation.py", line 507, in new_func
    return func(*args, **kwargs)
  File "D:\Python3.5\lib\site-packages\tensorflow\python\framework\ops.py", line 3300, in create_op
    op_def=op_def)
  File "D:\Python3.5\lib\site-packages\tensorflow\python\framework\ops.py", line 1801, in __init__
    self._traceback = tf_stack.extract_stack()

UnknownError (see above for traceback): Failed to get convolution algorithm. This is probably because cuDNN failed to initialize, so try looking to see if a warning log message was printed above. [[node ssd_300_vgg/conv1/conv1_1/Conv2D (defined at D:\pythoncode\SSD-Tensorflow-master\nets\ssd_vgg_300.py:463) ]] [[node ssd_losses/cross_entropy_pos/value (defined at D:\pythoncode\SSD-Tensorflow-master\nets\ssd_vgg_300.py:653) ]]

I checked the related issues, and most of them attribute this error to mismatched CUDA and cuDNN versions. However, I do not see similar problems when I run other projects, and eval_ssd_network.py also runs without this error. I would be grateful to anyone who can help me find the answer.

LiZeB commented 4 years ago

I ran into this problem as well. Fortunately, I found a solution in this blog post: https://blog.csdn.net/qq_41868689/article/details/98503069 . You only need to change one line (line 352) in the train_ssd_network.py script, from gpu_options = tf.GPUOptions(per_process_gpu_memory_fraction=FLAGS.gpu_memory_fraction) to gpu_options = tf.GPUOptions(per_process_gpu_memory_fraction=FLAGS.gpu_memory_fraction, allow_growth=True). After that, the training script should run.
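For anyone who wants to try the same change outside of train_ssd_network.py, here is a minimal sketch of the idea. With allow_growth=True, TensorFlow allocates GPU memory on demand instead of reserving it all at startup, which is the likely reason cuDNN can then initialize. The helper names gpu_options_kwargs and make_session_config are hypothetical (they are not part of the repository), and the sketch assumes a TensorFlow build that exposes GPUOptions and ConfigProto either at the top level or under tf.compat.v1.

```python
def gpu_options_kwargs(memory_fraction, allow_growth=True):
    """Keyword arguments for tf.GPUOptions (hypothetical helper, not from the repo).

    allow_growth=True asks TensorFlow to grow its GPU allocation on demand
    rather than claiming the memory up front.
    """
    return {
        "per_process_gpu_memory_fraction": memory_fraction,
        "allow_growth": allow_growth,
    }


def make_session_config(memory_fraction, allow_growth=True):
    """Build a session config with the options above (hypothetical helper)."""
    # Imported lazily so the helper above can be used without TensorFlow.
    import tensorflow as tf

    # Assumption: use the TF1-style API via tf.compat.v1 when it is present,
    # otherwise fall back to the top-level module of older TF1 releases.
    tf1 = getattr(tf.compat, "v1", tf)
    gpu_options = tf1.GPUOptions(**gpu_options_kwargs(memory_fraction, allow_growth))
    return tf1.ConfigProto(gpu_options=gpu_options)
```

In the training script the fraction comes from FLAGS.gpu_memory_fraction; calling make_session_config(FLAGS.gpu_memory_fraction) would produce a config equivalent in spirit to the one-line change above, which can then be passed wherever the script passes its session config.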