argman / EAST

A tensorflow implementation of EAST text detector
GNU General Public License v3.0

Training works fine, but testing the model raises UnknownError (see above for traceback): Failed to get convolution algorithm. This is probably because cuDNN failed to initialize, so try looking to see if a warning log message was printed above. #347

Open shakey-cuimiao opened 4 years ago

shakey-cuimiao commented 4 years ago

pciBusID: 0000:83:00.0 totalMemory: 10.76GiB freeMemory: 2.03GiB
2020-04-15 18:38:22.765241: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1512] Adding visible gpu devices: 0
2020-04-15 18:38:22.766743: I tensorflow/core/common_runtime/gpu/gpu_device.cc:984] Device interconnect StreamExecutor with strength 1 edge matrix:
2020-04-15 18:38:22.766773: I tensorflow/core/common_runtime/gpu/gpu_device.cc:990] 0
2020-04-15 18:38:22.766789: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1003] 0: N
2020-04-15 18:38:22.766950: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1115] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 1776 MB memory) -> physical GPU (device: 0, name: GeForce RTX 2080 Ti, pci bus id: 0000:83:00.0, compute capability: 7.5)
Restore from ./east_icdar2015_resnet_v1_50_rbox/model.ckpt-49491
WARNING:tensorflow:From /opt/shakey/.local/lib/python3.6/site-packages/tensorflow/python/training/saver.py:1266: checkpoint_exists (from tensorflow.python.training.checkpoint_management) is deprecated and will be removed in a future version.
Instructions for updating:
Use standard file APIs to check for files with this prefix.
Find 6 images
2020-04-15 18:38:30.125680: E tensorflow/stream_executor/cuda/cuda_dnn.cc:334] Could not create cudnn handle: CUDNN_STATUS_INTERNAL_ERROR
2020-04-15 18:38:30.188806: E tensorflow/stream_executor/cuda/cuda_dnn.cc:334] Could not create cudnn handle: CUDNN_STATUS_INTERNAL_ERROR
Traceback (most recent call last):
  File "/opt/shakey/.local/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1334, in _do_call
    return fn(*args)
  File "/opt/shakey/.local/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1319, in _run_fn
    options, feed_dict, fetch_list, target_list, run_metadata)
  File "/opt/shakey/.local/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1407, in _call_tf_sessionrun
    run_metadata)
tensorflow.python.framework.errors_impl.UnknownError: Failed to get convolution algorithm. This is probably because cuDNN failed to initialize, so try looking to see if a warning log message was printed above.
  [[{{node resnet_v1_50/conv1/Conv2D}}]]
  [[{{node feature_fusion/concat_3}}]]

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "eval.py", line 196, in <module>
    tf.app.run()
  File "/opt/shakey/.local/lib/python3.6/site-packages/tensorflow/python/platform/app.py", line 125, in run
    _sys.exit(main(argv))
  File "eval.py", line 159, in main
    score, geometry = sess.run([f_score, f_geometry], feed_dict={input_images: [im_resized]})
  File "/opt/shakey/.local/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 929, in run
    run_metadata_ptr)
  File "/opt/shakey/.local/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1152, in _run
    feed_dict_tensor, options, run_metadata)
  File "/opt/shakey/.local/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1328, in _do_run
    run_metadata)
  File "/opt/shakey/.local/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1348, in _do_call
    raise type(e)(node_def, op, message)
tensorflow.python.framework.errors_impl.UnknownError: Failed to get convolution algorithm. This is probably because cuDNN failed to initialize, so try looking to see if a warning log message was printed above.
  [[node resnet_v1_50/conv1/Conv2D (defined at /opt/shakey/deep-learning/EAST/nets/resnet_utils.py:122) ]]
  [[node feature_fusion/concat_3 (defined at /opt/shakey/deep-learning/EAST/model.py:80) ]]

Caused by op 'resnet_v1_50/conv1/Conv2D', defined at:
  File "eval.py", line 196, in <module>
    tf.app.run()
  File "/opt/shakey/.local/lib/python3.6/site-packages/tensorflow/python/platform/app.py", line 125, in run
    _sys.exit(main(argv))
  File "eval.py", line 140, in main
    f_score, f_geometry = model.model(input_images, is_training=False)
  File "/opt/shakey/deep-learning/EAST/model.py", line 40, in model
    logits, end_points = resnet_v1.resnet_v1_50(images, is_training=is_training, scope='resnet_v1_50')
  File "/opt/shakey/deep-learning/EAST/nets/resnet_v1.py", line 252, in resnet_v1_50
    reuse=reuse, scope=scope)
  File "/opt/shakey/deep-learning/EAST/nets/resnet_v1.py", line 193, in resnet_v1
    net = resnet_utils.conv2d_same(net, 64, 7, stride=2, scope='conv1')
  File "/opt/shakey/deep-learning/EAST/nets/resnet_utils.py", line 122, in conv2d_same
    rate=rate, padding='VALID', scope=scope)
  File "/opt/shakey/.local/lib/python3.6/site-packages/tensorflow/contrib/framework/python/ops/arg_scope.py", line 182, in func_with_args
    return func(*args, **current_args)
  File "/opt/shakey/.local/lib/python3.6/site-packages/tensorflow/contrib/layers/python/layers/layers.py", line 1155, in convolution2d
    conv_dims=2)
  File "/opt/shakey/.local/lib/python3.6/site-packages/tensorflow/contrib/framework/python/ops/arg_scope.py", line 182, in func_with_args
    return func(*args, **current_args)
  File "/opt/shakey/.local/lib/python3.6/site-packages/tensorflow/contrib/layers/python/layers/layers.py", line 1058, in convolution
    outputs = layer.apply(inputs)
  File "/opt/shakey/.local/lib/python3.6/site-packages/tensorflow/python/keras/engine/base_layer.py", line 1227, in apply
    return self.__call__(inputs, *args, **kwargs)
  File "/opt/shakey/.local/lib/python3.6/site-packages/tensorflow/python/layers/base.py", line 530, in __call__
    outputs = super(Layer, self).__call__(inputs, *args, **kwargs)
  File "/opt/shakey/.local/lib/python3.6/site-packages/tensorflow/python/keras/engine/base_layer.py", line 554, in __call__
    outputs = self.call(inputs, *args, **kwargs)
  File "/opt/shakey/.local/lib/python3.6/site-packages/tensorflow/python/keras/layers/convolutional.py", line 194, in call
    outputs = self._convolution_op(inputs, self.kernel)
  File "/opt/shakey/.local/lib/python3.6/site-packages/tensorflow/python/ops/nn_ops.py", line 966, in __call__
    return self.conv_op(inp, filter)
  File "/opt/shakey/.local/lib/python3.6/site-packages/tensorflow/python/ops/nn_ops.py", line 591, in __call__
    return self.call(inp, filter)
  File "/opt/shakey/.local/lib/python3.6/site-packages/tensorflow/python/ops/nn_ops.py", line 208, in __call__
    name=self.name)
  File "/opt/shakey/.local/lib/python3.6/site-packages/tensorflow/python/ops/gen_nn_ops.py", line 1026, in conv2d
    data_format=data_format, dilations=dilations, name=name)
  File "/opt/shakey/.local/lib/python3.6/site-packages/tensorflow/python/framework/op_def_library.py", line 788, in _apply_op_helper
    op_def=op_def)
  File "/opt/shakey/.local/lib/python3.6/site-packages/tensorflow/python/util/deprecation.py", line 507, in new_func
    return func(*args, **kwargs)
  File "/opt/shakey/.local/lib/python3.6/site-packages/tensorflow/python/framework/ops.py", line 3300, in create_op
    op_def=op_def)
  File "/opt/shakey/.local/lib/python3.6/site-packages/tensorflow/python/framework/ops.py", line 1801, in __init__
    self._traceback = tf_stack.extract_stack()

UnknownError (see above for traceback): Failed to get convolution algorithm. This is probably because cuDNN failed to initialize, so try looking to see if a warning log message was printed above.
  [[node resnet_v1_50/conv1/Conv2D (defined at /opt/shakey/deep-learning/EAST/nets/resnet_utils.py:122) ]]
  [[node feature_fusion/concat_3 (defined at /opt/shakey/deep-learning/EAST/model.py:80) ]]

pankSM commented 4 years ago

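Hello, how did you get training to use the GPU? I followed the tutorial, but GPU memory usage was only about 60 MB. When you have time, could you please explain? Thank you.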

unyxs281 commented 3 years ago

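Hello, I ran into the same problem. Training works on a single GPU, but specifying multiple GPUs as described in the tutorial produces the same error. Could you please help? Thank you.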

unyxs281 commented 3 years ago

This problem is caused by insufficient GPU memory.
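The log in the original report is consistent with this explanation: it lists totalMemory: 10.76GiB but freeMemory: 2.03GiB, and TensorFlow created its device with only 1776 MB. Below is a minimal sketch for pulling those two numbers out of a TensorFlow 1.x startup log to see how much of the GPU was actually free. The function name free_gpu_fraction and the regular expression are illustrative assumptions, not part of EAST or TensorFlow, and the pattern only covers the GiB-formatted line shown in this thread.

```python
import re

# Matches the device summary line that appears in the log above, e.g.
# "totalMemory: 10.76GiB freeMemory: 2.03GiB".
# Assumption: both values are reported in GiB, as in this thread.
_MEM_RE = re.compile(r"totalMemory:\s*([\d.]+)GiB\s+freeMemory:\s*([\d.]+)GiB")


def free_gpu_fraction(log_text):
    """Return (total_gib, free_gib, free_fraction), or None if no match."""
    match = _MEM_RE.search(log_text)
    if match is None:
        return None
    total = float(match.group(1))
    free = float(match.group(2))
    return total, free, free / total


# The line from the original report.
line = "pciBusID: 0000:83:00.0 totalMemory: 10.76GiB freeMemory: 2.03GiB"
total, free, fraction = free_gpu_fraction(line)
print(f"{free:.2f} of {total:.2f} GiB free ({fraction:.0%})")
```

A low free fraction before the model is even loaded usually means something else is already holding GPU memory, which may be worth checking (for example with nvidia-smi) before changing batch sizes.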

mohammedayub44 commented 3 years ago

@argman I get the same error. Training started fine on CPU, but it was very slow, and running it on a single GPU fails with the same stack trace. Is this really caused by GPU memory, or by something else? I tried --num_readers=1 and --gpu_batch_size=1 on a g4dn (EC2) machine, which has 16 GB of memory.

Any help appreciated!

mohammedayub44 commented 3 years ago

It looks like this was a cuDNN version issue, which was showing up in the log:

2021-04-15 08:49:03.630044: E tensorflow/stream_executor/cuda/cuda_dnn.cc:319] Loaded runtime CuDNN library: 7.5.1 but source was compiled with: 7.6.0.  CuDNN library major and minor version needs to match or have higher minor version in case of CuDNN 7.0 or later version. If using a binary install, upgrade your CuDNN library.  If building from sources, make sure the library loaded at runtime is compatible with the version specified during compile configuration.
2021-04-15 08:49:03.632954: E tensorflow/stream_executor/cuda/cuda_dnn.cc:319] Loaded runtime CuDNN library: 7.5.1 but source was compiled with: 7.6.0.  CuDNN library major and minor version needs to match or have higher minor version in case of CuDNN 7.0 or later version. If using a binary install, upgrade your CuDNN library.  If building from sources, make sure the library loaded at runtime is compatible with the version specified during compile configuration.

As the message says, the cuDNN library loaded at runtime was 7.5.1, while TensorFlow had been compiled against 7.6.0. After checking my CUDA version with nvcc --version, I installed the matching cuDNN build with conda, which seemed to fix the issue:

conda install https://anaconda.org/anaconda/cudnn/7.6.0/download/linux-64/cudnn-7.6.0-cuda10.0_0.tar.bz2

I think the recommended way is to make the OS-level changes using NVIDIA's packages, but I did not want to touch OS packages. After the conda install, the cuDNN runtime library is picked up from the conda environment first, so it worked.
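For anyone checking their own logs for the same mismatch, here is a small sketch that extracts the two versions from the message quoted above and applies the compatibility rule as that message states it (same major version, and for cuDNN 7.0 or later a runtime minor version at least as high as the compiled one). The function name cudnn_versions_from_log is a hypothetical helper, not a TensorFlow API, and the rule is only a reading of the log text.

```python
import re

# Pattern for the error line quoted in this thread.
_CUDNN_RE = re.compile(
    r"Loaded runtime CuDNN library: (\d+)\.(\d+)\.(\d+) "
    r"but source was compiled with: (\d+)\.(\d+)\.(\d+)"
)


def cudnn_versions_from_log(log_text):
    """Return (runtime, compiled, compatible), or None if no such line exists.

    runtime and compiled are (major, minor, patch) tuples.
    """
    match = _CUDNN_RE.search(log_text)
    if match is None:
        return None
    numbers = tuple(int(group) for group in match.groups())
    runtime, compiled = numbers[:3], numbers[3:]
    if compiled[0] >= 7:
        # cuDNN 7.0 or later: major must match, runtime minor may be higher.
        compatible = runtime[0] == compiled[0] and runtime[1] >= compiled[1]
    else:
        # Earlier versions: major and minor must match exactly.
        compatible = runtime[:2] == compiled[:2]
    return runtime, compiled, compatible


log_line = (
    "E tensorflow/stream_executor/cuda/cuda_dnn.cc:319] Loaded runtime CuDNN "
    "library: 7.5.1 but source was compiled with: 7.6.0."
)
print(cudnn_versions_from_log(log_line))
```

With the line from this thread, the runtime minor version (5) is lower than the compiled one (6), so the check reports the pair as incompatible, which matches the error that was raised.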

HimanchalChandra commented 3 years ago

You can use either of two methods to avoid this situation:

  1. Allow growth (more flexible):

       config = tf.ConfigProto()
       config.gpu_options.allow_growth = True
       session = tf.Session(config=config, ...)

  2. Allocate fixed memory:

       config = tf.ConfigProto()
       config.gpu_options.per_process_gpu_memory_fraction = 0.4
       session = tf.Session(config=config, ...)

I hope it helps!