Arturus / kaggle-web-traffic

1st place solution
MIT License

'CUDNN_STATUS_EXECUTION_FAILED' occurs #35

Open zhangmytf opened 4 years ago

zhangmytf commented 4 years ago

Hi, when I run the code on my server (4x V100, CUDA 9.0, cuDNN 7.0), I get the error below. Could you please help me? Which versions of CUDA and cuDNN do you use?

```
/home/admin/algomodule/test/kaggle-web-traffic# python trainer.py --name s32 --hparam_set=s32 --n_models=3 --name s32 --no_eval --no_forward_split --asgd_decay=0.99 --max_steps=11500 --save_from_step=10500
WARNING:tensorflow:From /home/admin/algomodule/test/kaggle-web-traffic/model.py:144: calling reduce_sum (from tensorflow.python.ops.math_ops) with keep_dims is deprecated and will be removed in a future version.
Instructions for updating:
keep_dims is deprecated, use keepdims instead
2019-10-02 06:00:37.510047: I tensorflow/core/platform/cpu_feature_guard.cc:141] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX2 AVX512F FMA
2019-10-02 06:00:37.909980: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:897] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2019-10-02 06:00:37.911006: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1405] Found device 0 with properties:
name: Tesla V100-SXM2-16GB major: 7 minor: 0 memoryClockRate(GHz): 1.53
pciBusID: 0000:00:08.0
totalMemory: 15.75GiB freeMemory: 15.44GiB
2019-10-02 06:00:38.047527: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:897] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2019-10-02 06:00:38.048568: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1405] Found device 1 with properties:
name: Tesla V100-SXM2-16GB major: 7 minor: 0 memoryClockRate(GHz): 1.53
pciBusID: 0000:00:09.0
totalMemory: 15.75GiB freeMemory: 15.44GiB
2019-10-02 06:00:38.179680: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:897] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2019-10-02 06:00:38.180730: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1405] Found device 2 with properties:
name: Tesla V100-SXM2-16GB major: 7 minor: 0 memoryClockRate(GHz): 1.53
pciBusID: 0000:00:0a.0
totalMemory: 15.75GiB freeMemory: 15.44GiB
2019-10-02 06:00:38.319747: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:897] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2019-10-02 06:00:38.320794: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1405] Found device 3 with properties:
name: Tesla V100-SXM2-16GB major: 7 minor: 0 memoryClockRate(GHz): 1.53
pciBusID: 0000:00:0b.0
totalMemory: 15.75GiB freeMemory: 15.44GiB
2019-10-02 06:00:38.320867: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1484] Adding visible gpu devices: 0, 1, 2, 3
2019-10-02 06:00:40.205535: I tensorflow/core/common_runtime/gpu/gpu_device.cc:965] Device interconnect StreamExecutor with strength 1 edge matrix:
2019-10-02 06:00:40.205600: I tensorflow/core/common_runtime/gpu/gpu_device.cc:971] 0 1 2 3
2019-10-02 06:00:40.205610: I tensorflow/core/common_runtime/gpu/gpu_device.cc:984] 0: N Y Y Y
2019-10-02 06:00:40.205616: I tensorflow/core/common_runtime/gpu/gpu_device.cc:984] 1: Y N Y Y
2019-10-02 06:00:40.205631: I tensorflow/core/common_runtime/gpu/gpu_device.cc:984] 2: Y Y N Y
2019-10-02 06:00:40.205641: I tensorflow/core/common_runtime/gpu/gpu_device.cc:984] 3: Y Y Y N
2019-10-02 06:00:40.205992: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1097] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 14941 MB memory) -> physical GPU (device: 0, name: Tesla V100-SXM2-16GB, pci bus id: 0000:00:08.0, compute capability: 7.0)
2019-10-02 06:00:40.508989: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1097] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:1 with 14941 MB memory) -> physical GPU (device: 1, name: Tesla V100-SXM2-16GB, pci bus id: 0000:00:09.0, compute capability: 7.0)
2019-10-02 06:00:40.811745: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1097] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:2 with 14941 MB memory) -> physical GPU (device: 2, name: Tesla V100-SXM2-16GB, pci bus id: 0000:00:0a.0, compute capability: 7.0)
2019-10-02 06:00:41.114312: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1097] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:3 with 14941 MB memory) -> physical GPU (device: 3, name: Tesla V100-SXM2-16GB, pci bus id: 0000:00:0b.0, compute capability: 7.0)
1: 0%| | 0/566 [00:00<?, ?it/s]
2019-10-02 06:00:47.758076: W tensorflow/core/framework/op_kernel.cc:1275] OP_REQUIRES failed at cudnn_rnn_ops.cc:1214 : Unknown: CUDNN_STATUS_EXECUTION_FAILED in tensorflow/stream_executor/cuda/cuda_dnn.cc(918): 'cudnnSetDropoutDescriptor( handle.get(), cudnn.handle(), dropout, state_memory.opaque(), state_memory.size(), seed)'
2019-10-02 06:00:47.770054: W tensorflow/core/framework/op_kernel.cc:1275] OP_REQUIRES failed at cudnn_rnn_ops.cc:1214 : Unknown: CUDNN_STATUS_EXECUTION_FAILED in tensorflow/stream_executor/cuda/cuda_dnn.cc(918): 'cudnnSetDropoutDescriptor( handle.get(), cudnn.handle(), dropout, state_memory.opaque(), state_memory.size(), seed)'
2019-10-02 06:00:47.782300: W tensorflow/core/framework/op_kernel.cc:1275] OP_REQUIRES failed at cudnn_rnn_ops.cc:1214 : Unknown: CUDNN_STATUS_EXECUTION_FAILED in tensorflow/stream_executor/cuda/cuda_dnn.cc(918): 'cudnnSetDropoutDescriptor( handle.get(), cudnn.handle(), dropout, state_memory.opaque(), state_memory.size(), seed)'
Traceback (most recent call last):
  File "/opt/conda/envs/pytorch-py3.6/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1278, in _do_call
    return fn(*args)
  File "/opt/conda/envs/pytorch-py3.6/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1263, in _run_fn
    options, feed_dict, fetch_list, target_list, run_metadata)
  File "/opt/conda/envs/pytorch-py3.6/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1350, in _call_tf_sessionrun
    run_metadata)
tensorflow.python.framework.errors_impl.UnknownError: CUDNN_STATUS_EXECUTION_FAILED in tensorflow/stream_executor/cuda/cuda_dnn.cc(918): 'cudnnSetDropoutDescriptor( handle.get(), cudnn.handle(), dropout, state_memory.opaque(), state_memory.size(), seed)'
[[Node: m_0/cudnn_gru/CudnnRNN = CudnnRNN[T=DT_FLOAT, direction="unidirectional", dropout=0.0304904226, input_mode="linear_input", is_training=true, rnn_mode="gru", seed=5, seed2=5, _device="/job:localhost/replica:0/task:0/device:GPU:0"](m_0/transpose, m_0/cudnn_gru/zeros, m_0/cudnn_gru/Const, m_0/cudnn_gru/opaque_kernel/read)]]
[[Node: m_2/weighted_loss/assert_broadcastable/AssertGuard/Assert/Switch_1/_165 = _Recv[client_terminated=false, recv_device="/job:localhost/replica:0/task:0/device:CPU:0", send_device="/job:localhost/replica:0/task:0/device:GPU:0", send_device_incarnation=1, tensor_name="edge_3276_m_2/weighted_loss/assert_broadcastable/AssertGuard/Assert/Switch_1", tensor_type=DT_INT32, _device="/job:localhost/replica:0/task:0/device:CPU:0"]()]]

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "trainer.py", line 786, in train(**param_dict)
  File "trainer.py", line 599, in train
    step = trainer.train_step(sess, epoch)
  File "trainer.py", line 251, in train_step
    results = self._metric_step(Stage.TRAIN, ops, sess, epoch, summary_every=20)
  File "trainer.py", line 235, in _metric_step
    results = sess.run(ops)
  File "/opt/conda/envs/pytorch-py3.6/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 877, in run
    run_metadata_ptr)
  File "/opt/conda/envs/pytorch-py3.6/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1100, in _run
    feed_dict_tensor, options, run_metadata)
  File "/opt/conda/envs/pytorch-py3.6/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1272, in _do_run
    run_metadata)
  File "/opt/conda/envs/pytorch-py3.6/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1291, in _do_call
    raise type(e)(node_def, op, message)
tensorflow.python.framework.errors_impl.UnknownError: CUDNN_STATUS_EXECUTION_FAILED in tensorflow/stream_executor/cuda/cuda_dnn.cc(918): 'cudnnSetDropoutDescriptor( handle.get(), cudnn.handle(), dropout, state_memory.opaque(), state_memory.size(), seed)'
[[Node: m_0/cudnn_gru/CudnnRNN = CudnnRNN[T=DT_FLOAT, direction="unidirectional", dropout=0.0304904226, input_mode="linear_input", is_training=true, rnn_mode="gru", seed=5, seed2=5, _device="/job:localhost/replica:0/task:0/device:GPU:0"](m_0/transpose, m_0/cudnn_gru/zeros, m_0/cudnn_gru/Const, m_0/cudnn_gru/opaque_kernel/read)]]
[[Node: m_2/weighted_loss/assert_broadcastable/AssertGuard/Assert/Switch_1/_165 = _Recv[client_terminated=false, recv_device="/job:localhost/replica:0/task:0/device:CPU:0", send_device="/job:localhost/replica:0/task:0/device:GPU:0", send_device_incarnation=1, tensor_name="edge_3276_m_2/weighted_loss/assert_broadcastable/AssertGuard/Assert/Switch_1", tensor_type=DT_INT32, _device="/job:localhost/replica:0/task:0/device:CPU:0"]()]]

Caused by op 'm_0/cudnn_gru/CudnnRNN', defined at:
  File "trainer.py", line 786, in train(param_dict)
  File "trainer.py", line 520, in train
    all_models.append(create_model(scope, i, prefix=prefix, seed=seed + i))
  File "trainer.py", line 474, in create_model
    train_model = Model(pipe, hparams, is_train=True, graph_prefix=prefix, asgd_decay=asgd_decay, seed=seed)
  File "/home/admin/algomodule/test/kaggle-web-traffic/model.py", line 342, in init
    transpose_output=False)
  File "/home/admin/algomodule/test/kaggle-web-traffic/model.py", line 65, in make_encoder
    rnn_out, (rnn_state,) = cuda_model(inputs=rnn_time_input)
  File "/opt/conda/envs/pytorch-py3.6/lib/python3.6/site-packages/tensorflow/python/layers/base.py", line 362, in call
    outputs = super(Layer, self).call(inputs, *args, *kwargs)
  File "/opt/conda/envs/pytorch-py3.6/lib/python3.6/site-packages/tensorflow/python/keras/engine/base_layer.py", line 736, in call
    outputs = self.call(inputs, args, kwargs)
  File "/opt/conda/envs/pytorch-py3.6/lib/python3.6/site-packages/tensorflow/contrib/cudnn_rnn/python/layers/cudnn_rnn.py", line 412, in call
    training)
  File "/opt/conda/envs/pytorch-py3.6/lib/python3.6/site-packages/tensorflow/contrib/cudnn_rnn/python/layers/cudnn_rnn.py", line 487, in _forward
    seed=self._seed)
  File "/opt/conda/envs/pytorch-py3.6/lib/python3.6/site-packages/tensorflow/contrib/cudnn_rnn/python/ops/cudnn_rnn_ops.py", line 922, in _cudnn_rnn
    outputs, output_h, outputc, = gen_cudnn_rnn_ops.cudnn_rnn(*args)
  File "/opt/conda/envs/pytorch-py3.6/lib/python3.6/site-packages/tensorflow/python/ops/gen_cudnn_rnn_ops.py", line 115, in cudnn_rnn
    is_training=is_training, name=name)
  File "/opt/conda/envs/pytorch-py3.6/lib/python3.6/site-packages/tensorflow/python/framework/op_def_library.py", line 787, in _apply_op_helper
    op_def=op_def)
  File "/opt/conda/envs/pytorch-py3.6/lib/python3.6/site-packages/tensorflow/python/util/deprecation.py", line 454, in new_func
    return func(args, **kwargs)
  File "/opt/conda/envs/pytorch-py3.6/lib/python3.6/site-packages/tensorflow/python/framework/ops.py", line 3155, in create_op
    op_def=op_def)
  File "/opt/conda/envs/pytorch-py3.6/lib/python3.6/site-packages/tensorflow/python/framework/ops.py", line 1717, in init
    self._traceback = tf_stack.extract_stack()

UnknownError (see above for traceback): CUDNN_STATUS_EXECUTION_FAILED in tensorflow/stream_executor/cuda/cuda_dnn.cc(918): 'cudnnSetDropoutDescriptor( handle.get(), cudnn.handle(), dropout, state_memory.opaque(), state_memory.size(), seed)'
[[Node: m_0/cudnn_gru/CudnnRNN = CudnnRNN[T=DT_FLOAT, direction="unidirectional", dropout=0.0304904226, input_mode="linear_input", is_training=true, rnn_mode="gru", seed=5, seed2=5, _device="/job:localhost/replica:0/task:0/device:GPU:0"](m_0/transpose, m_0/cudnn_gru/zeros, m_0/cudnn_gru/Const, m_0/cudnn_gru/opaque_kernel/read)]]
[[Node: m_2/weighted_loss/assert_broadcastable/AssertGuard/Assert/Switch_1/_165 = _Recv[client_terminated=false, recv_device="/job:localhost/replica:0/task:0/device:CPU:0", send_device="/job:localhost/replica:0/task:0/device:GPU:0", send_device_incarnation=1, tensor_name="edge_3276_m_2/weighted_loss/assert_broadcastable/AssertGuard/Assert/Switch_1", tensor_type=DT_INT32, _device="/job:localhost/replica:0/task:0/device:CPU:0"]()]]
```

allen20200111 commented 4 years ago

Did you fix it? I'm using TensorFlow 1.14, CUDA 10.0 and cuDNN 7.6 and have the same problem.
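
Since both reports involve different CUDA/cuDNN versions, it may help to compare exactly which cuDNN call fails in each log. The following is only a rough sketch for pulling that out of an error line like the one above; the helper name `parse_cudnn_error` and the regular expression are assumptions made for illustration, not part of this repository or of TensorFlow, and the pattern is only known to fit the message format shown in this thread.

```python
import re

# Hypothetical helper (not from this repo or TensorFlow): matches messages of the
# form "CUDNN_STATUS_X in path/to/file.cc(LINE): 'cudnnSomeCall( ...".
CUDNN_ERROR = re.compile(
    r"(?P<status>CUDNN_STATUS_[A-Z_]+) in "
    r"(?P<source>\S+?)\((?P<line>\d+)\): "
    r"'(?P<call>\w+)\("
)


def parse_cudnn_error(message):
    """Return status, source file, line and cuDNN call of the first match, or None."""
    match = CUDNN_ERROR.search(message)
    if match is None:
        return None
    info = match.groupdict()
    info["line"] = int(info["line"])  # convert the captured line number to int
    return info


if __name__ == "__main__":
    # Error text copied from the log posted in this issue.
    sample = (
        "UnknownError: CUDNN_STATUS_EXECUTION_FAILED in "
        "tensorflow/stream_executor/cuda/cuda_dnn.cc(918): "
        "'cudnnSetDropoutDescriptor( handle.get(), cudnn.handle(), dropout, "
        "state_memory.opaque(), state_memory.size(), seed)'"
    )
    print(parse_cudnn_error(sample))
```

If the second setup fails in a different cuDNN call than `cudnnSetDropoutDescriptor`, the two problems may not share a cause, so posting the parsed fields alongside the TensorFlow, CUDA and cuDNN versions would make the reports easier to compare.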