CUDNN_STATUS_EXECUTION_FAILED

Johndirr commented 4 years ago

I'm running text detection and recognition on every frame of a video to extract hardcoded subtitles (on Windows). This works quite well, although it's a bit slow. But after letting my program run for some minutes (the time differs), I always get this error: CUDNN_STATUS_EXECUTION_FAILED. I don't think it's a bug in keras-ocr, but I don't have a clue how to resolve this error or where to ask. From what I found by searching the internet, it could be a driver issue... Any ideas?

Here is the full log:

2020-04-09 17:08:52.593789: E tensorflow/stream_executor/dnn.cc:588] CUDNN_STATUS_EXECUTION_FAILED in tensorflow/stream_executor/cuda/cuda_dnn.cc(1796): 'cudnnRNNForwardTraining( cudnn.handle(), rnn_desc.handle(), model_dims.max_seq_length, input_desc.handles(), input_data.opaque(), input_h_desc.handle(), input_h_data.opaque(), input_c_desc.handle(), input_c_data.opaque(), rnn_desc.params_handle(), params.opaque(), output_desc.handles(), output_data->opaque(), output_h_desc.handle(), output_h_data->opaque(), output_c_desc.handle(), output_c_data->opaque(), workspace.opaque(), workspace.size(), reserve_space.opaque(), reserve_space.size())'
2020-04-09 17:08:52.605393: W tensorflow/core/framework/op_kernel.cc:1622] OP_REQUIRES failed at cudnn_rnn_ops.cc:1498 : Internal: Failed to call ThenRnnForward with model config: [rnn_mode, rnn_input_mode, rnn_direction_mode]: 2, 0, 0 , [num_layers, input_size, num_units, dir_count, max_seq_length, batch_size, cell_num_units]: [1, 128, 128, 1, 50, 4, 128]
2020-04-09 17:08:52.612678: W tensorflow/core/common_runtime/base_collective_executor.cc:216] BaseCollectiveExecutor::StartAbort Internal: Failed to call ThenRnnForward with model config: [rnn_mode, rnn_input_mode, rnn_direction_mode]: 2, 0, 0 , [num_layers, input_size, num_units, dir_count, max_seq_length, batch_size, cell_num_units]: [1, 128, 128, 1, 50, 4, 128]
  [[{{node CudnnRNN}}]]
2020-04-09 17:08:52.621696: W tensorflow/core/common_runtime/base_collective_executor.cc:216] BaseCollectiveExecutor::StartAbort Cancelled: [Derived]RecvAsync is cancelled.
  [[{{node decode/PadV2/paddings/_78}}]]
  [[decode/Shape_1/_76]]
2020-04-09 17:08:52.624991: W tensorflow/core/common_runtime/base_collective_executor.cc:216] BaseCollectiveExecutor::StartAbort Cancelled: [Derived]RecvAsync is cancelled.
  [[{{node decode/PadV2/paddings/_78}}]]
Traceback (most recent call last):
  File "VideoSubDetect.py", line 199, in
    recognizedtext = recognizer.recognize_from_boxes([frame], [sorted_box_group])
  File "C:\Users\RetroHelix\Envs\test\lib\site-packages\keras_ocr\recognition.py", line 439, in recognize_from_boxes
    for row in self.prediction_model.predict(X, *kwargs)
  File "C:\Users\RetroHelix\Envs\test\lib\site-packages\tensorflow_core\python\keras\engine\training.py", line 909, in predict
    use_multiprocessing=use_multiprocessing)
  File "C:\Users\RetroHelix\Envs\test\lib\site-packages\tensorflow_core\python\keras\engine\training_arrays.py", line 722, in predict
    callbacks=callbacks)
  File "C:\Users\RetroHelix\Envs\test\lib\site-packages\tensorflow_core\python\keras\engine\training_arrays.py", line 393, in model_iteration
    batch_outs = f(ins_batch)
  File "C:\Users\RetroHelix\Envs\test\lib\site-packages\tensorflow_core\python\keras\backend.py", line 3740, in call
    outputs = self._graph_fn(converted_inputs)
  File "C:\Users\RetroHelix\Envs\test\lib\site-packages\tensorflow_core\python\eager\function.py", line 1081, in call
    return self._call_impl(args, kwargs)
  File "C:\Users\RetroHelix\Envs\test\lib\site-packages\tensorflow_core\python\eager\function.py", line 1121, in _call_impl
    return self._call_flat(args, self.captured_inputs, cancellation_manager)
  File "C:\Users\RetroHelix\Envs\test\lib\site-packages\tensorflow_core\python\eager\function.py", line 1224, in _call_flat
    ctx, args, cancellation_manager=cancellation_manager)
  File "C:\Users\RetroHelix\Envs\test\lib\site-packages\tensorflow_core\python\eager\function.py", line 511, in call
    ctx=ctx)
  File "C:\Users\RetroHelix\Envs\test\lib\site-packages\tensorflow_core\python\eager\execute.py", line 67, in quick_execute
    six.raise_from(core._status_to_exception(e.code, message), None)
  File "", line 3, in raise_from
tensorflow.python.framework.errors_impl.CancelledError: [Derived]RecvAsync is cancelled.
  [[{{node decode/PadV2/paddings/_78}}]] [Op:__inference_keras_scratch_graph_15223]

Johndirr commented 4 years ago

EDIT3: I upgraded CUDA to 10.1 and tensorflow to the most recent version. I get this error now (after around 40 minutes):

2020-04-13 20:00:27.823843: E tensorflow/stream_executor/cuda/cuda_event.cc:29] Error polling for event status: failed to query event: CUDA_ERROR_LAUNCH_FAILED: unspecified launch failure
2020-04-13 20:00:27.828449: F tensorflow/core/common_runtime/gpu/gpu_event_mgr.cc:273] Unexpected Event status: 1

francotheengineer commented 4 years ago

Can you try running this in docker? You will only need the nvidia driver installed. What platform are you on? https://www.tensorflow.org/install/docker#gpu_support

Johndirr commented 4 years ago

After also updating cuDNN everything seems to run fine :) But thanks for the help.
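
For anyone landing here with the same error: since updating cuDNN alongside CUDA and TensorFlow is what resolved it in this thread, it may help to check which CUDA and cuDNN versions your TensorFlow build expects. The snippet below is only a rough sketch, not something from the original poster. The helper name `describe_gpu_stack` is made up for illustration, and it assumes a TensorFlow release recent enough to provide `tf.sysconfig.get_build_info()` (on older releases those fields are simply left as `None`). Note that it reports the versions TensorFlow was built against, which are not necessarily the versions installed on your machine.

```python
def describe_gpu_stack():
    """Collect the CUDA/cuDNN versions TensorFlow was built against, if available.

    Hypothetical helper for illustration; not part of keras-ocr or TensorFlow.
    """
    info = {"tensorflow": None, "cuda_build": None, "cudnn_build": None, "gpus": []}
    try:
        import tensorflow as tf
    except ImportError:
        # TensorFlow is not installed (or failed to load) in this environment.
        return info

    info["tensorflow"] = tf.__version__

    # Assumption: get_build_info() only exists in newer TensorFlow releases,
    # so look it up defensively instead of calling it directly.
    get_build_info = getattr(tf.sysconfig, "get_build_info", None)
    if get_build_info is not None:
        build = get_build_info()
        # CPU-only builds do not carry these keys, hence .get().
        info["cuda_build"] = build.get("cuda_version")
        info["cudnn_build"] = build.get("cudnn_version")

    # List the GPUs TensorFlow can actually see at runtime.
    list_devices = getattr(tf.config, "list_physical_devices", None)
    if list_devices is not None:
        info["gpus"] = [device.name for device in list_devices("GPU")]
    return info


if __name__ == "__main__":
    for key, value in describe_gpu_stack().items():
        print(f"{key}: {value}")
```

If the reported build versions differ from the CUDA toolkit and cuDNN libraries actually installed, that mismatch is a reasonable first thing to rule out before suspecting keras-ocr itself.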

faustomorales / keras-ocr

CUDNN_STATUS_EXECUTION_FAILED #62