NVIDIA / OpenSeq2Seq

Toolkit for efficient experimentation with Speech Recognition, Text2Speech and NLP
https://nvidia.github.io/OpenSeq2Seq
Apache License 2.0
1.54k stars 372 forks

cuDNN failed to initialize #506

Closed jax79sg closed 4 years ago

jax79sg commented 4 years ago

Hi,

I am trying to run Jasper training on LibriSpeech but encountered the following error.

(0) Unknown: Failed to get convolution algorithm. This is probably because cuDNN failed to initialize, so try looking to see if a warning log message was printed above.
     [[node ForwardPass/w2l_encoder/conv11/conv1d (defined at /home/workspace/OpenSeq2Seq/open_seq2seq/parts/cnns/conv_blocks.py:205) ]]
     [[Loss_Optimization/Select_266/_9961]]
  (1) Unknown: Failed to get convolution algorithm. This is probably because cuDNN failed to initialize, so try looking to see if a warning log message was printed above.
     [[node ForwardPass/w2l_encoder/conv11/conv1d (defined at /home/workspace/OpenSeq2Seq/open_seq2seq/parts/cnns/conv_blocks.py:205) ]]
0 successful operations.
0 derived errors ignored.

The run hangs at the end of the following stack trace, with GPU memory still locked but no activity. I haven't been able to resolve it by changing the CUDA and cuDNN versions. Is a specific version of TF, CUDA and cuDNN required, or am I missing something?

2019-10-05 15:56:00.254385: I tensorflow/stream_executor/platform/default/dso_loader.cc:42] Successfully opened dynamic library libcublas.so.10.0
2019-10-05 15:56:02.937668: I tensorflow/stream_executor/platform/default/dso_loader.cc:42] Successfully opened dynamic library libcudnn.so.7
2019-10-05 15:56:04.386492: E tensorflow/stream_executor/cuda/cuda_dnn.cc:329] Could not create cudnn handle: CUDNN_STATUS_INTERNAL_ERROR
2019-10-05 15:56:04.390917: E tensorflow/stream_executor/cuda/cuda_dnn.cc:329] Could not create cudnn handle: CUDNN_STATUS_INTERNAL_ERROR
Traceback (most recent call last):
  File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/client/session.py", line 1356, in _do_call
    return fn(*args)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/client/session.py", line 1341, in _run_fn
    options, feed_dict, fetch_list, target_list, run_metadata)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/client/session.py", line 1429, in _call_tf_sessionrun
    run_metadata)
tensorflow.python.framework.errors_impl.UnknownError: 2 root error(s) found.
  (0) Unknown: Failed to get convolution algorithm. This is probably because cuDNN failed to initialize, so try looking to see if a warning log message was printed above.
     [[{{node ForwardPass/w2l_encoder/conv11/conv1d}}]]
     [[Loss_Optimization/Select_266/_9961]]
  (1) Unknown: Failed to get convolution algorithm. This is probably because cuDNN failed to initialize, so try looking to see if a warning log message was printed above.
     [[{{node ForwardPass/w2l_encoder/conv11/conv1d}}]]
0 successful operations.
0 derived errors ignored.

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "run.py", line 104, in <module>
    main()
  File "run.py", line 90, in main
    model, eval_model=None, debug_port=args.debug_port, custom_hooks=hooks)
  File "/home/workspace/OpenSeq2Seq/open_seq2seq/utils/funcs.py", line 184, in train
    fetches_vals = sess.run(fetches, feed_dict)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/training/monitored_session.py", line 754, in run
    run_metadata=run_metadata)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/training/monitored_session.py", line 1252, in run
    run_metadata=run_metadata)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/training/monitored_session.py", line 1353, in run
    raise six.reraise(*original_exc_info)
  File "/usr/lib/python3/dist-packages/six.py", line 693, in reraise
    raise value
  File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/training/monitored_session.py", line 1338, in run
    return self._sess.run(*args, **kwargs)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/training/monitored_session.py", line 1411, in run
    run_metadata=run_metadata)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/training/monitored_session.py", line 1169, in run
    return self._sess.run(*args, **kwargs)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/client/session.py", line 950, in run
    run_metadata_ptr)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/client/session.py", line 1173, in _run
    feed_dict_tensor, options, run_metadata)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/client/session.py", line 1350, in _do_run
    run_metadata)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/client/session.py", line 1370, in _do_call
    raise type(e)(node_def, op, message)
tensorflow.python.framework.errors_impl.UnknownError: 2 root error(s) found.
  (0) Unknown: Failed to get convolution algorithm. This is probably because cuDNN failed to initialize, so try looking to see if a warning log message was printed above.
     [[node ForwardPass/w2l_encoder/conv11/conv1d (defined at /home/workspace/OpenSeq2Seq/open_seq2seq/parts/cnns/conv_blocks.py:205) ]]
     [[Loss_Optimization/Select_266/_9961]]
  (1) Unknown: Failed to get convolution algorithm. This is probably because cuDNN failed to initialize, so try looking to see if a warning log message was printed above.
     [[node ForwardPass/w2l_encoder/conv11/conv1d (defined at /home/workspace/OpenSeq2Seq/open_seq2seq/parts/cnns/conv_blocks.py:205) ]]
0 successful operations.
0 derived errors ignored.

Original stack trace for 'ForwardPass/w2l_encoder/conv11/conv1d':
  File "run.py", line 104, in <module>
    main()
  File "run.py", line 79, in main
    args, base_config, config_module, base_model, hvd, checkpoint)
  File "/home/workspace/OpenSeq2Seq/open_seq2seq/utils/utils.py", line 874, in create_model
    model.compile()
  File "/home/workspace/OpenSeq2Seq/open_seq2seq/models/model.py", line 415, in compile
    gpu_id=gpu_cnt
  File "/home/workspace/OpenSeq2Seq/open_seq2seq/models/speech2text.py", line 173, in _build_forward_pass_graph
    encoder_output = self.encoder.encode(input_dict=encoder_input)
  File "/home/workspace/OpenSeq2Seq/open_seq2seq/encoders/encoder.py", line 138, in encode
    return self._encode(self._cast_types(input_dict))
  File "/home/workspace/OpenSeq2Seq/open_seq2seq/encoders/tdnn_encoder.py", line 252, in _encode
    **normalization_params
  File "/home/workspace/OpenSeq2Seq/open_seq2seq/parts/cnns/conv_blocks.py", line 205, in conv_bn_actv
    data_format=data_format,
  File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/util/deprecation.py", line 324, in new_func
    return func(*args, **kwargs)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/layers/convolutional.py", line 218, in conv1d
    return layer.apply(inputs)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/keras/engine/base_layer.py", line 1479, in apply
    return self.__call__(inputs, *args, **kwargs)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/layers/base.py", line 537, in __call__
    outputs = super(Layer, self).__call__(inputs, *args, **kwargs)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/keras/engine/base_layer.py", line 634, in __call__
    outputs = call_fn(inputs, *args, **kwargs)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/autograph/impl/api.py", line 146, in wrapper
    ), args, kwargs)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/autograph/impl/api.py", line 446, in converted_call
    return _call_unconverted(f, args, kwargs)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/autograph/impl/api.py", line 253, in _call_unconverted
    return f(*args, **kwargs)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/keras/layers/convolutional.py", line 373, in call
    return super(Conv1D, self).call(inputs)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/keras/layers/convolutional.py", line 196, in call
    outputs = self._convolution_op(inputs, self.kernel)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/ops/nn_ops.py", line 1079, in __call__
    return self.conv_op(inp, filter)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/ops/nn_ops.py", line 635, in __call__
    return self.call(inp, filter)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/ops/nn_ops.py", line 234, in __call__
    name=self.name)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/ops/nn_ops.py", line 223, in _conv1d
    name=name)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/util/deprecation.py", line 574, in new_func
    return func(*args, **kwargs)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/util/deprecation.py", line 574, in new_func
    return func(*args, **kwargs)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/ops/nn_ops.py", line 1624, in conv1d
    name=name)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/ops/gen_nn_ops.py", line 1071, in conv2d
    data_format=data_format, dilations=dilations, name=name)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/framework/op_def_library.py", line 788, in _apply_op_helper
    op_def=op_def)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/util/deprecation.py", line 507, in new_func
    return func(*args, **kwargs)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/framework/ops.py", line 3616, in create_op
    op_def=op_def)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/framework/ops.py", line 2005, in __init__
    self._traceback = tf_stack.extract_stack()

jax79sg commented 4 years ago

Hi, I managed to figure out what the issue could be and applied a workaround in the code.

Potential problem: RTX cards tend to hit this issue. It lies in TensorFlow rather than in the application code. As of this comment, it is not fixed with the latest TF 1.14, CUDA 10.1 and cuDNN 7.6.

Workaround: make sure the following two GPU options are set in the config of every TF session.

gpu_options.allow_growth = True
gpu_options.per_process_gpu_memory_fraction = x  # x is a fraction (0 < x <= 1); the right value needs to be found by experiment on each card.
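
For anyone applying the two options above outside of OpenSeq2Seq, here is a minimal sketch of how they might be set on a TF 1.x session config. The helper names `limit_gpu_memory` and `make_session` are hypothetical (they are not part of OpenSeq2Seq or TensorFlow), and 0.8 is only the value used in this thread, not a recommended setting.

```python
def limit_gpu_memory(config, fraction=0.8):
    """Set the two GPU options from the workaround on a ConfigProto-like object.

    `config` is assumed to expose `gpu_options`, as tf.compat.v1.ConfigProto does.
    """
    if not 0.0 < fraction <= 1.0:
        raise ValueError("fraction must be in (0, 1], got %r" % (fraction,))
    # Allocate GPU memory on demand instead of grabbing it all at start-up.
    config.gpu_options.allow_growth = True
    # Cap the share of each visible GPU's memory this process may use.
    config.gpu_options.per_process_gpu_memory_fraction = fraction
    return config


def make_session(fraction=0.8):
    """Hypothetical helper: build a TF 1.x session with the options applied."""
    import tensorflow as tf  # imported lazily so the helper above has no TF dependency

    config = limit_gpu_memory(tf.compat.v1.ConfigProto(), fraction)
    return tf.compat.v1.Session(config=config)
```

The fraction probably has to be tuned per card, as noted above; if the cuDNN handle still cannot be created, trying a lower value is a reasonable next step.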

For the Jasper code, I've implemented a stopgap script to deal with it:

sed -i.bak '579i\    tf_config.gpu_options.per_process_gpu_memory_fraction = 0.8\' /home/workspace/OpenSeq2Seq/open_seq2seq/models/model.py
sed -i.bak '35i\  sess_config.gpu_options.per_process_gpu_memory_fraction = 0.8\' /home/workspace/OpenSeq2Seq/open_seq2seq/utils/funcs.py
sed -i.bak '229i\  sess_config.gpu_options.per_process_gpu_memory_fraction = 0.8\' /home/workspace/OpenSeq2Seq/open_seq2seq/utils/funcs.py
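
Note that the sed commands above depend on exact line numbers in one particular checkout, and the second command on funcs.py overwrites the .bak file written by the first. As a rough alternative, the same kind of insertion could be sketched in Python. `insert_line` is a hypothetical helper written for this comment, not something shipped with OpenSeq2Seq, and the line numbers would still need to be checked against your own copy of the files.

```python
import shutil
from pathlib import Path


def insert_line(path, lineno, text, backup_suffix=".bak"):
    """Insert `text` before the 1-indexed line `lineno`, like sed's `<lineno>i\\`.

    Returns True if the file was changed, False if the line was already there.
    """
    path = Path(path)
    lines = path.read_text().splitlines(keepends=True)
    if not 1 <= lineno <= len(lines):
        raise ValueError("line %d is outside the file (%d lines)" % (lineno, len(lines)))

    new_line = text + "\n"
    if lines[lineno - 1] == new_line:
        return False  # already patched at this position, do nothing

    # Keep only the first backup so the pristine file is never overwritten.
    backup = path.with_name(path.name + backup_suffix)
    if not backup.exists():
        shutil.copy2(path, backup)

    lines.insert(lineno - 1, new_line)
    path.write_text("".join(lines))
    return True
```

A call mirroring the first sed command would look like `insert_line(".../open_seq2seq/models/model.py", 579, "    tf_config.gpu_options.per_process_gpu_memory_fraction = 0.8")`. Running it twice leaves the file unchanged the second time.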