warp_ctc error in compute_ctc_loss

LearnedVector commented 5 years ago

Hey all, I am doing distributed training using TensorFlow 1.12 and Horovod 0.15.2 on 4 machines with 16 V100 GPUs, on CUDA 9.0 and cuDNN 7.14. It trains fine, but at a specific iteration it runs into the weird error shown below.

Has anyone seen this specific error? The fact that it happens at the same iteration makes me suspect it has something to do with the data, but to figure out what's wrong with the data I need to decipher what this error message means inside warp_ctc. Any insight would be much appreciated!

Traceback (most recent call last):
  File "/home/ubuntu/deep-speech/tf_train.py", line 494, in <module>
    tf.app.run()
  File "/home/ubuntu/mike.venv/local/lib/python2.7/site-packages/tensorflow/python/platform/app.py", line 125, in run
    _sys.exit(main(argv))
  File "/home/ubuntu/deep-speech/tf_train.py", line 491, in main
    run_training()
  File "/home/ubuntu/deep-speech/tf_train.py", line 405, in run_training
    is_training: True
  File "/home/ubuntu/mike.venv/local/lib/python2.7/site-packages/tensorflow/python/client/session.py", line 929, in run
    run_metadata_ptr)
  File "/home/ubuntu/mike.venv/local/lib/python2.7/site-packages/tensorflow/python/client/session.py", line 1152, in _run
    feed_dict_tensor, options, run_metadata)
  File "/home/ubuntu/mike.venv/local/lib/python2.7/site-packages/tensorflow/python/client/session.py", line 1328, in _do_run
    run_metadata)
  File "/home/ubuntu/mike.venv/local/lib/python2.7/site-packages/tensorflow/python/client/session.py", line 1348, in _do_call
    raise type(e)(node_def, op, message)
tensorflow.python.framework.errors_impl.InternalError: warp_ctc error in compute_ctc_loss: unknown error
         [[node WarpCTC (defined at <string>:58)  = WarpCTC[blank_label=28, _device="/job:localhost/replica:0/task:0/device:GPU:0"](transpose, boolean_mask/GatherV2/_1519, Squeeze_1, Squeeze)]]
         [[{{node gradients/AddN_80/_1853}} = _Recv[client_terminated=false, recv_device="/job:localhost/replica:0/task:0/device:CPU:0", send_device="/job:localhost/replica:0/task:0/device:GPU:0", send_device_incarnation=1, tensor_name="edge_20646_gradients/AddN_80", tensor_type=DT_FLOAT, _device="/job:localhost/replica:0/task:0/device:CPU:0"]()]]

Caused by op u'WarpCTC', defined at:
  File "/home/ubuntu/deep-speech/tf_train.py", line 494, in <module>
    tf.app.run()
  File "/home/ubuntu/mike.venv/local/lib/python2.7/site-packages/tensorflow/python/platform/app.py", line 125, in run
    _sys.exit(main(argv))
  File "/home/ubuntu/deep-speech/tf_train.py", line 491, in main
    run_training()
  File "/home/ubuntu/deep-speech/tf_train.py", line 363, in run_training
    compile_train_op(train_inputs, train_targets, train_seq_len, train_label_lengths, is_training)
  File "/home/ubuntu/deep-speech/tf_train.py", line 299, in compile_train_op
    loss = tf.reduce_mean(warpctc_tensorflow.ctc(tf.cast(logits, tf.float32), targets, label_lengths, seq_len, blank_label=28))
  File "/home/ubuntu/mike.venv/lib/python2.7/site-packages/warpctc_tensorflow-0.1-py2.7-linux-x86_64.egg/warpctc_tensorflow/__init__.py", line 43, in ctc
    input_lengths, blank_label)
  File "<string>", line 58, in warp_ctc
  File "/home/ubuntu/mike.venv/local/lib/python2.7/site-packages/tensorflow/python/framework/op_def_library.py", line 787, in _apply_op_helper
    op_def=op_def)
  File "/home/ubuntu/mike.venv/local/lib/python2.7/site-packages/tensorflow/python/util/deprecation.py", line 488, in new_func
    return func(*args, **kwargs)
  File "/home/ubuntu/mike.venv/local/lib/python2.7/site-packages/tensorflow/python/framework/ops.py", line 3274, in create_op
    op_def=op_def)
  File "/home/ubuntu/mike.venv/local/lib/python2.7/site-packages/tensorflow/python/framework/ops.py", line 1770, in __init__
    self._traceback = tf_stack.extract_stack()

InternalError (see above for traceback): warp_ctc error in compute_ctc_loss: unknown error
         [[node WarpCTC (defined at <string>:58)  = WarpCTC[blank_label=28, _device="/job:localhost/replica:0/task:0/device:GPU:0"](transpose, boolean_mask/GatherV2/_1519, Squeeze_1, Squeeze)]]
         [[{{node gradients/AddN_80/_1853}} = _Recv[client_terminated=false, recv_device="/job:localhost/replica:0/task:0/device:CPU:0", send_device="/job:localhost/replica:0/task:0/device:GPU:0", send_device_incarnation=1, tensor_name="edge_20646_gradients/AddN_80", tensor_type=DT_FLOAT, _device="/job:localhost/replica:0/task:0/device:CPU:0"]()]]
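
Since the failure repeats at the same iteration, one way to test the data hypothesis is to run the offending batch through a few basic CTC sanity checks before it reaches the loss. The sketch below is only an assumption about what might be wrong, not a decoding of what "unknown error" means inside warp_ctc. The helper name `find_bad_samples` and the `num_classes` parameter are hypothetical; the other arguments mirror the ones passed to `warpctc_tensorflow.ctc` in the traceback (flat targets, label lengths, sequence lengths, `blank_label`).

```python
def find_bad_samples(flat_labels, label_lengths, input_lengths,
                     blank_label, num_classes):
    """Return (sample_index, reason) pairs for samples that look suspicious.

    flat_labels   : 1-D sequence with the labels of all samples concatenated
    label_lengths : number of labels per sample
    input_lengths : number of time steps per sample
    num_classes   : size of the last logits dimension (assumed, hypothetical name)
    """
    problems = []
    offset = 0
    for i, (lab_len, in_len) in enumerate(zip(label_lengths, input_lengths)):
        labels = list(flat_labels[offset:offset + lab_len])
        offset += lab_len

        if lab_len == 0:
            # Not invalid in theory, but worth a look in practice.
            problems.append((i, "empty label sequence"))
            continue

        if any(l < 0 or l >= num_classes or l == blank_label for l in labels):
            problems.append((i, "label out of range or equal to blank"))

        # CTC needs a blank frame between two identical adjacent labels,
        # so each adjacent repeat costs one extra time step.
        repeats = sum(1 for a, b in zip(labels, labels[1:]) if a == b)
        if in_len < lab_len + repeats:
            problems.append((i, "input shorter than labels plus repeats"))

    if offset != len(flat_labels):
        # Index -1 marks a batch-level problem rather than a single sample.
        problems.append((-1, "label_lengths do not sum to len(flat_labels)"))
    return problems
```

Feeding it the numpy values of the targets, label lengths, and sequence lengths for the failing step (for example, fetched with the same `feed_dict`) should at least narrow down whether one sample in that batch is malformed.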

yetiancn commented 5 years ago

I have the same problem. Did you find a solution?

LearnedVector commented 5 years ago

@yetiancn Unfortunately no, I did not find a solution :/ Instead, I just switched over to the TensorFlow CTC implementation.
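
For anyone making the same switch: the call in the traceback passes the targets to `warpctc_tensorflow.ctc` as one flat 1-D vector plus per-sample label lengths, whereas TensorFlow 1.x's `tf.nn.ctc_loss` expects the labels as an int32 `SparseTensor`. A rough sketch of that conversion is below. The helper name `flat_to_sparse` is hypothetical, and this is not necessarily how the training script here was changed.

```python
def flat_to_sparse(flat_labels, label_lengths):
    """Convert flat concatenated labels into SparseTensor components.

    Returns (indices, values, dense_shape), where indices holds
    [batch_index, position] pairs, suitable for building
    tf.SparseTensor(indices, values, dense_shape).
    """
    indices, values = [], []
    offset = 0
    for batch_idx, length in enumerate(label_lengths):
        for pos in range(length):
            indices.append([batch_idx, pos])
            values.append(int(flat_labels[offset + pos]))
        offset += length
    max_len = max(label_lengths) if len(label_lengths) else 0
    dense_shape = [len(label_lengths), max_len]
    return indices, values, dense_shape


# Example: three samples with 3, 1 and 1 labels respectively.
indices, values, dense_shape = flat_to_sparse([1, 2, 2, 5, 3], [3, 1, 1])
# The components can then be fed to a tf.sparse_placeholder(tf.int32) or
# wrapped in tf.SparseTensor (values cast to int32) for tf.nn.ctc_loss.
```

One thing to keep in mind is that `tf.nn.ctc_loss` in TensorFlow 1.x always treats the last class index (`num_classes - 1`) as blank, so `blank_label=28` only lines up if the logits have 29 classes.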

yetiancn commented 5 years ago

I've decided to try the TensorFlow CTC too. Thank you!

MichaelGou1105 commented 5 years ago

How do I solve it?

baidu-research / warp-ctc

warp_ctc error in compute_ctc_loss #133