HawkAaron / warp-transducer

A fast parallel implementation of RNN Transducer.
Apache License 2.0

CUDA error: an illegal memory access was encountered #61

Open FactoDeepLearning opened 4 years ago

FactoDeepLearning commented 4 years ago

Hello, I'm facing the following error when using your package. It appears randomly after some epochs. Do you have an idea where it could come from?

File "main_rnnt.py", line 86, in <module>
    model.train()
  File "/gpfs1/dlocal/run/7027505/pytorch/rnnt/RNNT.py", line 174, in train
    batch_metrics = self.train_batch(x, y)
  File "/gpfs1/dlocal/run/7027505/pytorch/rnnt/RNNT.py", line 286, in train_batch
    loss = loss_func(pred, y.permute(1, 0).contiguous(), x_len, y_len)
  File "/gpfs1/home/2017018/dcoque01/pytorch/lib/python3.6/site-packages/torch/nn/modules/module.py", line 547, in __call__
    result = self.forward(*input, **kwargs)
  File "/gpfs1/home/2017018/dcoque01/pytorch/lib/python3.6/site-packages/warprnnt_pytorch-0.1-py3.6-linux-x86_64.egg/warprnnt_pytorch/__init__.py", line 100, in forward
    return self.loss(acts, labels, act_lens, label_lens, self.blank, self.reduction)
  File "/gpfs1/home/2017018/dcoque01/pytorch/lib/python3.6/site-packages/warprnnt_pytorch-0.1-py3.6-linux-x86_64.egg/warprnnt_pytorch/__init__.py", line 40, in forward
    grads /= minibatch_size
RuntimeError: CUDA error: an illegal memory access was encountered

CentOS 7, CUDA 10.0, Python 3.6.9, torch 1.2, gcc 7.3.0, GPU: Tesla P100-PCIE-12GB

LearnedVector commented 4 years ago

Getting the same. Any fix? @FactoDeepLearning @HawkAaron

EDIT This was due to me not putting acts, labels, input_len, and label_len on .cuda() in pytorch. Fixed now.

EDIT2 I'm still getting it now. It'll train at first then get this error after X iterations.
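
For reference, a minimal sketch of the device-placement fix mentioned in the first EDIT, assuming the RNNTLoss binding shown in the traceback above (constructor arguments and tensor names/shapes here are illustrative only, not prescribed by the library):

    import torch
    from warprnnt_pytorch import RNNTLoss

    rnnt_loss = RNNTLoss(blank=0, reduction='mean')

    # Illustrative shapes only: batch, input frames, target length, vocab size.
    B, T, U, V = 4, 100, 20, 50
    acts = torch.randn(B, T, U + 1, V, device='cuda', requires_grad=True)   # float32 logits
    labels = torch.randint(1, V, (B, U), dtype=torch.int32, device='cuda')  # int32 targets (no blank)
    act_lens = torch.full((B,), T, dtype=torch.int32, device='cuda')        # per-utterance input lengths
    label_lens = torch.full((B,), U, dtype=torch.int32, device='cuda')      # per-utterance target lengths

    # All four tensors live on the same CUDA device before the loss is called.
    loss = rnnt_loss(acts, labels, act_lens, label_lens)
    loss.backward()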

LearnedVector commented 4 years ago

After some debugging, I think there might be a bug in this library @HawkAaron. I am printing the cost at this line https://github.com/HawkAaron/warp-transducer/blob/master/pytorch_binding/warprnnt_pytorch/__init__.py#L37 and the RuntimeError: CUDA error: an illegal memory access was encountered only happens when the cost prints as 0. I am assuming that the loss_fn https://github.com/HawkAaron/warp-transducer/blob/master/pytorch_binding/warprnnt_pytorch/__init__.py#L27 is not updating the costs or gradients, which causes it to error out. Any ideas?

Also, the issue does not occur when running on CPU, and in that case there are no 0 costs.
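
One generic way to localize this kind of failure (not part of this library; `check_costs` is a hypothetical helper) is to force synchronous kernel launches and inspect the costs before calling backward:

    import os
    # Force synchronous kernel launches so the failing kernel is reported at its real
    # call site instead of at a later, unrelated op such as `grads /= minibatch_size`.
    # (Must be set before CUDA is initialized, e.g. before the first .cuda() call.)
    os.environ["CUDA_LAUNCH_BLOCKING"] = "1"

    import torch

    def check_costs(costs):
        # Hypothetical helper: flag zero or non-finite per-utterance costs before backward.
        if torch.any(costs == 0) or not torch.all(torch.isfinite(costs)):
            print("suspicious RNNT costs:", costs)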

funcwj commented 4 years ago

Same issue.

jaesong commented 4 years ago

I think #64 will fix this issue.

housebaby commented 4 years ago

My version is the latest. When using warp-transducer in espnet, the error still occurs: "CUDA error: an illegal memory access was encountered". I discussed it in the espnet project, but they think it is a problem with the transducer.

https://github.com/espnet/espnet/issues/1860#issuecomment-651040485

My warp-transducer version is as follows:

    Merge: c1a265f 5098002
    Author: Mingkun Huang <mingkunhuang95@gmail.com>
    Date:   Mon Apr 27 23:07:35 2020 +0800

        Merge pull request #66 from kamo-naoyuki/pt1.5

        Support pytorch1.5

HawkAaron commented 4 years ago

@housebaby which kind of GPU did you use?

housebaby commented 4 years ago

> @housebaby which kind of GPU did you use?

Tesla V100

It does not always fail. In some cases, using either 4 or 8 cards, it works. But when I just change the batch size (or learning rate) of a successful case, it fails. It is confusing.

oshindow commented 4 years ago

Same issue. When the batch size is 3, it passes. When the batch size is set higher, it fails.

jaesong commented 4 years ago

Oh, right, there's an overflow issue at compute_grad_kernel:

    // 0 <= col < batch * T * U
    int col = blockIdx.x;

    // col * alphabet_size can be > 2**31 - 1 = INT_MAX, but its type is int
    Tp logpk = denom[col] + acts[col * alphabet_size + idx];

cuda-memcheck seems to catch such a problem with batch=1, src=53688, tgt=1+1, vocab=20000 (53688 * 2 * 20000 > INT_MAX). I also suspect that there are similar overflow issues at ReduceHelper, but I haven't checked them properly.
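
A quick back-of-the-envelope check of the overflow described above, using the shapes from the cuda-memcheck run (plain Python, not library code):

    INT_MAX = 2**31 - 1  # largest value a signed 32-bit int can hold

    # Shapes from the cuda-memcheck run above: batch=1, T=53688, U+1=2, vocab=20000.
    batch, T, U_plus_1, vocab = 1, 53688, 2, 20000
    max_flat_index = batch * T * U_plus_1 * vocab - 1  # largest col * alphabet_size + idx

    print(max_flat_index, ">", INT_MAX, "->", max_flat_index > INT_MAX)
    # 2147519999 > 2147483647 -> True, so the 32-bit index wraps around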

housebaby commented 4 years ago

> Oh, right, there's an overflow issue at compute_grad_kernel:
>
>     // 0 <= col < batch * T * U
>     int col = blockIdx.x;
>
>     // col * alphabet_size can be > 2**31 - 1 = INT_MAX, but its type is int
>     Tp logpk = denom[col] + acts[col * alphabet_size + idx];
>
> cuda-memcheck seems to catch such a problem with batch=1, src=53688, tgt=1+1, vocab=20000 (53688 * 2 * 20000 > INT_MAX). I also suspect that there are similar overflow issues at ReduceHelper, but I haven't checked them properly.

Cool. Then how should we solve this overflow problem? And will a fix for it be merged into warp-transducer soon? @HawkAaron @jaesong
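
Until the kernels use 64-bit index arithmetic, one practical workaround is to keep the flattened activation index below INT_MAX, e.g. by capping batch size or sequence lengths. A hypothetical helper sketching that check, assuming the usual (B, T, U+1, V) activation layout:

    INT_MAX = 2**31 - 1

    def fits_in_int32(batch, max_T, max_U, vocab):
        # Hypothetical helper, assuming acts has shape (batch, max_T, max_U + 1, vocab):
        # the flattened index used by the CUDA kernels must stay below INT_MAX.
        return batch * max_T * (max_U + 1) * vocab <= INT_MAX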

stefan-falk commented 3 years ago

I don't know if this is related, but after upgrading to Tensorflow 2.5.0 (and therefore to CUDA 11.1) I am seeing this when training RNN-based transducer models. The loss either becomes NaN, or I see the following error:

2021-06-17 17:23:44.905116: E tensorflow/stream_executor/dnn.cc:729] CUDNN_STATUS_EXECUTION_FAILED
in tensorflow/stream_executor/cuda/cuda_dnn.cc(1990): 'cudnnRNNForwardTraining( cudnn.handle(), rnn_desc.handle(), model_dims.max_seq_length, input_desc.handles(), input_data.opaque(), input_h_desc.handle(), input_h_data.opaque(), input_c_desc.handle(), input_c_data.opaque(), rnn_desc.params_handle(), params.opaque(), output_desc.handles(), output_data->opaque(), output_h_desc.handle(), output_h_data->opaque(), output_c_desc.handle(), output_c_data->opaque(), workspace.opaque(), workspace.size(), reserve_space.opaque(), reserve_space.size())'
2021-06-17 17:23:44.905169: W tensorflow/core/framework/op_kernel.cc:1767] OP_REQUIRES failed at cudnn_rnn_ops.cc:1560 : Internal: Failed to call ThenRnnForward with model config: [rnn_mode, rnn_input_mode, rnn_direction_mode]: 2, 0, 0 , [num_layers, input_size, num_units, dir_count, max_seq_length, batch_size, cell_num_units]: [1, 768, 768, 1, 29, 41, 768]
2021-06-17 17:23:44.906664: I tensorflow/stream_executor/stream.cc:1404] [stream=0x55774c2eb680,impl=0x5577394acab0] did not wait for [stream=0x55774c2eb410,impl=0x5577266661f0]
2021-06-17 17:23:44.906810: E tensorflow/stream_executor/cuda/cuda_driver.cc:1085] could not wait stream on event: CUDA_ERROR_ILLEGAL_ADDRESS: an illegal memory access was encountered
2021-06-17 17:23:44.906826: E tensorflow/stream_executor/cuda/cuda_driver.cc:1085] could not wait stream on event: CUDA_ERROR_ILLEGAL_ADDRESS: an illegal memory access was encountered
2021-06-17 17:23:44.906841: E tensorflow/stream_executor/cuda/cuda_event.cc:29] Error polling for event status: failed to query event: CUDA_ERROR_ILLEGAL_ADDRESS: an illegal memory access was encountered
2021-06-17 17:23:44.906859: E tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:721] failed to record completion event; therefore, failed to create inter-stream dependency
2021-06-17 17:23:44.906872: E tensorflow/stream_executor/cuda/cuda_event.cc:29] Error polling for event status: failed to query event: CUDA_ERROR_ILLEGAL_ADDRESS: an illegal memory access was encountered
2021-06-17 17:23:44.906888: E tensorflow/stream_executor/stream.cc:334] Error recording event in stream: Error recording CUDA event: CUDA_ERROR_ILLEGAL_ADDRESS: an illegal memory access was encountered; not marking stream as bad, as the Event object may be at fault. Monitor for further errors.
2021-06-17 17:23:44.906903: F tensorflow/core/common_runtime/device/device_event_mgr.cc:221] Unexpected Event status: 1
2021-06-17 17:23:44.906911: E tensorflow/stream_executor/cuda/cuda_event.cc:29] Error polling for event status: failed to query event: CUDA_ERROR_ILLEGAL_ADDRESS: an illegal memory access was encountered
2021-06-17 17:23:44.906920: E tensorflow/stream_executor/cuda/cuda_driver.cc:1202] failed to enqueue async memcpy from host to device: CUDA_ERROR_ILLEGAL_ADDRESS: an illegal memory access was encountered; GPU dst: 0x7fec7589a700; host src: 0x7fec55458200; size: 4=0x4
2021-06-17 17:23:44.906934: F tensorflow/core/common_runtime/device/device_event_mgr.cc:221] Unexpected Event status: 1
2021-06-17 17:23:44.906946: E tensorflow/stream_executor/cuda/cuda_driver.cc:1202] failed to enqueue async memcpy from host to device: CUDA_ERROR_ILLEGAL_ADDRESS: an illegal memory access was encountered; GPU dst: 0x7fed1b838100; host src: 0x7fe28e26b040; size: 24531156=0x17650d4
2021-06-17 17:23:44.906960: E tensorflow/stream_executor/cuda/cuda_driver.cc:1202] failed to enqueue async memcpy from host to device: CUDA_ERROR_ILLEGAL_ADDRESS: an illegal memory access was encountered; GPU dst: 0x7fecaa6b1b00; host src: 0x7fec55457a00; size: 164=0xa4
2021-06-17 17:23:44.906974: E tensorflow/stream_executor/cuda/cuda_driver.cc:1182] failed to enqueue async memcpy from device to host: CUDA_ERROR_ILLEGAL_ADDRESS: an illegal memory access was encountered; host dst: 0x7fec5545af00; GPU src: 0x7fe75f100d00; size: 31980=0x7cec
2021-06-17 17:23:44.906987: F tensorflow/core/common_runtime/device/device_event_mgr.cc:221] Unexpected Event status: 1
Fatal Python error: Aborted

Thread 0x00007fec57a63700 (most recent call first):
  File "/home/sfalk/miniconda3/envs/asr2/lib/python3.8/multiprocessing/pool.py"
Aborted (core dumped)

It's possible that this has nothing to do with https://github.com/HawkAaron/warp-transducer but it's the only external library I am using in combination with Tensorflow.

See also https://github.com/tensorflow/tensorflow/issues/50326

yufang67 commented 2 years ago

Hi @stefan-falk, did you resolve the issue? I have a similar problem with TF 2.8.2 + CUDA 11.2 + warp-rnnt. The issue occurs only on multi-GPU.