RuntimeError: cuDNN error: CUDNN_STATUS_EXECUTION_FAILED

OswaldoBornemann commented 5 years ago

When I ran train.py, it just showed RuntimeError: cuDNN error: CUDNN_STATUS_EXECUTION_FAILED, even though my PyTorch version is v1.0.1.

THCudaCheck FAIL file=/opt/conda/conda-bld/pytorch_1549630534704/work/torch/csrc/generic/serialization.cpp line=23 error=59 : device-side assert triggered
Traceback (most recent call last):
  File "train.py", line 438, in <module>
    train_loop(device, model, data_loader, optimizer, checkpoint_dir)
  File "train.py", line 319, in train_loop
    loss.backward()
  File "/home/data/anaconda3/envs/tensor-torch/lib/python3.6/site-packages/torch/tensor.py", line 102, in backward
    torch.autograd.backward(self, gradient, retain_graph, create_graph)
  File "/home/data/anaconda3/envs/tensor-torch/lib/python3.6/site-packages/torch/autograd/__init__.py", line 90, in backward
    allow_unreachable=True)  # allow_unreachable flag
RuntimeError: cuDNN error: CUDNN_STATUS_EXECUTION_FAILED

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "train.py", line 444, in <module>
    save_checkpoint(device, model, optimizer, global_step, checkpoint_dir, global_epoch)
  File "train.py", line 189, in save_checkpoint
    }, checkpoint_path)
  File "/home/data/anaconda3/envs/tensor-torch/lib/python3.6/site-packages/torch/serialization.py", line 219, in save
    return _with_file_like(f, "wb", lambda f: _save(obj, f, pickle_module, pickle_protocol))
  File "/home/data/anaconda3/envs/tensor-torch/lib/python3.6/site-packages/torch/serialization.py", line 144, in _with_file_like
    return body(f)
  File "/home/data/anaconda3/envs/tensor-torch/lib/python3.6/site-packages/torch/serialization.py", line 219, in <lambda>
    return _with_file_like(f, "wb", lambda f: _save(obj, f, pickle_module, pickle_protocol))
  File "/home/data/anaconda3/envs/tensor-torch/lib/python3.6/site-packages/torch/serialization.py", line 298, in _save
    serialized_storages[key]._write_file(f, _should_read_directly(f))
RuntimeError: cuda runtime error (59) : device-side assert triggered at /opt/conda/conda-bld/pytorch_1549630534704/work/torch/csrc/generic/serialization.cpp:23

geneing commented 5 years ago

I haven't seen this error. It's a strange error: it happens while saving the model.
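One possible reason the failure shows up during saving: CUDA reports device-side asserts asynchronously, so the Python line in the traceback is not necessarily the operation that failed. A minimal sketch for getting a more accurate traceback, assuming the variable is set before any CUDA work starts (the helper name is hypothetical):

```python
import os

def enable_synchronous_cuda_errors():
    """Make CUDA kernel launches synchronous so errors surface at the failing call."""
    # Must be set before the first CUDA call; simplest is before importing torch.
    os.environ["CUDA_LAUNCH_BLOCKING"] = "1"
    return os.environ["CUDA_LAUNCH_BLOCKING"]

enable_synchronous_cuda_errors()
# import torch  # import only after the variable is set, then train as usual
```

The same effect can be had from the shell with `CUDA_LAUNCH_BLOCKING=1 python train.py`. Training runs slower in this mode, so it is meant for debugging only.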

I'm using PyTorch 1.0.1.post2 and CUDA 10. Maybe you are running out of memory on the GPU, so try reducing the batch size. Also, check that the checkpoint path exists and is writable.
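The checkpoint path check can be sketched as below. This is an illustration only: `check_checkpoint_dir` is a hypothetical helper, not part of train.py, and it tests writability by actually creating a temporary file.

```python
import os
import tempfile

def check_checkpoint_dir(checkpoint_dir):
    """Return None if the directory exists and is writable, else a short description."""
    if not os.path.isdir(checkpoint_dir):
        return "missing directory: " + checkpoint_dir
    try:
        # Create and delete a real file; this also catches read-only mounts.
        with tempfile.NamedTemporaryFile(dir=checkpoint_dir):
            pass
    except OSError as exc:
        return "not writable: {}".format(exc)
    return None
```

Calling it once before the training loop, for example `print(check_checkpoint_dir("checkpoints"))` with your own path, rules out this cause before any GPU time is spent.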

OswaldoBornemann commented 5 years ago

@geneing thanks my friend. Now the training is going on.

maozhiqiang commented 5 years ago

Hi @tsungruihon! How did you solve the problem? Which versions of PyTorch and CUDA are you using? Thanks!

OswaldoBornemann commented 5 years ago

@maozhiqiang PyTorch 1.0.1.post2 and CUDA 10

maozhiqiang commented 5 years ago

thank you!

OswaldoBornemann commented 5 years ago

@maozhiqiang I noticed you have used Mozilla TTS on a Chinese corpus and got good results. I have recently started using Mozilla TTS too. Could I contact you by email?

maozhiqiang commented 5 years ago

@tsungruihon It's a pleasure to talk with you! My email is: z_q_mao@163.com

acrosson commented 5 years ago

@maozhiqiang did updating to pytorch 1.0.1.post2 work?

I'm getting the same error on PyTorch 1.1.0 with CUDA 10.1.

acrosson commented 5 years ago

I've decreased the batch size to 64 and got the same error. It's not an OOM error; I've got 16GB of memory. I changed the checkpoint folder's permissions, so it shouldn't be a permissions issue. Any other suggestions, @geneing?

geneing commented 5 years ago

@acrosson @tsungruihon Please try the newly committed code. I fixed a quantization issue which was generating a similar error to the one you are seeing. Make sure that the input audio files contain values that are in the range [-1, 1].
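To see why the range matters, here is a minimal sketch, not the repository's actual quantization code. It assumes samples are mapped linearly from [-1, 1] to integer classes, with `bits=9` chosen only for illustration, and uses plain Python lists to stay dependency-free. A sample outside [-1, 1] yields a class index outside the valid range, which is a common trigger for a device-side assert when that index is used as a target or lookup on the GPU.

```python
def quantize(sample, bits=9):
    """Map a float sample in [-1, 1] to an integer class in [0, 2**bits - 1]."""
    levels = 2 ** bits
    return int((sample + 1.0) * (levels - 1) / 2.0 + 0.5)

def out_of_range_count(samples):
    """Count samples that fall outside [-1, 1]."""
    return sum(1 for s in samples if not -1.0 <= s <= 1.0)

def peak_normalize(samples):
    """Scale samples so the largest magnitude is at most 1.0."""
    if not samples:
        return []
    peak = max(abs(s) for s in samples)
    if peak <= 1.0:
        return list(samples)
    return [s / peak for s in samples]

wav = [0.5, -2.0, 1.0]                 # hypothetical samples; -2.0 is out of range
print(out_of_range_count(wav))         # how many samples need fixing
print(quantize(1.5) >= 2 ** 9)         # an out-of-range sample overflows the classes
print(out_of_range_count(peak_normalize(wav)))
```

Peak normalization is just one way to bring audio into range; clipping or fixing the loader's scaling (for example, dividing 16-bit integer samples by 32768) may suit the data better.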

OswaldoBornemann commented 5 years ago

Thanks a lot!

geneing notifications@github.com wrote on Fri, May 10, 2019 at 12:58 PM:

@acrosson @tsungruihon Please try the newly committed code. I fixed a quantization issue which was generating a similar error to the one you are seeing. Make sure that the input audio files contain values that are in the range [-1, 1].


acrosson commented 5 years ago

@tsungruihon did this work for you? I got the same error, even after pulling down the latest code.

I didn't normalize, like @geneing suggested, maybe that's the issue?

OswaldoBornemann commented 5 years ago

@acrosson I haven't tried the latest code yet; I'm busy with TTS right now. :sob:

geneing / WaveRNN-Pytorch

RuntimeError: cuDNN error: CUDNN_STATUS_EXECUTION_FAILED #2