RuntimeError: cuDNN error: CUDNN_STATUS_EXECUTION_FAILED

McC0dy commented 5 years ago

When training using any of the example configurations from the documentation I get the error: "RuntimeError: cuDNN error: CUDNN_STATUS_EXECUTION_FAILED"

Reproducing For example running: python main.py --network_type rnn --dataset wikitext

My system configuration CUDA 10.1 Python 3.7.3 PyTorch 1.1.0 Arch Linux GPU: RTX 2070

Other PyTorch applications work just fine.

Full output (from pipenv environment):

% python main.py --network_type rnn --dataset wikitext                                                                    oliver@oliver
2019-06-14 16:30:31,585:INFO::[*] Make directories : logs/wikitext_2019-06-14_16-30-31
2019-06-14 16:30:49,909:INFO::regularizing:
2019-06-14 16:30:54,743:INFO::# of parameters: 169,315,278
2019-06-14 16:30:54,834:INFO::[*] MODEL dir: logs/wikitext_2019-06-14_16-30-31
2019-06-14 16:30:54,834:INFO::[*] PARAM path: logs/wikitext_2019-06-14_16-30-31/params.json
Traceback (most recent call last):
  File "main.py", line 54, in <module>
    main(args)
  File "main.py", line 34, in main
    trnr.train()
  File "/home/oliver/code/ENAS-pytorch/trainer.py", line 222, in train
    self.train_shared(dag=dag)
  File "/home/oliver/code/ENAS-pytorch/trainer.py", line 305, in train_shared
    dags)
  File "/home/oliver/code/ENAS-pytorch/trainer.py", line 251, in get_loss
    output, hidden, extra_out = self.shared(inputs, dag, hidden=hidden)
  File "/home/oliver/.local/share/virtualenvs/ENAS-pytorch-kjHs_kjH/lib/python3.7/site-packages/torch/nn/modules/module.py", line 493, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/oliver/code/ENAS-pytorch/models/shared_rnn.py", line 235, in forward
    logit, hidden = self.cell(x_t, hidden, dag)
  File "/home/oliver/code/ENAS-pytorch/models/shared_rnn.py", line 354, in cell
    output = self.batch_norm(output)
  File "/home/oliver/.local/share/virtualenvs/ENAS-pytorch-kjHs_kjH/lib/python3.7/site-packages/torch/nn/modules/module.py", line 493, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/oliver/.local/share/virtualenvs/ENAS-pytorch-kjHs_kjH/lib/python3.7/site-packages/torch/nn/modules/batchnorm.py", line 83, in forward
    exponential_average_factor, self.eps)
  File "/home/oliver/.local/share/virtualenvs/ENAS-pytorch-kjHs_kjH/lib/python3.7/site-packages/torch/nn/functional.py", line 1697, in batch_norm
    training, momentum, eps, torch.backends.cudnn.enabled
RuntimeError: cuDNN error: CUDNN_STATUS_EXECUTION_FAILED

Debugging Debugging the parameters passed to batch_norm I found that the following parameters are all on cuda-device: input, weight, bias, running_mean, running_var. Which is all reasonable. The remaining vars are reasonable as well.

lorenzoviva commented 5 years ago

Had same problem, the pytorch most widely used for NAS-related github repositories is 0.3.1 sometimes 0.2. I suggest you to try a downgrade.

carpedm20 commented 5 years ago

I think you should use v0.3.1 (links) which was released on Feb 13, 2018 because my initial commit was on Feb 14, 2018.

carpedm20 / ENAS-pytorch

RuntimeError: cuDNN error: CUDNN_STATUS_EXECUTION_FAILED #44