askforalfred / alfred

ALFRED - A Benchmark for Interpreting Grounded Instructions for Everyday Tasks
MIT License

cuDNN error: CUDNN_STATUS_EXECUTION_FAILED #26

Open PeterAJansen opened 4 years ago

PeterAJansen commented 4 years ago

Hi,

I'm seeing the same error as another person posted --

(alfred_env) (base) peter@neutronium:~/github/alfred$ python models/train/train_seq2seq.py --data data/json_feat_2.1.0 --model seq2seq_im_mask --dout exp/model:{model},name:pm_and_subgoals_01 --splits data/splits/oct21.json --gpu --batch 8 --pm_aux_loss_wt 0.1 --subgoal_aux_loss_wt 0.1
Namespace(action_loss_wt=1.0, actor_dropout=0.0, attn_dropout=0.0, batch=8, data='data/json_feat_2.1.0', dataset_fraction=0, dec_teacher_forcing=False, decay_epoch=10, demb=100, dframe=2500, dhid=512, dout='exp/model:seq2seq_im_mask,name:pm_and_subgoals_01', epoch=20, fast_epoch=False, gpu=True, hstate_dropout=0.3, input_dropout=0.0, lang_dropout=0.0, lr=0.0001, mask_loss_wt=1.0, model='seq2seq_im_mask', pframe=300, pm_aux_loss_wt=0.1, pp_folder='pp', preprocess=False, resume=None, save_every_epoch=False, seed=123, splits='data/splits/oct21.json', subgoal_aux_loss_wt=0.1, temp_no_history=False, vis_dropout=0.3, zero_goal=False, zero_instr=False)
{'tests_seen': 1533, 'tests_unseen': 1529, 'train': 21023, 'valid_seen': 820, 'valid_unseen': 821}
Traceback (most recent call last):
  File "models/train/train_seq2seq.py", line 103, in <module>
    model = model.to(torch.device('cuda'))
  File "/home/peter/github/alfred_env/lib/python3.7/site-packages/torch/nn/modules/module.py", line 386, in to
    return self._apply(convert)
  File "/home/peter/github/alfred_env/lib/python3.7/site-packages/torch/nn/modules/module.py", line 193, in _apply
    module._apply(fn)
  File "/home/peter/github/alfred_env/lib/python3.7/site-packages/torch/nn/modules/rnn.py", line 127, in _apply
    self.flatten_parameters()
  File "/home/peter/github/alfred_env/lib/python3.7/site-packages/torch/nn/modules/rnn.py", line 123, in flatten_parameters
    self.batch_first, bool(self.bidirectional))
RuntimeError: cuDNN error: CUDNN_STATUS_EXECUTION_FAILED

I have verified that I've followed the installation instructions and that the correct versions of torch (1.1.0), torchvision (0.3.0 per requirements.txt; the prose says 1.3.0, but the latest released version is 0.6.0), AI2THOR (2.1.0), and tensorboardX (1.8) have been installed.

I'm using a Titan RTX and CUDA 10.1 on Kubuntu 18.04.

The model seems to start training without the --gpu option, but it appears slow (so I didn't wait to see how long it would take).

thanks!

MohitShridhar commented 4 years ago

@PeterAJansen can you try a smaller batch size? Something less than 8?

PeterAJansen commented 4 years ago

@MohitShridhar I forgot to mention this too -- smaller batch sizes produced the same error. The Titan RTX has 24 GB of memory, which should hopefully be plenty for moderate batch sizes.

MohitShridhar commented 4 years ago

Ah I see. Have you seen this? This error is being thrown by the PyTorch RNN module, so I am not sure what's happening here.

It seems like you need to build PyTorch with the right CUDA version?
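If it helps, one quick sanity check (just a sketch, not ALFRED-specific) is to print what the installed torch binary was built against and which GPU it sees, to spot a mismatch with the system CUDA install:

import torch

# Report the CUDA / cuDNN versions the installed torch binary was built with,
# plus the GPU it sees, to check for a version mismatch.
print("torch:", torch.__version__)
print("CUDA available:", torch.cuda.is_available())
print("built against CUDA:", torch.version.cuda)
print("cuDNN:", torch.backends.cudnn.version())
if torch.cuda.is_available():
    print("device:", torch.cuda.get_device_name(0))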

SouLeo commented 4 years ago

@PeterAJansen did you make any progress on this? I just purchased an RTX 2080S, performed a fresh install of Ubuntu 18.04, downloaded the recommended PyTorch version (1.5.1), and my CUDA version is 10.2. Despite all this effort, I still get the same error as you.

PeterAJansen commented 4 years ago

Unfortunately, no luck on my end; I was never able to get this running. If you do figure it out, please post the solution to this thread -- I'd be eager to give it a try.

MohitShridhar commented 4 years ago

Sorry, I wish I could help, but I don't have an RTX 2080S to debug this.

SouLeo commented 4 years ago

No worries! I think I figured out that it might be an OOM issue. I ran it a couple of times on my 8 GB GPU and saw that the training program nearly used all 8 GB.

Then, after rerunning the training and changing absolutely nothing about the training program, it was able to run (and it has been running for at least 11 hours).

I’m betting I just got lucky, and I’ll be searching for cloud compute resources for the future.
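If it helps anyone confirm the memory theory, here's a rough sketch (not part of the ALFRED code; assumes a single GPU) that could be called once per batch from the training loop:

import torch

def log_gpu_memory(device=0):
    # Rough check of how close training gets to the card's memory limit.
    total = torch.cuda.get_device_properties(device).total_memory
    allocated = torch.cuda.memory_allocated(device)
    peak = torch.cuda.max_memory_allocated(device)
    print("GPU %d: %.2f GB allocated (peak %.2f GB) of %.2f GB total"
          % (device, allocated / 1e9, peak / 1e9, total / 1e9))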

PeterAJansen commented 4 years ago

@SouLeo I'm working with a Titan RTX with 24 GB of memory, and I was getting the error even with a batch size of 1, so I don't think it was an out-of-memory issue in my case -- in case that helps you figure out what the issue ultimately was.
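For what it's worth, the traceback dies in flatten_parameters() when the model is moved to the GPU, so a minimal repro outside ALFRED (just a sketch with made-up sizes) might tell you whether the problem is the torch/cuDNN install rather than the training code:

import torch
import torch.nn as nn

# Moving an LSTM to the GPU calls flatten_parameters(), the same call path
# that fails in the ALFRED traceback. If this also raises
# CUDNN_STATUS_EXECUTION_FAILED, the environment (torch/cuDNN/driver) is the
# problem, not the ALFRED code.
lstm = nn.LSTM(input_size=64, hidden_size=128, batch_first=True)
lstm = lstm.to(torch.device('cuda'))
x = torch.randn(1, 10, 64, device='cuda')
out, _ = lstm(x)
print(out.shape)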

kolbytn commented 4 years ago

Potential Fix

I was running into the same issue: Ubuntu 18.04, CUDA 10.2, Titan RTX 24 GB. I followed the quick install instructions, and the error happened almost immediately. Smaller batch sizes didn't help; running without --gpu worked.

Command:

CUDA_VISIBLE_DEVICES=1 python models/train/train_seq2seq.py --data data/json_feat_2.1.0 --model seq2seq_im_mask --dout exp/model:{model},name:pm_and_subgoals_01 --splits data/splits/oct21.json --gpu --batch 2 --pm_aux_loss_wt 0.1 --subgoal_aux_loss_wt 0.1 --preprocess

Output:

Namespace(action_loss_wt=1.0, actor_dropout=0.0, attn_dropout=0.0, batch=8, data='data/json_feat_2.1.0', dataset_fraction=0, dec_teacher_forcing=False, decay_epoch=10, demb=100, dframe=2500, dhid=512, dout='exp/model:seq2seq_im_mask,name:pm_and_subgoals_01', epoch=20, fast_epoch=False, gpu=True, hstate_dropout=0.3, input_dropout=0.0, lang_dropout=0.0, lr=0.0001, mask_loss_wt=1.0, model='seq2seq_im_mask', pframe=300, pm_aux_loss_wt=0.1, pp_folder='pp', preprocess=False, resume=None, save_every_epoch=False, seed=123, splits='data/splits/oct21.json', subgoal_aux_loss_wt=0.1, temp_no_history=False, vis_dropout=0.3, zero_goal=False, zero_instr=False)
{'tests_seen': 1533,
 'tests_unseen': 1529,
 'train': 21023,
 'valid_seen': 820,
 'valid_unseen': 821}
Traceback (most recent call last):
  File "models/train/train_seq2seq.py", line 103, in <module>
    model = model.to(torch.device('cuda'))
  File "/home/knotting/embodied/venv_alfred/lib/python3.6/site-packages/torch/nn/modules/module.py", line 386, in to
    return self._apply(convert)
  File "/home/knotting/embodied/venv_alfred/lib/python3.6/site-packages/torch/nn/modules/module.py", line 193, in _apply
    module._apply(fn)
  File "/home/knotting/embodied/venv_alfred/lib/python3.6/site-packages/torch/nn/modules/rnn.py", line 127, in _apply
    self.flatten_parameters()
  File "/home/knotting/embodied/venv_alfred/lib/python3.6/site-packages/torch/nn/modules/rnn.py", line 123, in flatten_parameters
    self.batch_first, bool(self.bidirectional))
RuntimeError: cuDNN error: CUDNN_STATUS_EXECUTION_FAILED

I uninstalled the versions of torch and torchvision specified in requirements.txt and installed the latest versions instead. Everything seems to be working fine now. Is this a legitimate fix, or will I run into issues using the latest PyTorch with other parts of the repo?

MohitShridhar commented 4 years ago

Well... without --gpu you are training on CPU, which would be very slow.

kolbytn commented 4 years ago

Sorry if I wasn't clear. I was pointing out that it does work when running on the CPU, which suggests this is a CUDA/GPU issue.

I fixed my issue by upgrading torch to the latest version instead of the version specified by requirements.txt. I'd like to know whether there is a specific reason requirements.txt pins torch 1.1.0, and whether anything will break if I use torch 1.6.0.

MohitShridhar commented 4 years ago

Yeah, I figure there might be some API changes in torch 1.6.0 that could break the code, especially with GPU training.

dnandha commented 3 years ago

Getting the same error with the Docker image on an RTX 2080. Could it be that this card is not supported by torch==1.1.0?
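One way to check that theory (just my guess, not a confirmed cause): Turing cards report compute capability 7.5, and if the prebuilt torch 1.1.0 binary doesn't ship kernels for that architecture, cuDNN calls can fail like this. A quick check:

import torch

# RTX 2080 / Titan RTX (Turing) report compute capability 7.5. If the
# installed torch binary wasn't built with support for it, cuDNN calls
# can fail. (Assumption to verify, not a confirmed cause.)
print(torch.cuda.get_device_capability(0))
print(torch.__version__, torch.version.cuda)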

MohitShridhar commented 3 years ago

@dnandha the seq2seq baselines are a bit outdated now. Check out the SoTA models that use newer torch versions: https://github.com/askforalfred/alfred#sota-models