PeterAJansen opened this issue 4 years ago
@PeterAJansen can you try a smaller batch size? Something less than 8?
@MohitShridhar I forgot to mention this too -- smaller batch sizes produced the same error. The Titan RTX has 24 GB of memory, which should be plenty for moderate batch sizes.
Ah I see. Have you seen this? This error is being thrown by the PyTorch RNN module, so I am not sure what's happening here.
It seems like you need to build PyTorch with the right CUDA version?
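A quick way to confirm which torch / CUDA / cuDNN builds are actually active (a minimal sketch; run it inside the same environment used for training):
# Print the versions torch was built against and whether the GPU is visible.
import torch

print("torch:             ", torch.__version__)
print("built against CUDA:", torch.version.cuda)
print("cuDNN:             ", torch.backends.cudnn.version())
print("CUDA available:    ", torch.cuda.is_available())
if torch.cuda.is_available():
    print("device:            ", torch.cuda.get_device_name(0))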
@PeterAJansen did you make any progress on this? I just purchased a RTX 2080S, performed a fresh install of Ubuntu 18.04, downloaded the recommended pytorch version (1.5.1), and my CUDA version is 10.2. Despite all this effort, I still get the same error as you.
Unfortunately no luck on my end, I was never able to get this running. If you do figure it out, please post the solution to this thread -- I'd be eager to give it a try.
Sorry, I wish I could help, but I don't have a RTX 2080S to debug this.
No worries! I think I figured out that it might be an OOM issue. I ran it a couple of times on my 8 GB GPU and saw that training nearly used all 8 GB.
Then, after rerunning the training without changing anything, it was able to run (and it has been running for at least 11 hours).
I'm betting I just got lucky, and I'll be searching for cloud compute resources in the future.
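To rule OOM in or out, watching nvidia-smi during training, or printing something like the sketch below periodically, should show whether the 8 GB is actually exhausted (a rough sketch, assuming a single GPU at device 0):
# Report total device memory vs. what this process has allocated so far.
import torch

props = torch.cuda.get_device_properties(0)
print("total:         %.1f GiB" % (props.total_memory / 1024**3))
print("allocated:     %.1f GiB" % (torch.cuda.memory_allocated(0) / 1024**3))
print("max allocated: %.1f GiB" % (torch.cuda.max_memory_allocated(0) / 1024**3))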
@SouLeo I'm working with a Titan RTX with 24 GB of memory and was getting the error even with a batch size of 1, so I don't think it was an out-of-memory issue in my case -- in case that helps you figure out what the issue ultimately was.
Potential Fix
I was running into the same issue: Ubuntu 18.04, CUDA 10.2, Titan RTX (24 GB). I followed the quick install instructions, and the error happened almost immediately. Smaller batch sizes didn't help; running without --gpu worked.
Command:
CUDA_VISIBLE_DEVICES=1 python models/train/train_seq2seq.py --data data/json_feat_2.1.0 --model seq2seq_im_mask --dout exp/model:{model},name:pm_and_subgoals_01 --splits data/splits/oct21.json --gpu --batch 2 --pm_aux_loss_wt 0.1 --subgoal_aux_loss_wt 0.1 --preprocess
Output:
Namespace(action_loss_wt=1.0, actor_dropout=0.0, attn_dropout=0.0, batch=8, data='data/json_feat_2.1.0', dataset_fraction=0, dec_teacher_forcing=False, decay_epoch=10, demb=100, dframe=2500, dhid=512, dout='exp/model:seq2seq_im_mask,name:pm_and_subgoals_01', epoch=20, fast_epoch=False, gpu=True, hstate_dropout=0.3, input_dropout=0.0, lang_dropout=0.0, lr=0.0001, mask_loss_wt=1.0, model='seq2seq_im_mask', pframe=300, pm_aux_loss_wt=0.1, pp_folder='pp', preprocess=False, resume=None, save_every_epoch=False, seed=123, splits='data/splits/oct21.json', subgoal_aux_loss_wt=0.1, temp_no_history=False, vis_dropout=0.3, zero_goal=False, zero_instr=False)
{'tests_seen': 1533,
'tests_unseen': 1529,
'train': 21023,
'valid_seen': 820,
'valid_unseen': 821}
Traceback (most recent call last):
File "models/train/train_seq2seq.py", line 103, in <module>
model = model.to(torch.device('cuda'))
File "/home/knotting/embodied/venv_alfred/lib/python3.6/site-packages/torch/nn/modules/module.py", line 386, in to
return self._apply(convert)
File "/home/knotting/embodied/venv_alfred/lib/python3.6/site-packages/torch/nn/modules/module.py", line 193, in _apply
module._apply(fn)
File "/home/knotting/embodied/venv_alfred/lib/python3.6/site-packages/torch/nn/modules/rnn.py", line 127, in _apply
self.flatten_parameters()
File "/home/knotting/embodied/venv_alfred/lib/python3.6/site-packages/torch/nn/modules/rnn.py", line 123, in flatten_parameters
self.batch_first, bool(self.bidirectional))
RuntimeError: cuDNN error: CUDNN_STATUS_EXECUTION_FAILED
I uninstalled the versions of torch and torchvision specified in requirements.txt and instead installed the latest versions. Everything seems to be working fine now. Is this a legitimate fix, or will I run into issues using the latest PyTorch with other parts of the repo?
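For anyone else hitting this, a bare cuDNN LSTM isolates whether the environment itself is broken (a minimal sketch, independent of the ALFRED code; the traceback above comes from flatten_parameters() on a cuDNN RNN during .to('cuda'), so a standalone LSTM should fail the same way in a broken environment and run cleanly otherwise):
import torch
import torch.nn as nn

# Moving a cuDNN RNN to the GPU triggers the same flatten_parameters() path
# that fails in the traceback above.
lstm = nn.LSTM(input_size=10, hidden_size=20, batch_first=True).to("cuda")
x = torch.randn(2, 5, 10, device="cuda")
out, _ = lstm(x)
print("cuDNN LSTM forward OK:", out.shape)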
Well... without --gpu you are training on the CPU, which would be very slow.
Sorry if I wasn't clear. I was pointing out that it does work on the CPU, which suggests it is a CUDA/GPU issue.
I fixed my issue by upgrading torch to the latest version instead of the version specified by requirements.txt. I want to know whether there is a particular reason requirements.txt pins torch 1.1.0, and whether anything will break if I use torch 1.6.0.
Yeah, I suspect there are API changes in torch 1.6.0 that could break the code, especially around GPU training.
Getting the same error with the Docker image on an RTX 2080. Could it be that this card is not supported by torch==1.1.0?
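One way to sanity-check that hypothesis is to compare the card's compute capability with the CUDA build of the installed wheel (a sketch; the sm_75 note is an assumption about a likely cause, not a confirmed diagnosis):
# Turing cards (RTX 2080 / Titan RTX) report compute capability (7, 5).
# If the installed torch wheel was built without sm_75 kernels, cuDNN calls
# can fail with CUDNN_STATUS_EXECUTION_FAILED. (Assumption, not confirmed.)
import torch

print("device:              ", torch.cuda.get_device_name(0))
print("compute capability:  ", torch.cuda.get_device_capability(0))
print("torch built for CUDA:", torch.version.cuda)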
@dnandha the seq2seq baselines are a bit outdated now. Check out the SoTA models that use newer torch versions: https://github.com/askforalfred/alfred#sota-models
Hi,
I'm seeing the same error as another person posted --
(alfred_env) (base) peter@neutronium:~/github/alfred$ python models/train/train_seq2seq.py --data data/json_feat_2.1.0 --model seq2seq_im_mask --dout exp/model:{model},name:pm_and_subgoals_01 --splits data/splits/oct21.json --gpu --batch 8 --pm_aux_loss_wt 0.1 --subgoal_aux_loss_wt 0.1
Namespace(action_loss_wt=1.0, actor_dropout=0.0, attn_dropout=0.0, batch=8, data='data/json_feat_2.1.0', dataset_fraction=0, dec_teacher_forcing=False, decay_epoch=10, demb=100, dframe=2500, dhid=512, dout='exp/model:seq2seq_im_mask,name:pm_and_subgoals_01', epoch=20, fast_epoch=False, gpu=True, hstate_dropout=0.3, input_dropout=0.0, lang_dropout=0.0, lr=0.0001, mask_loss_wt=1.0, model='seq2seq_im_mask', pframe=300, pm_aux_loss_wt=0.1, pp_folder='pp', preprocess=False, resume=None, save_every_epoch=False, seed=123, splits='data/splits/oct21.json', subgoal_aux_loss_wt=0.1, temp_no_history=False, vis_dropout=0.3, zero_goal=False, zero_instr=False)
{'tests_seen': 1533,
'tests_unseen': 1529,
'train': 21023,
'valid_seen': 820,
'valid_unseen': 821}
Traceback (most recent call last):
File "models/train/train_seq2seq.py", line 103, in <module>
model = model.to(torch.device('cuda'))
File "/home/peter/github/alfred_env/lib/python3.7/site-packages/torch/nn/modules/module.py", line 386, in to
return self._apply(convert)
File "/home/peter/github/alfred_env/lib/python3.7/site-packages/torch/nn/modules/module.py", line 193, in _apply
module._apply(fn)
File "/home/peter/github/alfred_env/lib/python3.7/site-packages/torch/nn/modules/rnn.py", line 127, in _apply
self.flatten_parameters()
File "/home/peter/github/alfred_env/lib/python3.7/site-packages/torch/nn/modules/rnn.py", line 123, in flatten_parameters
self.batch_first, bool(self.bidirectional))
RuntimeError: cuDNN error: CUDNN_STATUS_EXECUTION_FAILED
I have verified that I've followed the installation instructions, and that the correct versions of torch (1.1.0), torchvision (0.3.0 in requirements.txt; the prose says 1.3.0, but the latest version is 0.6.0), AI2THOR (2.1.0), and tensorboardX (1.8) are installed.
I'm using a Titan RTX and CUDA 10.1 on Kubuntu 18.04.
The model seems to start training without the --gpu option, but it appears slow (so I didn't wait to see how long it would take).
thanks!