facebookresearch / ParlAI

A framework for training and evaluating AI models on a variety of openly available dialogue datasets.
https://parl.ai
MIT License

CUDA error: device-side assert triggered #3666

Closed. SYSU-lulc closed this issue 3 years ago.

SYSU-lulc commented 3 years ago

When training the BART model on the cornell_movie task, a CUDA error occurs during the first validation run.

When I use the command: parlai train_model -m bart --fp16 true -eps 101 --optimizer adam -sval True -veps 1 -mf ./data/models/bart/model_cornell_movie_lr1e-5 -t cornell_movie -bs 4 -lr 1e-5 -gpu 1 --truncate 1024

Errors occurred:

15:09:05 | creating task(s): cornell_movie
15:09:06 | running eval: valid
Traceback (most recent call last):
  File "/mnt/sda/anaconda3/envs/BEN/bin/parlai", line 33, in <module>
    sys.exit(load_entry_point('parlai', 'console_scripts', 'parlai')())
  File "/mnt/sda/ben/util/ParlAI/parlai/__main__.py", line 14, in main
    superscript_main()
  File "/mnt/sda/ben/util/ParlAI/parlai/core/script.py", line 325, in superscript_main
    return SCRIPT_REGISTRY[cmd].klass._run_from_parser_and_opt(opt, parser)
  File "/mnt/sda/ben/util/ParlAI/parlai/core/script.py", line 108, in _run_from_parser_and_opt
    return script.run()
  File "/mnt/sda/ben/util/ParlAI/parlai/scripts/train_model.py", line 935, in run
    return self.train_loop.train()
  File "/mnt/sda/ben/util/ParlAI/parlai/scripts/train_model.py", line 899, in train
    for _train_log in self.train_steps():
  File "/mnt/sda/ben/util/ParlAI/parlai/scripts/train_model.py", line 850, in train_steps
    stop_training = self.validate()
  File "/mnt/sda/ben/util/ParlAI/parlai/scripts/train_model.py", line 500, in validate
    self.valid_worlds, opt, 'valid', opt['validation_max_exs']
  File "/mnt/sda/ben/util/ParlAI/parlai/scripts/train_model.py", line 627, in _run_eval
    task_report = self._run_single_eval(opt, v_world, max_exs_per_worker)
  File "/mnt/sda/ben/util/ParlAI/parlai/scripts/train_model.py", line 593, in _run_single_eval
    valid_world.parley()
  File "/mnt/sda/ben/util/ParlAI/parlai/core/worlds.py", line 865, in parley
    batch_act = self.batch_act(agent_idx, batch_observations[agent_idx])
  File "/mnt/sda/ben/util/ParlAI/parlai/core/worlds.py", line 833, in batch_act
    batch_actions = a.batch_act(batch_observation)
  File "/mnt/sda/ben/util/ParlAI/parlai/core/torch_agent.py", line 2207, in batch_act
    output = self.eval_step(batch)
  File "/mnt/sda/ben/util/ParlAI/parlai/core/torch_generator_agent.py", line 891, in eval_step
    beam_preds_scores, beams = self._generate(batch, self.beam_size, maxlen)
  File "/mnt/sda/ben/util/ParlAI/parlai/core/torch_generator_agent.py", line 1111, in _generate
    score, incr_state = model.decoder(decoder_input, encoder_states, incr_state)
  File "/mnt/sda/anaconda3/envs/BEN/lib/python3.7/site-packages/torch/nn/modules/module.py", line 889, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/mnt/sda/ben/util/ParlAI/parlai/agents/transformer/modules/decoder.py", line 385, in forward
    tensor, encoder_output, encoder_mask, incr_state
  File "/mnt/sda/ben/util/ParlAI/parlai/agents/transformer/modules/decoder.py", line 342, in forward_layers
    incr_state=incr_state.get(idx),
  File "/mnt/sda/anaconda3/envs/BEN/lib/python3.7/site-packages/torch/nn/modules/module.py", line 889, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/mnt/sda/ben/util/ParlAI/parlai/agents/transformer/modules/decoder.py", line 136, in forward
    x = self.ffn(x)
  File "/mnt/sda/anaconda3/envs/BEN/lib/python3.7/site-packages/torch/nn/modules/module.py", line 889, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/mnt/sda/ben/util/ParlAI/parlai/agents/transformer/modules/ffn.py", line 47, in forward
    x = self.nonlinear(self.lin1(x))
  File "/mnt/sda/anaconda3/envs/BEN/lib/python3.7/site-packages/torch/nn/functional.py", line 1459, in gelu
    return torch._C._nn.gelu(input)
RuntimeError: CUDA error: device-side assert triggered
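Note that device-side asserts are reported asynchronously, so the frame shown in the traceback above is not necessarily where the failure originates. As a hedged debugging sketch (CUDA_LAUNCH_BLOCKING is a standard PyTorch/CUDA environment variable that forces synchronous kernel launches; the rest of the command simply repeats the one above), re-running like this should make the stack trace point at the actual failing operation:

CUDA_LAUNCH_BLOCKING=1 parlai train_model -m bart --fp16 true -eps 101 --optimizer adam -sval True -veps 1 -mf ./data/models/bart/model_cornell_movie_lr1e-5 -t cornell_movie -bs 4 -lr 1e-5 -gpu 1 --truncate 1024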

However, training the BART model on other tasks (e.g., convai2 and dailydialog) in the same way does not produce this error, and neither does training other models (e.g., DialoGPT and BERT) on the cornell_movie task.

Thanks for your attention.

SYSU-lulc commented 3 years ago

Oops... sorry for the broken formatting.

stephenroller commented 3 years ago

Bart has a maximum truncation of 512.
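For reference, a sketch of the adjusted command based on that limit (it only lowers --truncate to 512; every other flag is kept exactly as in the command posted above):

parlai train_model -m bart --fp16 true -eps 101 --optimizer adam -sval True -veps 1 -mf ./data/models/bart/model_cornell_movie_lr1e-5 -t cornell_movie -bs 4 -lr 1e-5 -gpu 1 --truncate 512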

SYSU-lulc commented 3 years ago

Thanks a lot!!! I'll try it.