allenai / PRIMER

The official code for PRIMERA: Pyramid-based Masked Sentence Pre-training for Multi-document Summarization
Apache License 2.0

RuntimeError: CUDA error: device-side assert triggered #27

Closed FightingEveryDay0 closed 1 year ago

FightingEveryDay0 commented 1 year ago

When I run the code on the multi_news dataset, it raises the following error:

/pytorch/aten/src/ATen/native/cuda/Indexing.cu:662: indexSelectLargeIndex: block: [217,0,0], thread: [27,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
/pytorch/aten/src/ATen/native/cuda/Indexing.cu:662: indexSelectLargeIndex: block: [217,0,0], thread: [28,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
/pytorch/aten/src/ATen/native/cuda/Indexing.cu:662: indexSelectLargeIndex: block: [217,0,0], thread: [29,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
/pytorch/aten/src/ATen/native/cuda/Indexing.cu:662: indexSelectLargeIndex: block: [217,0,0], thread: [30,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
/pytorch/aten/src/ATen/native/cuda/Indexing.cu:662: indexSelectLargeIndex: block: [217,0,0], thread: [31,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
Traceback (most recent call last):
  File "primer_main.py", line 829, in <module>
    test(args)
  File "primer_main.py", line 621, in test
    trainer.test(model, test_dataloader)
  File "/home/wangyiting/anaconda3/envs/primer/lib/python3.7/site-packages/pytorch_lightning/trainer/trainer.py", line 922, in test
    results = self.__test_given_model(model, test_dataloaders)
  File "/home/wangyiting/anaconda3/envs/primer/lib/python3.7/site-packages/pytorch_lightning/trainer/trainer.py", line 980, in __test_given_model
    results = self.fit(model)
  File "/home/wangyiting/anaconda3/envs/primer/lib/python3.7/site-packages/pytorch_lightning/trainer/trainer.py", line 513, in fit
    self.dispatch()
  File "/home/wangyiting/anaconda3/envs/primer/lib/python3.7/site-packages/pytorch_lightning/trainer/trainer.py", line 547, in dispatch
    self.accelerator.start_testing(self)
  File "/home/wangyiting/anaconda3/envs/primer/lib/python3.7/site-packages/pytorch_lightning/accelerators/accelerator.py", line 77, in start_testing
    self.training_type_plugin.start_testing(trainer)
  File "/home/wangyiting/anaconda3/envs/primer/lib/python3.7/site-packages/pytorch_lightning/plugins/training_type/training_type_plugin.py", line 115, in start_testing
    self._results = trainer.run_test()
  File "/home/wangyiting/anaconda3/envs/primer/lib/python3.7/site-packages/pytorch_lightning/trainer/trainer.py", line 793, in run_test
    eval_loop_results, _ = self.run_evaluation()
  File "/home/wangyiting/anaconda3/envs/primer/lib/python3.7/site-packages/pytorch_lightning/trainer/trainer.py", line 732, in run_evaluation
    output = self.evaluation_loop.evaluation_step(batch, batch_idx, dataloader_idx)
  File "/home/wangyiting/anaconda3/envs/primer/lib/python3.7/site-packages/pytorch_lightning/trainer/evaluation_loop.py", line 160, in evaluation_step
    output = self.trainer.accelerator.test_step(args)
  File "/home/wangyiting/anaconda3/envs/primer/lib/python3.7/site-packages/pytorch_lightning/accelerators/accelerator.py", line 196, in test_step
    return self.training_type_plugin.test_step(*args)
  File "/home/wangyiting/anaconda3/envs/primer/lib/python3.7/site-packages/pytorch_lightning/plugins/training_type/ddp.py", line 293, in test_step
    return self.model(*args, **kwargs)
  File "/home/wangyiting/anaconda3/envs/primer/lib/python3.7/site-packages/torch/nn/modules/module.py", line 889, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/home/wangyiting/anaconda3/envs/primer/lib/python3.7/site-packages/torch/nn/parallel/distributed.py", line 705, in forward
    output = self.module(*inputs[0], **kwargs[0])
  File "/home/wangyiting/anaconda3/envs/primer/lib/python3.7/site-packages/torch/nn/modules/module.py", line 889, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/home/wangyiting/anaconda3/envs/primer/lib/python3.7/site-packages/pytorch_lightning/overrides/base.py", line 59, in forward
    output = self.module.test_step(*inputs, **kwargs)
  File "primer_main.py", line 353, in test_step
    return self.validation_step(batch, batch_idx)
  File "primer_main.py", line 274, in validation_step
    loss = self.shared_step(input_ids, output_ids)
  File "primer_main.py", line 150, in shared_step
    lm_logits = self.forward(input_ids, output_ids)
  File "primer_main.py", line 119, in forward
    use_cache=False,
  File "/home/wangyiting/anaconda3/envs/primer/lib/python3.7/site-packages/torch/nn/modules/module.py", line 889, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/home/wangyiting/anaconda3/envs/primer/lib/python3.7/site-packages/transformers/modeling_bart.py", line 1113, in forward
    return_dict=return_dict,
  File "/home/wangyiting/anaconda3/envs/primer/lib/python3.7/site-packages/torch/nn/modules/module.py", line 889, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/home/wangyiting/anaconda3/envs/primer/lib/python3.7/site-packages/transformers/modeling_bart.py", line 956, in forward
    return_dict=return_dict,
  File "/home/wangyiting/anaconda3/envs/primer/lib/python3.7/site-packages/torch/nn/modules/module.py", line 889, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/home/wangyiting/anaconda3/envs/primer/lib/python3.7/site-packages/transformers/modeling_bart.py", line 335, in forward
    embed_pos = self.embed_positions(input_ids)
  File "/home/wangyiting/anaconda3/envs/primer/lib/python3.7/site-packages/torch/nn/modules/module.py", line 889, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/home/wangyiting/anaconda3/envs/primer/lib/python3.7/site-packages/transformers/modeling_bart.py", line 859, in forward
    return super().forward(positions + self.offset)
  File "/home/wangyiting/anaconda3/envs/primer/lib/python3.7/site-packages/torch/nn/modules/sparse.py", line 158, in forward
    self.norm_type, self.scale_grad_by_freq, self.sparse)
  File "/home/wangyiting/anaconda3/envs/primer/lib/python3.7/site-packages/torch/nn/functional.py", line 1916, in embedding
    return torch.embedding(weight, input, padding_idx, scale_grad_by_freq, sparse)
RuntimeError: CUDA error: device-side assert triggered

As far as I can tell, there is only one dataset named "multi_news". Has anyone run primer_main.py on multi_news without hitting the problem above? Strangely, in my experiments the error only occurs on the multi_news dataset /(ㄒoㄒ)/~~ Thank you very much!

raymondsim commented 1 year ago

Hi, how did you manage to fix this problem? I'm facing the same issue. Changing max_position_embedding in the config does not help.

Update a few minutes later: I fixed this by changing max_length_input in primer_main.py. Apparently the dataloader processes and outputs 4097 tokens when the max is set to 4096, one more than the limit. When I set max_length_input to 2048, the longest input I ended up getting was 2049 tokens.
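
For anyone who would rather keep the limit at 4096, here is a minimal sketch of guarding against that off-by-one before a batch reaches the model. It is not code from primer_main.py: the helper names (overlong_indices, truncate_batch) and the plain-list batch layout are assumptions made for illustration, and the real dataloader works on tensors.

```python
from typing import List

# Assumed limit, matching the max_length_input value discussed above.
MAX_LENGTH_INPUT = 4096


def overlong_indices(batch: List[List[int]], max_len: int) -> List[int]:
    """Return the positions in the batch whose token lists exceed max_len."""
    return [i for i, ids in enumerate(batch) if len(ids) > max_len]


def truncate_batch(batch: List[List[int]], max_len: int) -> List[List[int]]:
    """Cut every token-id list down to at most max_len entries."""
    return [ids[:max_len] for ids in batch]


# Hypothetical batch: one input that is a single token too long, one short input.
batch = [list(range(4097)), list(range(10))]

bad = overlong_indices(batch, MAX_LENGTH_INPUT)   # which inputs would overflow
fixed = truncate_batch(batch, MAX_LENGTH_INPUT)   # same batch, safely clipped

print("overlong inputs:", bad)
print("lengths after truncation:", [len(ids) for ids in fixed])
```

Note that slicing drops the last token of an overlong input, which may be an end-of-sequence marker, so this is a stopgap for checking the diagnosis and not necessarily the right permanent fix.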