facebookresearch / fairseq

Facebook AI Research Sequence-to-Sequence Toolkit written in Python.

python generate.py encountered the error "RuntimeError: CUDA error: device-side assert triggered" #2286

Open deepTransformer opened 4 years ago

deepTransformer commented 4 years ago

🐛 Bug

To Reproduce

Steps to reproduce the behavior (always include the command you ran):

  1. Run the command:

    python -u .//generate.py processed_convai2_none/bin --path ./models_convai2/model_bert_encoder_embedding/checkpoint9.pt --beam 6 --nbest 6 --gen-subset valid --max-sentences 196 --max-tokens 5000 --max-len-b 25 --remove-bpe ' ##' --sampling --sampling-topp 0.9 --no-repeat-ngram-size 2 --temperature 0.7

    Here processed_convai2_none/bin is the data path and checkpoint9.pt is the model I trained.

  2. See the error:
    /opt/conda/conda-bld/pytorch_1579027003190/work/aten/src/ATen/native/cuda/MultinomialKernel.cu:256: void at::native::<unnamed>::sampleMultinomialOnce(long *, long, int, scalar_t *, scalar_t *, int, int) [with scalar_t = float, accscalar_t = float]: block: [2,0,0], thread: [0,0,0] Assertion `sum > accZero` failed.
    Traceback (most recent call last):
      File ".//generate.py", line 11, in <module>
        cli_main()
      File "/home/research/haha/research-transfer-dialouge/fairseq/fairseq_cli/generate.py", line 269, in cli_main
        main(args)
      File "/home/research/haha/research-transfer-dialouge/fairseq/fairseq_cli/generate.py", line 36, in main
        return _main(args, sys.stdout)
      File "/home/research/haha/research-transfer-dialouge/fairseq/fairseq_cli/generate.py", line 145, in _main
        hypos = task.inference_step(generator, models, sample, prefix_tokens)
      File "/home/research/haha/research-transfer-dialouge/fairseq/fairseq/tasks/fairseq_task.py", line 356, in inference_step
        return generator.generate(models, sample, prefix_tokens=prefix_tokens)
      File "/home/research/miniconda3/envs/torch1.4/lib/python3.6/site-packages/torch/autograd/grad_mode.py", line 49, in decorate_no_grad
        return func(*args, **kwargs)
      File "/home/research/haha/research-transfer-dialouge/fairseq/fairseq/sequence_generator.py", line 161, in generate
        return self._generate(sample, **kwargs)
      File "/home/research/haha/research-transfer-dialouge/fairseq/fairseq/sequence_generator.py", line 310, in _generate
        scores.view(bsz, beam_size, -1)[:, :, :step],
      File "/home/research/haha/research-transfer-dialouge/fairseq/fairseq/search.py", line 272, in step
        beams_buf = torch.arange(0, beam_size).to(indices_buf).repeat(bsz, 1)
    RuntimeError: CUDA error: device-side assert triggered

Expected behavior

In sequence_generator.py, when step == max_len, every element of lprobs except eos is assigned -math.inf, so we expect the generator to emit the eos token:

if step >= max_len:
    lprobs[:, : self.eos] = -math.inf
    lprobs[:, self.eos + 1 :] = -math.inf

That is immediately followed by the no-repeat-ngram check:

if self.no_repeat_ngram_size > 0:
    lprobs = self._no_repeat_ngram(tokens, lprobs, bsz, beam_size, step)

However, sometimes every element of lprobs, including eos, ends up assigned -math.inf. When the code below in search.py then runs, all of the probabilities are 0, so torch.multinomial cannot sample and I hit the error:

indices_buf = torch.multinomial(
    probs.view(bsz * beam_size, -1),
    1,
    replacement=True,
).view(bsz, beam_size)
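
A minimal standalone reproduction of that failing condition (my own sketch, requiring only PyTorch; on CPU the same check fails eagerly with a readable message rather than a device-side assert):

import torch

# every log-probability is -inf, so exp() yields an all-zero distribution
lprobs = torch.full((2, 5), float("-inf"))
probs = lprobs.exp()

# raises: RuntimeError: invalid multinomial distribution (sum of probabilities <= 0)
torch.multinomial(probs, 1, replacement=True)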


lematt1991 commented 4 years ago

You'll want to re-run with the following environment variable: CUDA_LAUNCH_BLOCKING=1. This should give you a more informative error message.
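
For example, prefixing the same generate command from above (all other flags unchanged):

CUDA_LAUNCH_BLOCKING=1 python -u .//generate.py processed_convai2_none/bin --path ./models_convai2/model_bert_encoder_embedding/checkpoint9.pt ...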

deepTransformer commented 4 years ago

> You'll want to re-run with the following environment variable: CUDA_LAUNCH_BLOCKING=1. This should give you a more informative error message.

With CUDA_LAUNCH_BLOCKING=1 set, the output is:
/opt/conda/conda-bld/pytorch_1579027003190/work/aten/src/ATen/native/cuda/MultinomialKernel.cu:256: void at::native::<unnamed>::sampleMultinomialOnce(long *, long, int, scalar_t *, scalar_t *, int, int) [with scalar_t = float, accscalar_t = float]: block: [4,0,0], thread: [0,0,0] Assertion `sum > accZero` failed.
THCudaCheck FAIL file=/opt/conda/conda-bld/pytorch_1579027003190/work/aten/src/THC/generic/THCTensorScatterGather.cu line=71 error=59 : device-side assert triggered
Traceback (most recent call last):
  File ".//generate.py", line 11, in <module>
    cli_main()
  File "/home/research/haha/research-transfer-dialouge/fairseq/fairseq_cli/generate.py", line 269, in cli_main
    main(args)
  File "/home/research/haha/research-transfer-dialouge/fairseq/fairseq_cli/generate.py", line 36, in main
    return _main(args, sys.stdout)
  File "/home/research/haha/research-transfer-dialouge/fairseq/fairseq_cli/generate.py", line 145, in _main
    hypos = task.inference_step(generator, models, sample, prefix_tokens)
  File "/home/research/haha/research-transfer-dialouge/fairseq/fairseq/tasks/fairseq_task.py", line 356, in inference_step
    return generator.generate(models, sample, prefix_tokens=prefix_tokens)
  File "/home/research/miniconda3/envs/torch1.4/lib/python3.6/site-packages/torch/autograd/grad_mode.py", line 49, in decorate_no_grad
    return func(*args, **kwargs)
  File "/home/research/haha/research-transfer-dialouge/fairseq/fairseq/sequence_generator.py", line 161, in generate
    return self._generate(sample, **kwargs)
  File "/home/research/haha/research-transfer-dialouge/fairseq/fairseq/sequence_generator.py", line 310, in _generate
    scores.view(bsz, beam_size, -1)[:, :, :step],
  File "/home/research/haha/research-transfer-dialouge/fairseq/fairseq/search.py", line 257, in step
    probs, dim=2, index=indices_buf.unsqueeze(-1)
RuntimeError: cuda runtime error (59) : device-side assert triggered at /opt/conda/conda-bld/pytorch_1579027003190/work/aten/src/THC/generic/THCTensorScatterGather.cu:71

hyunwoongko commented 3 years ago

I have the same error.

Traceback (most recent call last):
  File "reddit_lm.py", line 94, in <module>
    output = reddit.predict(0, input(">>> : "))
  File "reddit_lm.py", line 77, in predict
    no_repeat_ngram_size=4,
  File "/opt/conda/lib/python3.7/site-packages/fairseq/hub_utils.py", line 127, in sample
    return self.sample([sentences], beam=beam, verbose=verbose, **kwargs)[0]
  File "/opt/conda/lib/python3.7/site-packages/fairseq/hub_utils.py", line 129, in sample
    batched_hypos = self.generate(tokenized_sentences, beam, verbose, **kwargs)
  File "/opt/conda/lib/python3.7/site-packages/fairseq/hub_utils.py", line 170, in generate
    generator, self.models, batch, **inference_step_args
  File "/opt/conda/lib/python3.7/site-packages/fairseq/tasks/language_modeling.py", line 314, in inference_step
    models, sample, prefix_tokens=prefix_tokens, bos_token=bos_token
  File "/opt/conda/lib/python3.7/site-packages/torch/autograd/grad_mode.py", line 15, in decorate_context
    return func(*args, **kwargs)
  File "/opt/conda/lib/python3.7/site-packages/fairseq/sequence_generator.py", line 177, in generate
    return self._generate(sample, **kwargs)
  File "/opt/conda/lib/python3.7/site-packages/fairseq/sequence_generator.py", line 378, in _generate
    original_batch_idxs,
  File "/opt/conda/lib/python3.7/site-packages/fairseq/search.py", line 714, in step
    replacement=True,
RuntimeError: invalid multinomial distribution (sum of probabilities <= 0)

thinkwee commented 3 years ago

I have the same problem.

ryonakamura commented 3 years ago

Is there any way to avoid this? For example, by adding some kind of modification to the logits before softmax.

ryonakamura commented 3 years ago

This problem can be solved by assigning 1 to the eos element of lprobs in fairseq/sequence_generator.py.

Before:

# handle max length constraint
if step >= max_len:
    lprobs[:, : self.eos] = -math.inf
    lprobs[:, self.eos + 1 :] = -math.inf

After:

# handle max length constraint
if step >= max_len:
    lprobs[:, : self.eos] = -math.inf
    lprobs[:, self.eos + 1 :] = -math.inf
    lprobs[:, self.eos] = 1
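
Since every other entry is already -math.inf, any finite value works here: after exp(), eos is the only nonzero column, so torch.multinomial must select it (0, i.e. log(1), is arguably a more natural constant than 1). A more defensive variant (my own sketch, not fairseq's code) that repairs any all -inf row wherever it arises:

import math
import torch

def force_eos_on_dead_rows(lprobs, eos):
    # rows where every log-probability is -inf would exp() to an all-zero
    # distribution and trip torch.multinomial's sum > 0 assertion
    dead = (lprobs == -math.inf).all(dim=-1)
    lprobs[dead, eos] = 0.0  # log-prob 0 == probability 1, so eos is forced
    return lprobs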

Niuyuhang03 commented 2 years ago

I have the same problem. Is there any solution that doesn't require modifying the source code of fairseq/sequence_generator.py?