bigscience-workshop / Megatron-DeepSpeed

Ongoing research training transformer language models at scale, including: BERT & GPT-2

Errors in generation (Bloom) when changing options sampling/use_cache #324

Open thies1006 opened 2 years ago

thies1006 commented 2 years ago

I'm running the inference script bloom-ds-inference.py by invoking deepspeed --num_gpus 1 ~/Megatron-DeepSpeed/scripts/inference/bloom-ds-inference.py --name bigscience/bloom-1b3 --benchmark, but with the generation arguments changed to generate_kwargs = dict(max_new_tokens=num_tokens, do_sample=False, use_cache=False) (i.e. adding the use_cache option).
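In other words, the only edit relative to the script is the generation-arguments dict (sketch; num_tokens is the value the benchmark code in the script already sets):

# the script's default is dict(max_new_tokens=num_tokens, do_sample=False)
generate_kwargs = dict(max_new_tokens=num_tokens,
                       do_sample=False,
                       use_cache=False)  # use_cache=False added for this test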

Error:

*** Starting to generate 100 tokens with bs=1
Generate args {'max_new_tokens': 100, 'do_sample': False, 'use_cache': False}
!!!! kernel execution error. (m: 8192, n: 74, k: 2048, error: 13) 
!!!! kernel execution error. (m: 2048, n: 74, k: 8192, error: 13) 
!!!! kernel execution error. (m: 6144, n: 74, k: 2048, error: 13) 
Traceback (most recent call last):
  File "/secondary/thies/Megatron-DeepSpeed/scripts/inference/bloom-ds-inference.py", line 257, in <module>
    _ = generate()
  File "/secondary/thies/Megatron-DeepSpeed/scripts/inference/bloom-ds-inference.py", line 244, in generate
    outputs = model.generate(**input_tokens, **generate_kwargs)
  File "/secondary/thies/.virtualenvs/bloom/lib/python3.8/site-packages/torch/autograd/grad_mode.py", line 27, in decorate_context
    return func(*args, **kwargs)
  File "/secondary/thies/.virtualenvs/bloom/lib/python3.8/site-packages/transformers/generation_utils.py", line 1294, in generate
    return self.greedy_search(
  File "/secondary/thies/.virtualenvs/bloom/lib/python3.8/site-packages/transformers/generation_utils.py", line 1689, in greedy_search
    outputs = self(
  File "/secondary/thies/.virtualenvs/bloom/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1128, in _call_impl
    result = forward_call(*input, **kwargs)
  File "/secondary/thies/.virtualenvs/bloom/lib/python3.8/site-packages/transformers/models/bloom/modeling_bloom.py", line 821, in forward
    transformer_outputs = self.transformer(
  File "/secondary/thies/.virtualenvs/bloom/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1110, in _call_impl
    return forward_call(*input, **kwargs)
  File "/secondary/thies/.virtualenvs/bloom/lib/python3.8/site-packages/transformers/models/bloom/modeling_bloom.py", line 709, in forward
    outputs = block(
  File "/secondary/thies/.virtualenvs/bloom/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1110, in _call_impl
    return forward_call(*input, **kwargs)
  File "/secondary/thies/.virtualenvs/bloom/lib/python3.8/site-packages/deepspeed/ops/transformer/inference/transformer_inference.py", line 829, in forward
    self.attention(input,
  File "/secondary/thies/.virtualenvs/bloom/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1110, in _call_impl
    return forward_call(*input, **kwargs)
  File "/secondary/thies/.virtualenvs/bloom/lib/python3.8/site-packages/deepspeed/ops/transformer/inference/transformer_inference.py", line 541, in forward
    output = DeepSpeedSelfAttentionFunction.apply(
  File "/secondary/thies/.virtualenvs/bloom/lib/python3.8/site-packages/deepspeed/ops/transformer/inference/transformer_inference.py", line 461, in forward
    output, key_layer, value_layer, context_layer, inp_norm = selfAttention_fp()
  File "/secondary/thies/.virtualenvs/bloom/lib/python3.8/site-packages/deepspeed/ops/transformer/inference/transformer_inference.py", line 425, in selfAttention_fp
    context_layer, key_layer, value_layer = compute_attention(qkv_out[0] if isinstance(qkv_out, list) else qkv_out, input_mask)
  File "/secondary/thies/.virtualenvs/bloom/lib/python3.8/site-packages/deepspeed/ops/transformer/inference/transformer_inference.py", line 373, in compute_attention
    context_layer, presents = backup_attention(qkv_out, layer_past, alibi, input_mask, norm_factor)
  File "/secondary/thies/.virtualenvs/bloom/lib/python3.8/site-packages/deepspeed/ops/transformer/inference/transformer_inference.py", line 203, in backup_attention
    value_layer) = split_tensor_along_last_dim(mixed_x_layer,
  File "/secondary/thies/.virtualenvs/bloom/lib/python3.8/site-packages/deepspeed/ops/transformer/inference/transformer_inference.py", line 189, in split_tensor_along_last_dim
    return tuple(chunk.contiguous() for chunk in tensor_list)
  File "/secondary/thies/.virtualenvs/bloom/lib/python3.8/site-packages/deepspeed/ops/transformer/inference/transformer_inference.py", line 189, in <genexpr>
    return tuple(chunk.contiguous() for chunk in tensor_list)
RuntimeError: CUDA error: an illegal memory access was encountered
CUDA kernel errors might be asynchronously reported at some other API call,so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
terminate called after throwing an instance of 'c10::Error'
  what():  NCCL error in: ../torch/csrc/distributed/c10d/NCCLUtils.hpp:172, unhandled cuda error, NCCL version 21.0.3
Process Group destroyed on rank 0
Exception raised from ncclCommAbort at ../torch/csrc/distributed/c10d/NCCLUtils.hpp:172 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >) + 0x6c (0x7fe702e251dc in /secondary/thies/.virtualenvs/bloom/lib/python3.8/site-packages/torch/lib/libc10.so)
frame #1: c10::detail::torchCheckFail(char const*, char const*, unsigned int, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&) + 0xfa (0x7fe702e02c96 in /secondary/thies/.virtualenvs/bloom/lib/python3.8/site-packages/torch/lib/libc10.so)
frame #2: <unknown function> + 0x2d00603 (0x7fe627b47603 in /secondary/thies/.virtualenvs/bloom/lib/python3.8/site-packages/torch/lib/libtorch_cuda.so)
frame #3: c10d::ProcessGroupNCCL::~ProcessGroupNCCL() + 0x1d1 (0x7fe627b29a01 in /secondary/thies/.virtualenvs/bloom/lib/python3.8/site-packages/torch/lib/libtorch_cuda.so)
frame #4: c10d::ProcessGroupNCCL::~ProcessGroupNCCL() + 0xd (0x7fe627b29ebd in /secondary/thies/.virtualenvs/bloom/lib/python3.8/site-packages/torch/lib/libtorch_cuda.so)
frame #5: <unknown function> + 0x115a211 (0x7fe63e830211 in /secondary/thies/.virtualenvs/bloom/lib/python3.8/site-packages/torch/lib/libtorch_python.so)
frame #6: <unknown function> + 0x11362eb (0x7fe63e80c2eb in /secondary/thies/.virtualenvs/bloom/lib/python3.8/site-packages/torch/lib/libtorch_python.so)
frame #7: <unknown function> + 0xa030e2 (0x7fe63e0d90e2 in /secondary/thies/.virtualenvs/bloom/lib/python3.8/site-packages/torch/lib/libtorch_python.so)
frame #8: <unknown function> + 0xa040a3 (0x7fe63e0da0a3 in /secondary/thies/.virtualenvs/bloom/lib/python3.8/site-packages/torch/lib/libtorch_python.so)
frame #9: /secondary/thies/.virtualenvs/bloom/bin/python() [0x5cedf8]
frame #10: /secondary/thies/.virtualenvs/bloom/bin/python() [0x5d1cdc]
frame #11: PyDict_Clear + 0xeb (0x5cef3b in /secondary/thies/.virtualenvs/bloom/bin/python)
frame #12: /secondary/thies/.virtualenvs/bloom/bin/python() [0x6aa1ba]
frame #13: /secondary/thies/.virtualenvs/bloom/bin/python() [0x4ef8d8]
frame #14: _PyGC_CollectNoFail + 0x2f (0x672bcf in /secondary/thies/.virtualenvs/bloom/bin/python)
frame #15: PyImport_Cleanup + 0x314 (0x685414 in /secondary/thies/.virtualenvs/bloom/bin/python)
frame #16: Py_FinalizeEx + 0x7f (0x68040f in /secondary/thies/.virtualenvs/bloom/bin/python)
frame #17: Py_RunMain + 0x32d (0x6b7a1d in /secondary/thies/.virtualenvs/bloom/bin/python)
frame #18: Py_BytesMain + 0x2d (0x6b7c8d in /secondary/thies/.virtualenvs/bloom/bin/python)
frame #19: __libc_start_main + 0xf3 (0x7fe716fff0b3 in /lib/x86_64-linux-gnu/libc.so.6)
frame #20: _start + 0x2e (0x5fb12e in /secondary/thies/.virtualenvs/bloom/bin/python)

[2022-08-03 15:03:43,770] [INFO] [launch.py:286:sigkill_handler] Killing subprocess 140325
[2022-08-03 15:03:43,770] [ERROR] [launch.py:292:sigkill_handler] ['/secondary/thies/.virtualenvs/bloom/bin/python', '-u', '/secondary/thies/Megatron-DeepSpeed/scripts/inference/bloom-ds-inference.py', '--local_rank=0', '--name', 'bigscience/bloom-1b3', '--benchmark'] exits with return code = -6

When using generate_kwargs = dict(max_new_tokens=num_tokens, do_sample=True, use_cache=False) I get a different error:

*** Starting to generate 100 tokens with bs=1
Generate args {'max_new_tokens': 100, 'do_sample': True, 'use_cache': False}
Traceback (most recent call last):
  File "/secondary/thies/Megatron-DeepSpeed/scripts/inference/bloom-ds-inference.py", line 257, in <module>
    _ = generate()
  File "/secondary/thies/Megatron-DeepSpeed/scripts/inference/bloom-ds-inference.py", line 244, in generate
    outputs = model.generate(**input_tokens, **generate_kwargs)
  File "/secondary/thies/.virtualenvs/bloom/lib/python3.8/site-packages/torch/autograd/grad_mode.py", line 27, in decorate_context
    return func(*args, **kwargs)
  File "/secondary/thies/.virtualenvs/bloom/lib/python3.8/site-packages/transformers/generation_utils.py", line 1326, in generate
    return self.sample(
  File "/secondary/thies/.virtualenvs/bloom/lib/python3.8/site-packages/transformers/generation_utils.py", line 1981, in sample
    next_tokens = torch.multinomial(probs, num_samples=1).squeeze(1)
RuntimeError: probability tensor contains either `inf`, `nan` or element < 0
[2022-08-03 15:06:16,298] [INFO] [launch.py:286:sigkill_handler] Killing subprocess 140658
[2022-08-03 15:06:16,298] [ERROR] [launch.py:292:sigkill_handler] ['/secondary/thies/.virtualenvs/bloom/bin/python', '-u', '/secondary/thies/Megatron-DeepSpeed/scripts/inference/bloom-ds-inference.py', '--local_rank=0', '--name', 'bigscience/bloom-1b3', '--benchmark'] exits with return code = 1
thies1006 commented 2 years ago

The second error also appears, though very sporadically (after thousands of cycles), when using only generate_kwargs = dict(max_new_tokens=num_tokens, do_sample=True), but only (I think) when the input changes between cycles, e.g. using:

def generate():
    """ returns a list of zipped inputs, outputs and number of new tokens """
    random.shuffle(input_sentences)              # reshuffle the prompts on every call
    inputs = input_sentences[:args.batch_size]   # so each cycle sees a different batch
    # ... tokenization and model.generate() continue as in the original script

pai4451 commented 2 years ago

I had the same issue. On my side this illegal memory access error only happens for batch sizes 2 and 4; with batch sizes 8 to 32 I can run the inference script. For batch size 1, the error is intermittent: it only happens when I have a longer input.

thies1006 commented 2 years ago

Thanks @pai4451. To follow up, my impression is that the script works fine for short texts, but with longer ones (>200 tokens) it is more likely to crash. For me it crashes for all batch sizes I tried (1, 2, 4, 8); the error message is not always the same, however. The sampling problem was on my side. I guess I ended up with the same problem as in #318.

pohunghuang-nctu commented 2 years ago

@RezaYazdaniAminabadi Hi, just for your reference: I have tested https://github.com/microsoft/DeepSpeed/pull/2196, but it does not seem to resolve the "illegal memory access" issue on our side.

mayank31398 commented 2 years ago

@pohunghuang-nctu can you confirm your cuda version? I was using 11.6 and getting the same issue. Using 11.3 resolved it for me. Please give it a try. Thanks
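(For reference, a quick way to print the relevant versions from the active environment; just a sketch:)

import torch
import deepspeed

print("torch:", torch.__version__)          # PyTorch build
print("torch CUDA:", torch.version.cuda)    # CUDA toolkit this PyTorch build uses
print("deepspeed:", deepspeed.__version__)  # DeepSpeed version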

pohunghuang-nctu commented 2 years ago

@pohunghuang-nctu can you confirm your cuda version? I was using 11.6 and getting the same issue. Using 11.3 resolved it for me. Please give it a try. Thanks

@mayank31398 Thanks for the suggestion. We're on PyTorch 1.11 + CUDA 11.5; what is your PyTorch version? By the way, are you running BLOOM on a single node (A100 * 8) or on multiple nodes?

mayank31398 commented 2 years ago

@pohunghuang-nctu I have PyTorch installed via conda (with CUDA 11.3), and DeepSpeed and apex are installed from their master branches, built with CUDA 11.3.

pai4451 commented 2 years ago

@pohunghuang-nctu can you confirm your cuda version? I was using 11.6 and getting the same issue. Using 11.3 resolved it for me. Please give it a try. Thanks

@mayank31398 Thanks for the information. Can I confirm that you no longer encounter the illegal memory access error in the following two cases:

  1. long input tokens with batch size 1
  2. batch size > 1

after installing the latest DeepSpeed from the master branch (with microsoft/DeepSpeed#2196) and switching to CUDA 11.3?

I also tried the same (CUDA 11.3 and the latest DeepSpeed from the master branch) but am still facing illegal memory access in both cases. Also, I ran the code on two nodes, each consisting of 8x A6000 GPUs; maybe this is the difference?

mayank31398 commented 2 years ago

I haven't played around that much with it. But batch size >1 is working for me.

mayank31398 commented 2 years ago

I only have a single node with 8 GPUs, 80GB each. Are you using pipeline parallel across nodes? Does DS-inference support that?

RezaYazdaniAminabadi commented 2 years ago

@pai4451 Currently, I limit the token length for each query to 128. I am going to increase this soon. But can you try with a smaller length and see if the issue is resolved? Thanks
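(For a quick test, a minimal sketch of capping the prompt length on the client side, reusing the tokenizer and the inputs list from bloom-ds-inference.py:)

# cap each prompt at 128 tokens before generation
input_tokens = tokenizer(inputs,
                         return_tensors="pt",
                         padding=True,
                         truncation=True,
                         max_length=128)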

RezaYazdaniAminabadi commented 2 years ago

Regarding the batch size, I have tried batch sizes up to 128 and it was working fine on my side.

pai4451 commented 2 years ago

I only have a single node with 8 GPUs, 80GB each. Are you using pipeline parallel across nodes? Does DS-inference support that?

@mayank31398 Thanks. I just launched DeepSpeed with an additional hostfile argument (a file listing each node and its GPU slots) to run on multiple nodes.

@RezaYazdaniAminabadi Yeah, no error occurs when I use 128 input tokens and batch size one. But I found that inputs longer than 600 tokens lead to illegal memory access on my two-node setup, with each node having 8x A6000. Regarding batch sizes, I can run the inference code with batch sizes 1, 8, 16, and 32, but batch sizes 2 and 4 raise an illegal memory access error.

My environment with Python 3.8 and CUDA 11.3:

torch==1.12.1+cu113
deepspeed==0.7.1+28dfca8a
pohunghuang-nctu commented 2 years ago

I only have a single node with 8 GPUs, 80GB each. Are you using pipeline parallel across nodes? Does DS-inference support that?

  1. DS-inference supports multi-node, no doubt.
  2. DeepSpeed MII does NOT support multi-node, because the call into deepspeed is hardcoded to use "--num_gpus". We did some customization to allow it to accept a hostfile and spawn processes across multiple nodes.

mayank31398 commented 2 years ago

@pohunghuang-nctu @pai4451 thanks for letting me know about the multi-node deployment. I am guessing this uses pipeline parallelism? However, what are the advantages of using multiple nodes during inference? I am guessing this would be slower than a single node with 8x A100 80GB GPUs, right?

pai4451 commented 2 years ago

@mayank31398 I don't think there is much advantage in using multiple nodes for inference. We need multi-node inference just because we only have several 8x A6000 48GB servers.

pai4451 commented 2 years ago

@pohunghuang-nctu can you confirm your cuda version? I was using 11.6 and getting the same issue. Using 11.3 resolved it for me. Please give it a try. Thanks

@mayank31398 May I ask about your transformers and deepspeed versions? I just found that using the latest master branch of transformers makes the illegal memory access even worse.

mayank31398 commented 2 years ago

I built DeepSpeed from source (master branch). Also, transformers (4.21.1) is installed using pip.

thies1006 commented 2 years ago

On my side I still get the error RuntimeError: CUDA error: an illegal memory access was encountered (with 128 input tokens, CUDA 11.3, and batch size 1). However, my impression is that it becomes rarer with a lower number of input tokens.

deepspeed info ................... 0.7.1+9b418c1e, 9b418c1e
deepspeed wheel compiled w. ...... torch 1.11, cuda 11.3

mayank31398 commented 2 years ago

I haven't tried adjusting the input tokens, @thies1006. But I can confirm that I ran with input text = "Hello" and generated 10, 50, 100, 300, 500, 1000, 2000, and 5000 tokens, and it didn't crash for me in any scenario.
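(Roughly what I ran, as a sketch; tokenizer and model are the objects already built in bloom-ds-inference.py:)

# fixed short prompt, sweeping the number of generated tokens
short_input = tokenizer(["Hello"], return_tensors="pt", padding=True).to("cuda")
for num_new in [10, 50, 100, 300, 500, 1000, 2000, 5000]:
    _ = model.generate(**short_input, max_new_tokens=num_new, do_sample=False)  # no crash at any setting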

pai4451 commented 2 years ago

@mayank31398 My impression is that it is the number of input tokens, not the number of generated tokens, that triggers the illegal memory access error. I can also generate two to three thousand tokens without issue when the input is short. But when I increase the input sequence past a certain length (in my case, above 600 tokens), these errors happen every time.

Maybe, if you have some time, could you try increasing the number of input tokens and check whether the issue really doesn't show up on your side?
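For example, something along these lines (just a sketch; tokenizer, model and generate_kwargs are the ones from bloom-ds-inference.py) pushes the prompt well past the ~600-token range where I see the crash:

# build an artificially long prompt and run a single generation with it
long_prompt = " ".join(["DeepSpeed is a deep learning optimization library."] * 120)
print("prompt tokens:", len(tokenizer(long_prompt)["input_ids"]))  # well above 600

input_tokens = tokenizer([long_prompt], return_tensors="pt", padding=True).to("cuda")
outputs = model.generate(**input_tokens, **generate_kwargs)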

mayank31398 commented 2 years ago

I see, @pai4451. I'll give it a shot.

thies1006 commented 2 years ago

Probably related: https://github.com/microsoft/DeepSpeed/issues/2062

felix-schneider commented 2 years ago

I am having a similar error running the model on 4 4xA100 40GB cards with batch size 1. After about 18 examples, I get the CUDA illegal access error. This does not seem to be closely related to the sequence length, though, because it happens at about the same point whether I use priming examples or not (which increase the prompt length significantly).

I am on CUDA 11.3, PyTorch 1.10 and DeepSpeed 0.7.0. Edit: I can confirm this is still happening with DeepSpeed 0.7.1.

mayank31398 commented 2 years ago

@RezaYazdaniAminabadi any follow-up on this? I am facing similar CUDA issues with longer input sequence lengths.

RezaYazdaniAminabadi commented 2 years ago

Hi @mayank31398,

I am still working on this. Can I ask what a typical maximum number of input tokens would be? Potentially, this can go to as many tokens as the user requests, but unfortunately there is a limit on what can be cached. Thanks, Reza

mayank31398 commented 2 years ago

@RezaYazdaniAminabadi I am also not sure, but BLOOM is trained using ALiBi, so ideally there should be no limit. I understand that this might not be possible. But GPT-3 allowed input + generated tokens = 4000 tokens. Is that a target you think might be possible?

pai4451 commented 2 years ago

@RezaYazdaniAminabadi I can share my findings. I use two 8x A6000 (48G) nodes for inference, and when the input is more than 600 tokens it always leads to the CUDA illegal memory access error (no matter what value I set for the number of output tokens).

trianxy commented 2 years ago

Hi @mayank31398,

I am still working on this. Can I ask what a typical maximum number of input tokens would be? Potentially, this can go to as many tokens as the user requests, but unfortunately there is a limit on what can be cached. Thanks, Reza

@RezaYazdaniAminabadi I often run such models with ~1000 input tokens and generate ~500 tokens.