bigscience-workshop / Megatron-DeepSpeed

Ongoing research training transformer language models at scale, including: BERT & GPT-2

BLOOM Inference via DeepSpeed-Inference, Accelerate and DeepSpeed-ZeRO #308

Closed - stas00 closed this 2 years ago

stas00 commented 2 years ago

Update: I expanded the PR to include Accelerate and DeepSpeed ZeRO - please see the README for full details.


This PR is sorting out the inference script for BLOOM via DeepSpeed-Inference https://github.com/microsoft/DeepSpeed/pull/2083

I already pushed the main script to main, so this PR contains just the fixes to that script.

setup transformers

make sure you are on the latest transformers@main

setup DeepSpeed

Get the DS master branch

git clone https://github.com/microsoft/DeepSpeed
cd DeepSpeed
pip install -e .

setup Meg-DS

git clone https://github.com/bigscience-workshop/Megatron-DeepSpeed
cd Megatron-DeepSpeed
git checkout bloom-inference

run the script:

deepspeed --num_gpus 8 scripts/inference/bloom-ds-inference.py --name bigscience/bloom --benchmark --batch_size 8

Adapt to the number of GPUs you want, and use the larger models if needed.

P.S. I also added a ZeRO-3 inference script:

deepspeed --num_gpus 8 scripts/inference/bloom-ds-zero-inference.py --name bigscience/bloom

Note that you must edit the NVMe path, and this one is super-slow - but it works and requires only 1x 24GB GPU instead of 8x 80GB :)
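For reference, the NVMe offload lives in the DeepSpeed config inside the script; it looks roughly like this (a sketch - the exact keys/values in bloom-ds-zero-inference.py may differ), with nvme_path being the line to edit:

# sketch of the ZeRO-3 NVMe offload portion of the DeepSpeed config
ds_config = {
    "fp16": {"enabled": True},
    "zero_optimization": {
        "stage": 3,
        "offload_param": {
            "device": "nvme",
            "nvme_path": "/local/nvme",  # <-- edit this to point at your local NVMe mount
            "pin_memory": True,
        },
    },
    "train_micro_batch_size_per_gpu": 1,
}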

On JZ you must do:

srun --pty --account=six@a100 --constraint=a100 --reservation=hug --partition=gpu_p5 --gres=gpu:8 --nodes=1 --cpus-per-task=64 --time 4:00:00 --tasks-per-node=1 bash
cd $six_ALL_CCFRWORK/code/inference/Megatron-DeepSpeed

export TRANSFORMERS_CACHE=$six_ALL_CCFRWORK/models
export HF_DATASETS_CACHE=$six_ALL_CCFRWORK/datasets
export HF_MODULES_CACHE=$six_ALL_CCFRWORK/modules
export HF_METRICS_CACHE=$six_ALL_CCFRWORK/metrics
export HF_DATASETS_OFFLINE=1
export TRANSFORMERS_OFFLINE=1

deepspeed --num_gpus 8 scripts/inference/bloom-ds-inference.py --name bigscience/bloom

(I already pre-cached bigscience/bloom-350m and bigscience/bloom, so it should work in offline mode.)

@RezaYazdaniAminabadi, @jeffra

stas00 commented 2 years ago

with a single A100 it fails with:

$ deepspeed --num_gpus 1 bloom-inference.py --name bigscience/bloom-350m
[...]
DeepSpeed Transformer Inference config is  {'layer_id': 23, 'hidden_size': 1024, 'intermediate_size': 4096, 'heads': 16, 'num_hidden_layers': -1, 'fp16': True, 'pre_layer_norm': True, 'local_rank': -1, 'stochastic_mode': False, 'epsilon': 1e-05, 'mp_size': 1, 'q_int8': False, 'scale_attention': True, 'triangular_masking': True, 'local_attention': False, 'window_size': 1, 'rotary_dim': -1, 'rotate_half': False, 'rotate_every_two': True, 'return_tuple': True, 'mlp_after_attn': True, 'specialized_mode': False, 'training_mp_size': 1, 'bigscience_bloom': True}
[2022-07-09 22:58:55,924] [INFO] [engine.py:143:__init__] Place model to device: 0
!!!! kernel execution error. (m: 1024, n: 3, k: 4096, error: 13) 
!!!! kernel execution error. (m: 3072, n: 3, k: 1024, error: 13) 
Traceback (most recent call last):
  File "bloom-inference.py", line 154, in <module>
    min_length=50,
  File "/home/stas/anaconda3/envs/py38-pt112/lib/python3.8/site-packages/torch/autograd/grad_mode.py", line 27, in decorate_context
    return func(*args, **kwargs)
  File "/mnt/nvme0/code/huggingface/transformers-master/src/transformers/generation_utils.py", line 1288, in generate
    return self.greedy_search(
  File "/mnt/nvme0/code/huggingface/transformers-master/src/transformers/generation_utils.py", line 1683, in greedy_search
    outputs = self(
  File "/home/stas/anaconda3/envs/py38-pt112/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1148, in _call_impl
    result = forward_call(*input, **kwargs)
  File "/mnt/nvme0/code/huggingface/transformers-master/src/transformers/models/bloom/modeling_bloom.py", line 919, in forward
    transformer_outputs = self.transformer(
  File "/home/stas/anaconda3/envs/py38-pt112/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1148, in _call_impl
    result = forward_call(*input, **kwargs)
  File "/mnt/nvme0/code/huggingface/transformers-master/src/transformers/models/bloom/modeling_bloom.py", line 806, in forward
    outputs = block(
  File "/home/stas/anaconda3/envs/py38-pt112/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl
    return forward_call(*input, **kwargs)
  File "/mnt/nvme0/code/github/00optimize/deepspeed/deepspeed/ops/transformer/inference/transformer_inference.py", line 839, in forward
    self.attention(input,
  File "/home/stas/anaconda3/envs/py38-pt112/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl
    return forward_call(*input, **kwargs)
  File "/mnt/nvme0/code/github/00optimize/deepspeed/deepspeed/ops/transformer/inference/transformer_inference.py", line 553, in forward
    output = DeepSpeedSelfAttentionFunction.apply(
  File "/mnt/nvme0/code/github/00optimize/deepspeed/deepspeed/ops/transformer/inference/transformer_inference.py", line 474, in forward
    output, key_layer, value_layer, context_layer, inp_norm = selfAttention_fp()
  File "/mnt/nvme0/code/github/00optimize/deepspeed/deepspeed/ops/transformer/inference/transformer_inference.py", line 438, in selfAttention_fp
    context_layer, key_layer, value_layer = compute_attention(qkv_out[0] if isinstance(qkv_out, list) else qkv_out, input_mask)
  File "/mnt/nvme0/code/github/00optimize/deepspeed/deepspeed/ops/transformer/inference/transformer_inference.py", line 390, in compute_attention
    context_layer, presents = backup_attention(qkv_out, layer_past, alibi, input_mask, norm_factor)
  File "/mnt/nvme0/code/github/00optimize/deepspeed/deepspeed/ops/transformer/inference/transformer_inference.py", line 204, in backup_attention
    value_layer) = split_tensor_along_last_dim(mixed_x_layer,
  File "/mnt/nvme0/code/github/00optimize/deepspeed/deepspeed/ops/transformer/inference/transformer_inference.py", line 189, in split_tensor_along_last_dim
    return tuple(chunk.contiguous() for chunk in tensor_list)
  File "/mnt/nvme0/code/github/00optimize/deepspeed/deepspeed/ops/transformer/inference/transformer_inference.py", line 189, in <genexpr>
    return tuple(chunk.contiguous() for chunk in tensor_list)
RuntimeError: CUDA error: an illegal memory access was encountered
CUDA kernel errors might be asynchronously reported at some other API call,so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
[2022-07-09 22:58:59,156] [INFO] [launch.py:178:sigkill_handler] Killing subprocess 693904
RezaYazdaniAminabadi commented 2 years ago

It is very weird; let me try it on my side and I will push a fix soon.

RezaYazdaniAminabadi commented 2 years ago

I used my own script and could run this fine; I am gonna try it with yours and see if it works.

RezaYazdaniAminabadi commented 2 years ago

I also produced good results with your script on my side:

in=DeepSpeed is
out=DeepSpeed is a function of the number of bits in the data stream, and the number of bits in the data stream is a function of the number of bits in the data stream. The number of bits in the data stream is a function of the

RezaYazdaniAminabadi commented 2 years ago

I found an issue with your script when running multi-GPU: it results in an illegal memory access. I will push a fix now.

RezaYazdaniAminabadi commented 2 years ago

So, I realize I cannot push here. But the change is simple and you can do it on your side. Just change mp_size from 1 to world_size when passing it to init_inference:

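# world_size here is the tensor-parallel degree, e.g. int(os.getenv("WORLD_SIZE", "1")) when launched via the deepspeed launcher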
model = deepspeed.init_inference(model,
                                 mp_size=world_size,
                                 dtype=torch.half,
                                 checkpoint=checkpoints_json,
                                 #injection_policy={BloomBlock: ('self_attention.dense', 'mlp.dense_4h_to_h')}
                                 replace_with_kernel_inject=True
                                 )
model = model.module
RezaYazdaniAminabadi commented 2 years ago

Btw, is there any chance you could try running this on 2 GPUs? Can you please retry this using --num_nodes 1?

stas00 commented 2 years ago

So, I realize I cannot push here.

I sent you an invite to this repo yesterday - please check your emails from GitHub.

The problem was that I hadn't built a kernel for this card. I was able to see that by adding CUDA_LAUNCH_BLOCKING=1, which then gave:

RuntimeError: CUDA error: no kernel image is available for execution on the device

Fixed that now and got past that point (and yes, mp_size should be fixed as well, but it was fine for 1 GPU) - thank you!

Next, it's now crashing here:

DeepSpeed Transformer Inference config is  {'layer_id': 23, 'hidden_size': 1024, 'intermediate_size': 4096, 'heads': 16, 'num_hidden_layers': -1, 'fp16': True, 'pre_layer_norm': True, 'local_rank': -1, 'stochastic_mode': False, 'epsilon': 1e-05, 'mp_size': 1, 'q_int8': False, 'scale_attention': True, 'triangular_masking': True, 'local_attention': False, 'window_size': 1, 'rotary_dim': -1, 'rotate_half': False, 'rotate_every_two': True, 'return_tuple': True, 'mlp_after_attn': True, 'specialized_mode': False, 'training_mp_size': 1, 'bigscience_bloom': True}
[2022-07-10 07:02:11,724] [INFO] [engine.py:143:__init__] Place model to device: 0
Traceback (most recent call last):
  File "bloom-inference.py", line 153, in <module>
    gen_tokens = model.generate(
  File "/home/stas/anaconda3/envs/py38-pt112/lib/python3.8/site-packages/torch/autograd/grad_mode.py", line 27, in decorate_context
    return func(*args, **kwargs)
  File "/mnt/nvme0/code/huggingface/transformers-master/src/transformers/generation_utils.py", line 1288, in generate
    return self.greedy_search(
  File "/mnt/nvme0/code/huggingface/transformers-master/src/transformers/generation_utils.py", line 1683, in greedy_search
    outputs = self(
  File "/home/stas/anaconda3/envs/py38-pt112/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1148, in _call_impl
    result = forward_call(*input, **kwargs)
  File "/mnt/nvme0/code/huggingface/transformers-master/src/transformers/models/bloom/modeling_bloom.py", line 919, in forward
    transformer_outputs = self.transformer(
  File "/home/stas/anaconda3/envs/py38-pt112/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1148, in _call_impl
    result = forward_call(*input, **kwargs)
  File "/mnt/nvme0/code/huggingface/transformers-master/src/transformers/models/bloom/modeling_bloom.py", line 806, in forward
    outputs = block(
  File "/home/stas/anaconda3/envs/py38-pt112/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl
    return forward_call(*input, **kwargs)
  File "/mnt/nvme0/code/github/00optimize/deepspeed/deepspeed/ops/transformer/inference/transformer_inference.py", line 839, in forward
    self.attention(input,
  File "/home/stas/anaconda3/envs/py38-pt112/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl
    return forward_call(*input, **kwargs)
  File "/mnt/nvme0/code/github/00optimize/deepspeed/deepspeed/ops/transformer/inference/transformer_inference.py", line 553, in forward
    output = DeepSpeedSelfAttentionFunction.apply(
  File "/mnt/nvme0/code/github/00optimize/deepspeed/deepspeed/ops/transformer/inference/transformer_inference.py", line 474, in forward
    output, key_layer, value_layer, context_layer, inp_norm = selfAttention_fp()
  File "/mnt/nvme0/code/github/00optimize/deepspeed/deepspeed/ops/transformer/inference/transformer_inference.py", line 438, in selfAttention_fp
    context_layer, key_layer, value_layer = compute_attention(qkv_out[0] if isinstance(qkv_out, list) else qkv_out, input_mask)
  File "/mnt/nvme0/code/github/00optimize/deepspeed/deepspeed/ops/transformer/inference/transformer_inference.py", line 390, in compute_attention
    context_layer, presents = backup_attention(qkv_out, layer_past, alibi, input_mask, norm_factor)
  File "/mnt/nvme0/code/github/00optimize/deepspeed/deepspeed/ops/transformer/inference/transformer_inference.py", line 280, in backup_attention
    context_layer = torch.bmm(attention_probs_reshaped,
RuntimeError: CUDA error: CUBLAS_STATUS_EXECUTION_FAILED when calling `cublasGemmStridedBatchedExFix( handle, opa, opb, m, n, k, (void*)(&falpha), a, CUDA_R_16F, lda, stridea, b, CUDA_R_16F, ldb, strideb, (void*)(&fbeta), c, CUDA_R_16F, ldc, stridec, num_batches, CUDA_R_32F, CUBLAS_GEMM_DEFAULT_TENSOR_OP)`
[2022-07-10 07:02:14,292] [INFO] [launch.py:178:sigkill_handler] Killing subprocess 729431
RezaYazdaniAminabadi commented 2 years ago

Hi @stas00, can you please see if you can run this without kernel injection? Just remove replace_with_kernel_inject from init_inference and pass the injection_policy instead. Thanks
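I.e. roughly along the lines of the commented-out line in the earlier snippet (a sketch; world_size and checkpoints_json are the same objects as above, and BloomBlock is imported from transformers):

import torch
import deepspeed
from transformers.models.bloom.modeling_bloom import BloomBlock

# module-injection variant: no fused kernels, DeepSpeed only shards the named submodules
model = deepspeed.init_inference(model,
                                 mp_size=world_size,
                                 dtype=torch.half,
                                 checkpoint=checkpoints_json,
                                 injection_policy={BloomBlock: ('self_attention.dense', 'mlp.dense_4h_to_h')},
                                 )
model = model.module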

stas00 commented 2 years ago

Reza figured it out - it was a bug in transformers' BLOOM model: the alibi tensor wasn't placed on the correct device. Will send a fix soon.

diff --git a/src/transformers/models/bloom/modeling_bloom.py b/src/transformers/models/bloom/modeling_bloom.py
index ba8edde14..e1723dd68 100644
--- a/src/transformers/models/bloom/modeling_bloom.py
+++ b/src/transformers/models/bloom/modeling_bloom.py
@@ -774,6 +774,7 @@ class BloomModel(BloomPreTrainedModel):
         if past_key_values[0] is not None:
             current_sequence_length += past_key_values[0][0].shape[1]
         alibi = build_alibi_tensor(current_sequence_length, self.n_head, hidden_states.dtype)
+        alibi = alibi.to(device=hidden_states.device)

         for i, (block, layer_past) in enumerate(zip(self.h, past_key_values)):
stas00 commented 2 years ago

Status update - after many hard working days and nights everything works fast and great! Reza++!

Let's generate some text:

model.generate(**tokens, min_length=100, max_length=100, do_sample=False)

in=DeepSpeed is a machine learning framework

out=DeepSpeed is a machine learning framework that is designed to be used by researchers and developers who are interested in applying deep learning to their own problems. It is a Python library that provides a set of tools for training and evaluating deep neural networks. It is designed to be easy to use and to provide a flexible environment for experimentation. DeepSpeed is built on top of the Caffe deep learning framework, and it provides a set of tools for training and evaluating deep neural networks. It is designed to
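For context, the surrounding calls look roughly like this (a sketch - the tokenizer handling is paraphrased, and model is the DeepSpeed-Inference engine set up above):

import torch
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bigscience/bloom")
# tokenize the prompt and move the input tensors onto the current GPU
tokens = tokenizer("DeepSpeed is a machine learning framework", return_tensors="pt")
tokens = tokens.to(torch.cuda.current_device())

gen_tokens = model.generate(**tokens, min_length=100, max_length=100, do_sample=False)
print(tokenizer.decode(gen_tokens[0], skip_special_tokens=True))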

stas00 commented 2 years ago

OK, I instrumented the script to run various benchmarks. On 8x 80GB A100:

deepspeed --num_gpus 8 scripts/inference/bloom-ds-inference.py \
--name bigscience/bloom --benchmark

*** Performance stats:
Throughput per token: 40.73 msecs
Start to ready to generate: 673.429 secs
Tokenize and generate 100 tokens: 4.089 secs
Start to finish: 677.518 secs

Memory per process while processing: (screenshot not included)

Radical!

@RezaYazdaniAminabadi ran the same on 2x 8x 40GB A100 and it was 50 msec per token. The slowness is due to inter-node communication, so the slowdown will depend on the inter-node connectivity - a faster network will lead to higher throughput.

pommedeterresautee commented 2 years ago

Do you plan to run the same benchmark on A100s with 40GB of global memory?

stas00 commented 2 years ago

Yes, updated: https://github.com/bigscience-workshop/Megatron-DeepSpeed/pull/308#issuecomment-1182707582

Will publish a summary of all versions/setups soon.

stas00 commented 2 years ago

I ran this distributed over 6 nodes on 14xRTXA6000 gpus total and was looking at 300ms throughput per token (batch size 1).

What does "this" refer to? There are 3 different scripts/systems in this PR.

In general, the throughput increases as the batch size increases.

And yes, the inter-node network becomes critical when using more than 1 node.

stas00 commented 2 years ago

Thank you for clarifying that it was ds-inference - I have a reference point now.

GPUs perform at their best when they are fully loaded. bs=1 barely loads the GPUs, so they are badly underutilized. When you increase the batch size, the overall processing time stays about the same, but since you can do bs=128, you get to under 1 ms/token on a single 8x 80GB A100 node.

It'll take about the same time to process bs=1 and bs=128.

So the slowdown most likely comes primarily from the inter-node network. If you control the setup, make sure all nodes are on the same subnet and ideally isolated from other traffic. On GCP I saw the performance fluctuate as the traffic went up and down depending on the time of day.

And of course an A6000 is slower than an A100, but that shouldn't make a huge difference to the throughput.

You'd have to ask @RezaYazdaniAminabadi if he is planning further speed-ups, but if I got it right, this is already pretty amazing: <1 msec/token @ bs=128 on 8x 80GB A100.

The next speed-up is likely to come from a BLOOM quantized to 8-bit.

mayank31398 commented 2 years ago

Any specific reason why partitioning is not done symmetrically across the 8 GPUs in bloom-accelerate-inference? 0, 51, 51, 51, 51, 51, 31 is the partition used. This looks weird to me. Is the first GPU not used at all?

Also, I have used OPT for generation too. Both BLOOM and OPT are way slower than the GPT-3 API for generation. Do we know what GPT-3 is doing to accelerate inference?

stas00 commented 2 years ago

I started with a balanced split and was getting OOM at low batch sizes. GPU 0 holds all the cache and the generated results so far, so it pretty much consumes all 80GB for 100 new tokens with a large batch size - there is no extra space left for the weights.
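For illustration, this kind of uneven split can be expressed via the max_memory map that accelerate's device_map="auto" honors - a minimal sketch with numbers matching the partition quoted above, not necessarily the exact values the script uses:

import torch
from transformers import AutoModelForCausalLM

# keep GPU 0 (nearly) free for activations and the generation cache,
# pack the weights onto GPUs 1-7
max_memory = {0: "0GiB"}
max_memory.update({i: "51GiB" for i in range(1, 8)})

model = AutoModelForCausalLM.from_pretrained(
    "bigscience/bloom",
    device_map="auto",
    max_memory=max_memory,
    torch_dtype=torch.bfloat16,
)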

I haven't tried other models besides BLOOM with Accelerate. It's an interesting insight though - it would be a good idea to benchmark.

mayank31398 commented 2 years ago

I will give different memory configs a shot and see if that improves inference times. For now, I am working with HF Accelerate since 15 minutes of loading time in DS is too long for these tests. Also, when we say throughput, throughput = time_taken / total_tokens_in_batch, right?

mayank31398 commented 2 years ago

Not really sure why, but I tried a 45GB x 8 config on A100 80GB using accelerate inference and that is around 5 sec slower for 100 tokens. I think your config might be the best we have.

stas00 commented 2 years ago

Also, when we say throughput, throughput = time_taken / total_tokens_in_batch

That's correct.
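As a rough sanity check against the numbers reported earlier in this thread (a sketch; it assumes that run used batch size 1):

# per-token number = time to generate / (batch_size * new_tokens)
generate_secs = 4.089   # "Tokenize and generate 100 tokens" from the stats above
batch_size = 1          # that run didn't pass --batch_size
new_tokens = 100
msecs_per_token = generate_secs / (batch_size * new_tokens) * 1000
print(f"{msecs_per_token:.2f} msecs per token")  # ~40.9, close to the reported 40.73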

For now, I am working with HF accelerate since 15 mins of loading times in DS is too high for these tests.

This will change shortly. The TP pre-sharded weights and the code to load them are being worked on - hopefully they should be out in a few days - then it'll be on par with Accelerate and may be even faster to load.

I think your config might be the best we have.

I was optimizing to allow for the highest BS, which then gives the highest throughput.

mayank31398 commented 2 years ago
RuntimeError: NCCL error in: /opt/conda/conda-bld/pytorch_1639180588308/work/torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:957, unhandled cuda error, NCCL version 21.0.3
ncclUnhandledCudaError: Call to CUDA function failed.
[2022-07-24 00:45:12,731] [INFO] [launch.py:178:sigkill_handler] Killing subprocess 728484
[2022-07-24 00:45:12,731] [INFO] [launch.py:178:sigkill_handler] Killing subprocess 728485
[2022-07-24 00:45:12,732] [INFO] [launch.py:178:sigkill_handler] Killing subprocess 728486
[2022-07-24 00:45:12,732] [INFO] [launch.py:178:sigkill_handler] Killing subprocess 728487
[2022-07-24 00:45:12,732] [INFO] [launch.py:178:sigkill_handler] Killing subprocess 728488
[2022-07-24 00:45:12,732] [INFO] [launch.py:178:sigkill_handler] Killing subprocess 728489
[2022-07-24 00:45:12,732] [INFO] [launch.py:178:sigkill_handler] Killing subprocess 728490
[2022-07-24 00:45:12,732] [INFO] [launch.py:178:sigkill_handler] Killing subprocess 728491
[2022-07-24 00:45:12,732] [ERROR] [launch.py:184:sigkill_handler] ['/net/llm-shared-nfs/nfs/yelkurdi/conda/miniconda3/envs/llmpt/bin/python', '-u', 'scripts/inference/bloom-ds-inference.py', '--local_rank=7', '--name', 'bigscience/bloom', '--benchmark'] exits with return code = -6

Failing in bloom-ds-inference.py (no changes made to the code)

stas00 commented 2 years ago

Please always share the full traceback, how you invoked the script, and the hardware (CPU RAM + GPU info).

xuyifan-0731 commented 2 years ago

CPU: Linux 5.4.0-110-generic #124-Ubuntu SMP Thu Apr 14 19:46:19 UTC 2022 x86_64 GNU/Linux
GPU: 8x NVIDIA A100 SXM4 80GB

I run the code (https://github.com/bigscience-workshop repo: sorteval):

from transformers import AutoTokenizer, AutoModel
from transformers import AutoModelForCausalLM
import torch
from accelerate import Accelerator
import os
from torch import nn
os.environ['CUDA_VISIBLE_DEVICES'] = '0,1,2,3,4,5,6,7'
def generate_from_text(model, text, tokenizer, max_length=200, greedy=False, top_k=0):
    input_ids = tokenizer.encode(text, return_tensors='pt')
    max_length = input_ids.size(-1) + max_length

    greedy_output = model.generate(
        input_ids.to('cuda:0'),
        max_length=max_length,
        do_sample=not greedy,
        top_k=None if greedy else top_k,
    )
    return tokenizer.decode(greedy_output[0], skip_special_tokens=True)

tokenizer = AutoTokenizer.from_pretrained("bigscience/bloom")
print("load tokenizer")
accelerator = Accelerator()
device = accelerator.device
model = AutoModelForCausalLM.from_pretrained(
    "/mnt/yrfs/aohan/checkpoints/bloom",
    device_map="auto",
    torch_dtype=torch.bfloat16,
    offload_folder="./offload",
)
model = accelerator.prepare(model)
print("load model")

text = 'hello'
output = generate_from_text(model, text, tokenizer, max_length=500, greedy=True, top_k=1)
print(output)

and I got:

load tokenizer
Traceback (most recent call last):
  File "/mnt/yrfs/xuyifan/bloom/bigscience-sorteval/eva.py", line 30, in <module>
    model = accelerator.prepare(model)
  File "/mnt/yrfs/qinkai/miniconda3/envs/bloom12/lib/python3.9/site-packages/accelerate/accelerator.py", line 545, in prepare
    result = tuple(self._prepare_one(obj, first_pass=True) for obj in args)
  File "/mnt/yrfs/qinkai/miniconda3/envs/bloom12/lib/python3.9/site-packages/accelerate/accelerator.py", line 545, in <genexpr>
    result = tuple(self._prepare_one(obj, first_pass=True) for obj in args)
  File "/mnt/yrfs/qinkai/miniconda3/envs/bloom12/lib/python3.9/site-packages/accelerate/accelerator.py", line 441, in _prepare_one
    return self.prepare_model(obj)
  File "/mnt/yrfs/qinkai/miniconda3/envs/bloom12/lib/python3.9/site-packages/accelerate/accelerator.py", line 565, in prepare_model
    model = model.to(self.device)
  File "/mnt/yrfs/qinkai/miniconda3/envs/bloom12/lib/python3.9/site-packages/torch/nn/modules/module.py", line 927, in to
    return self._apply(convert)
  File "/mnt/yrfs/qinkai/miniconda3/envs/bloom12/lib/python3.9/site-packages/torch/nn/modules/module.py", line 579, in _apply
    module._apply(fn)
  File "/mnt/yrfs/qinkai/miniconda3/envs/bloom12/lib/python3.9/site-packages/torch/nn/modules/module.py", line 579, in _apply
    module._apply(fn)
  File "/mnt/yrfs/qinkai/miniconda3/envs/bloom12/lib/python3.9/site-packages/torch/nn/modules/module.py", line 579, in _apply
    module._apply(fn)
  [Previous line repeated 2 more times]
  File "/mnt/yrfs/qinkai/miniconda3/envs/bloom12/lib/python3.9/site-packages/torch/nn/modules/module.py", line 602, in _apply
    param_applied = fn(param)
  File "/mnt/yrfs/qinkai/miniconda3/envs/bloom12/lib/python3.9/site-packages/torch/nn/modules/module.py", line 925, in convert
    return t.to(device, dtype if t.is_floating_point() or t.is_complex() else None, non_blocking)
RuntimeError: CUDA out of memory. Tried to allocate 1.53 GiB (GPU 0; 79.17 GiB total capacity; 77.14 GiB already allocated; 1.20 GiB free; 77.14 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation.  See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
mayank31398 commented 2 years ago

I think you are not launching via deepspeed:

deepspeed --num_gpus 8 scripts/inference/bloom-ds-inference.py --name bigscience/bloom --benchmark

Use this

mayank31398 commented 2 years ago

I am running on an 8x A100 80GB node with a 64-core AMD EPYC CPU and 1.5 TB of RAM.

Hey, this is the full traceback:

Traceback (most recent call last):
  File "scripts/inference/bloom-ds-inference.py", line 257, in <module>
    _ = generate()
  File "scripts/inference/bloom-ds-inference.py", line 244, in generate
    outputs = model.generate(**input_tokens, **generate_kwargs)
  File "/net/llm-shared-nfs/nfs/yelkurdi/conda/miniconda3/envs/llmpt/lib/python3.8/site-packages/torch/autograd/grad_mode.py", line 28, in decorate_context
    return func(*args, **kwargs)
  File "/net/llm-shared-nfs/nfs/yelkurdi/conda/miniconda3/envs/llmpt/lib/python3.8/site-packages/transformers/generation_utils.py", line 1288, in generate
    return self.greedy_search(
  File "/net/llm-shared-nfs/nfs/yelkurdi/conda/miniconda3/envs/llmpt/lib/python3.8/site-packages/transformers/generation_utils.py", line 1683, in greedy_search
    outputs = self(
  File "/net/llm-shared-nfs/nfs/yelkurdi/conda/miniconda3/envs/llmpt/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1102, in _call_impl
    return forward_call(*input, **kwargs)
  File "/net/llm-shared-nfs/nfs/yelkurdi/conda/miniconda3/envs/llmpt/lib/python3.8/site-packages/deepspeed/inference/engine.py", line 508, in forward
    outputs = self.model_orig_fwd(*inputs, **kwargs)
  File "/net/llm-shared-nfs/nfs/yelkurdi/conda/miniconda3/envs/llmpt/lib/python3.8/site-packages/transformers/models/bloom/modeling_bloom.py", line 919, in forward
    transformer_outputs = self.transformer(
  File "/net/llm-shared-nfs/nfs/yelkurdi/conda/miniconda3/envs/llmpt/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1102, in _call_impl
    return forward_call(*input, **kwargs)
  File "/net/llm-shared-nfs/nfs/yelkurdi/conda/miniconda3/envs/llmpt/lib/python3.8/site-packages/transformers/models/bloom/modeling_bloom.py", line 806, in forward
    outputs = block(
  File "/net/llm-shared-nfs/nfs/yelkurdi/conda/miniconda3/envs/llmpt/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1102, in _call_impl
    return forward_call(*input, **kwargs)
  File "/net/llm-shared-nfs/nfs/yelkurdi/conda/miniconda3/envs/llmpt/lib/python3.8/site-packages/deepspeed/ops/transformer/inference/transformer_inference.py", line 831, in forward
    self.attention(input,
  File "/net/llm-shared-nfs/nfs/yelkurdi/conda/miniconda3/envs/llmpt/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1102, in _call_impl
    return forward_call(*input, **kwargs)
  File "/net/llm-shared-nfs/nfs/yelkurdi/conda/miniconda3/envs/llmpt/lib/python3.8/site-packages/deepspeed/ops/transformer/inference/transformer_inference.py", line 543, in forward
    output = DeepSpeedSelfAttentionFunction.apply(
  File "/net/llm-shared-nfs/nfs/yelkurdi/conda/miniconda3/envs/llmpt/lib/python3.8/site-packages/deepspeed/ops/transformer/inference/transformer_inference.py", line 466, in forward
    dist.all_reduce(output, group=mp_group)
  File "/net/llm-shared-nfs/nfs/yelkurdi/conda/miniconda3/envs/llmpt/lib/python3.8/site-packages/deepspeed/comm/comm.py", line 312, in all_reduce
    return cdb.all_reduce(tensor, op, group, async_op)
  File "/net/llm-shared-nfs/nfs/yelkurdi/conda/miniconda3/envs/llmpt/lib/python3.8/site-packages/deepspeed/comm/torch.py", line 49, in all_reduce
    return torch.distributed.all_reduce(tensor=tensor,
  File "/net/llm-shared-nfs/nfs/yelkurdi/conda/miniconda3/envs/llmpt/lib/python3.8/site-packages/torch/distributed/distributed_c10d.py", line 1287, in all_reduce
    work = group.allreduce([tensor], opts)
RuntimeError: NCCL error in: /opt/conda/conda-bld/pytorch_1639180588308/work/torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:957, unhandled cuda error, NCCL version 21.0.3
ncclUnhandledCudaError: Call to CUDA function failed

I don't think this is a bug in the code though. The error is occurring in NCCL for some reason.

RezaYazdaniAminabadi commented 2 years ago

Hi @mayank31398

Do you see any kernel error before the stack trace in the log, or an illegal memory access somewhere? Would you mind sharing the whole log? Thanks, Reza

mayank31398 commented 2 years ago

@RezaYazdaniAminabadi Am I supposed to use a specific DeepSpeed branch? Full trace:

[2022-07-25 08:20:58,333] [WARNING] [runner.py:159:fetch_hostfile] Unable to find hostfile, will proceed with training with local resources only.
[2022-07-25 08:21:01,366] [INFO] [runner.py:457:main] cmd = /net/llm-shared-nfs/nfs/yelkurdi/conda/miniconda3/envs/llmpt/bin/python -u -m deepspeed.launcher.launch --world_info=eyJsb2NhbGhvc3QiOiBbMCwgMSwgMiwgMywgNCwgNSwgNiwgN119 --master_addr=127.0.0.1 --master_port=29500 scripts/inference/bloom-ds-inference.py --name bigscience/bloom --benchmark
[2022-07-25 08:21:02,279] [INFO] [launch.py:103:main] WORLD INFO DICT: {'localhost': [0, 1, 2, 3, 4, 5, 6, 7]}
[2022-07-25 08:21:02,279] [INFO] [launch.py:109:main] nnodes=1, num_local_procs=8, node_rank=0
[2022-07-25 08:21:02,279] [INFO] [launch.py:122:main] global_rank_mapping=defaultdict(<class 'list'>, {'localhost': [0, 1, 2, 3, 4, 5, 6, 7]})
[2022-07-25 08:21:02,279] [INFO] [launch.py:123:main] dist_world_size=8
[2022-07-25 08:21:02,279] [INFO] [launch.py:125:main] Setting CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7
[2022-07-25 08:21:03,698] [INFO] [comm.py:423:init_distributed] Initializing TorchBackend in DeepSpeed with backend nccl
*** Loading the model bigscience/bloom
[2022-07-25 08:21:12,069] [INFO] [utils.py:827:see_memory_usage] pre-from-pretrained
[2022-07-25 08:21:12,070] [INFO] [utils.py:828:see_memory_usage] MA 0.0 GB         Max_MA 0.0 GB         CA 0.0 GB         Max_CA 0 GB
[2022-07-25 08:21:12,070] [INFO] [utils.py:836:see_memory_usage] CPU Virtual Memory:  used = 11.16 GB, percent = 0.9%
[2022-07-25 08:21:12,219] [INFO] [utils.py:827:see_memory_usage] post-from-pretrained
[2022-07-25 08:21:12,220] [INFO] [utils.py:828:see_memory_usage] MA 0.0 GB         Max_MA 0.0 GB         CA 0.0 GB         Max_CA 0 GB
[2022-07-25 08:21:12,220] [INFO] [utils.py:836:see_memory_usage] CPU Virtual Memory:  used = 11.17 GB, percent = 0.9%
[2022-07-25 08:21:12,266] [INFO] [utils.py:827:see_memory_usage] post-init-ds-zero-init
[2022-07-25 08:21:12,266] [INFO] [utils.py:828:see_memory_usage] MA 0.0 GB         Max_MA 0.0 GB         CA 0.0 GB         Max_CA 0 GB
[2022-07-25 08:21:12,267] [INFO] [utils.py:836:see_memory_usage] CPU Virtual Memory:  used = 11.17 GB, percent = 0.9%
[2022-07-25 08:21:21,406] [INFO] [utils.py:827:see_memory_usage] pre-ds-inference-init
[2022-07-25 08:21:21,407] [INFO] [utils.py:828:see_memory_usage] MA 0.0 GB         Max_MA 0.0 GB         CA 0.0 GB         Max_CA 0 GB
[2022-07-25 08:21:21,407] [INFO] [utils.py:836:see_memory_usage] CPU Virtual Memory:  used = 32.56 GB, percent = 2.6%
[2022-07-25 08:21:21,407] [INFO] [logging.py:69:log_dist] [Rank 0] DeepSpeed info: version=0.7.0+8413b7f8, git-hash=8413b7f8, git-branch=master
[2022-07-25 08:21:21,407] [INFO] [logging.py:69:log_dist] [Rank 0] quantize_bits = 8 mlp_extra_grouping = False, quantize_groups = 1
Installed CUDA version 11.6 does not match the version torch was compiled with 11.3 but since the APIs are compatible, accepting this combination
Installed CUDA version 11.6 does not match the version torch was compiled with 11.3 but since the APIs are compatible, accepting this combination
Installed CUDA version 11.6 does not match the version torch was compiled with 11.3 but since the APIs are compatible, accepting this combination
Installed CUDA version 11.6 does not match the version torch was compiled with 11.3 but since the APIs are compatible, accepting this combination
Installed CUDA version 11.6 does not match the version torch was compiled with 11.3 but since the APIs are compatible, accepting this combination
Installed CUDA version 11.6 does not match the version torch was compiled with 11.3 but since the APIs are compatible, accepting this combination
Installed CUDA version 11.6 does not match the version torch was compiled with 11.3 but since the APIs are compatible, accepting this combination
Installed CUDA version 11.6 does not match the version torch was compiled with 11.3 but since the APIs are compatible, accepting this combination
Using /home/gpttest/.cache/torch_extensions/py38_cu113 as PyTorch extensions root...
Using /home/gpttest/.cache/torch_extensions/py38_cu113 as PyTorch extensions root...
Using /home/gpttest/.cache/torch_extensions/py38_cu113 as PyTorch extensions root...Using /home/gpttest/.cache/torch_extensions/py38_cu113 as PyTorch extensions root...Using /home/gpttest/.cache/torch_extensions/py38_cu113 as PyTorch extensions root...

Using /home/gpttest/.cache/torch_extensions/py38_cu113 as PyTorch extensions root...
Using /home/gpttest/.cache/torch_extensions/py38_cu113 as PyTorch extensions root...
Using /home/gpttest/.cache/torch_extensions/py38_cu113 as PyTorch extensions root...
/net/llm-shared-nfs/nfs/yelkurdi/conda/miniconda3/envs/llmpt/lib/python3.8/site-packages/torch/utils/cpp_extension.py:295: UserWarning:

                               !! WARNING !!

!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
Your compiler (c++) is not compatible with the compiler Pytorch was
built with for this platform, which is g++ on linux. Please
use g++ to to compile your extension. Alternatively, you may
compile PyTorch from source using c++, and then you can also use
c++ to compile your extension.

See https://github.com/pytorch/pytorch/blob/master/CONTRIBUTING.md for help
with compiling PyTorch from source.
!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!

                              !! WARNING !!

  warnings.warn(WRONG_COMPILER_WARNING.format(
Detected CUDA files, patching ldflags
Emitting ninja build file /home/gpttest/.cache/torch_extensions/py38_cu113/transformer_inference/build.ninja...
Building extension module transformer_inference...
Allowing ninja to set a default number of workers... (overridable by setting the environment variable MAX_JOBS=N)
ninja: no work to do.
Loading extension module transformer_inference...
Time to load transformer_inference op: 0.30841994285583496 seconds
[2022-07-25 08:21:22,096] [INFO] [logging.py:69:log_dist] [Rank 0] DeepSpeed-Inference config: {'layer_id': 0, 'hidden_size': 14336, 'intermediate_size': 57344, 'heads': 112, 'num_hidden_layers': -1, 'fp16': True, 'pre_layer_norm': True, 'local_rank': -1, 'stochastic_mode': False, 'epsilon': 1e-05, 'mp_size': 8, 'q_int8': False, 'scale_attention': True, 'triangular_masking': True, 'local_attention': False, 'window_size': 1, 'rotary_dim': -1, 'rotate_half': False, 'rotate_every_two': True, 'return_tuple': True, 'mlp_after_attn': True, 'specialized_mode': False, 'training_mp_size': 1, 'bigscience_bloom': True}
Loading extension module transformer_inference...
Time to load transformer_inference op: 0.35624146461486816 seconds
Loading extension module transformer_inference...
Loading extension module transformer_inference...
Loading extension module transformer_inference...
Loading extension module transformer_inference...
Time to load transformer_inference op: 0.35947346687316895 seconds
Time to load transformer_inference op: 0.35944700241088867 seconds
Time to load transformer_inference op: 0.3590846061706543 seconds
Time to load transformer_inference op: 0.35912346839904785 seconds
Loading extension module transformer_inference...
Loading extension module transformer_inference...
Time to load transformer_inference op: 0.3587977886199951 seconds
Time to load transformer_inference op: 0.35920000076293945 seconds
Loading 72 checkpoint shards:   0%|          | 0/72 [13:06<?, ?it/s]1.39s/it]
[2022-07-25 08:34:29,134] [INFO] [engine.py:145:__init__] Place model to device: 7
Loading 72 checkpoint shards:   0%|          | 0/72 [13:11<?, ?it/s]
[2022-07-25 08:34:34,207] [INFO] [engine.py:145:__init__] Place model to device: 6
Loading 72 checkpoint shards:   0%|          | 0/72 [13:13<?, ?it/s]
[2022-07-25 08:34:36,378] [INFO] [engine.py:145:__init__] Place model to device: 3
Loading 72 checkpoint shards: 100%|██████████| 72/72 [13:14<00:00, 11.04s/it]
[2022-07-25 08:34:37,256] [INFO] [engine.py:145:__init__] Place model to device: 0
[2022-07-25 08:34:37,391] [INFO] [utils.py:827:see_memory_usage] post-ds-inference-init
[2022-07-25 08:34:37,392] [INFO] [utils.py:828:see_memory_usage] MA 47.04 GB         Max_MA 47.24 GB         CA 47.04 GB         Max_CA 47 GB
[2022-07-25 08:34:37,392] [INFO] [utils.py:836:see_memory_usage] CPU Virtual Memory:  used = 55.12 GB, percent = 4.4%
*** Starting to generate 100 tokens with bs=1
Generate args {'max_new_tokens': 100, 'do_sample': False}
Loading 72 checkpoint shards:   0%|          | 0/72 [13:19<?, ?it/s]
[2022-07-25 08:34:42,599] [INFO] [engine.py:145:__init__] Place model to device: 1
Loading 72 checkpoint shards:   0%|          | 0/72 [13:25<?, ?it/s]
[2022-07-25 08:34:48,165] [INFO] [engine.py:145:__init__] Place model to device: 5
Loading 72 checkpoint shards:   0%|          | 0/72 [13:28<?, ?it/s]
[2022-07-25 08:34:51,199] [INFO] [engine.py:145:__init__] Place model to device: 2
Loading 72 checkpoint shards:   0%|          | 0/72 [13:34<?, ?it/s]
[2022-07-25 08:34:57,001] [INFO] [engine.py:145:__init__] Place model to device: 4
Traceback (most recent call last):
  File "scripts/inference/bloom-ds-inference.py", line 257, in <module>
    _ = generate()
  File "scripts/inference/bloom-ds-inference.py", line 244, in generate
    outputs = model.generate(**input_tokens, **generate_kwargs)
  File "/net/llm-shared-nfs/nfs/yelkurdi/conda/miniconda3/envs/llmpt/lib/python3.8/site-packages/torch/autograd/grad_mode.py", line 28, in decorate_context
    return func(*args, **kwargs)
  File "/net/llm-shared-nfs/nfs/yelkurdi/conda/miniconda3/envs/llmpt/lib/python3.8/site-packages/transformers/generation_utils.py", line 1288, in generate
    return self.greedy_search(
  File "/net/llm-shared-nfs/nfs/yelkurdi/conda/miniconda3/envs/llmpt/lib/python3.8/site-packages/transformers/generation_utils.py", line 1683, in greedy_search
    outputs = self(
  File "/net/llm-shared-nfs/nfs/yelkurdi/conda/miniconda3/envs/llmpt/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1102, in _call_impl
    return forward_call(*input, **kwargs)
  File "/net/llm-shared-nfs/nfs/yelkurdi/conda/miniconda3/envs/llmpt/lib/python3.8/site-packages/deepspeed/inference/engine.py", line 508, in forward
    outputs = self.model_orig_fwd(*inputs, **kwargs)
  File "/net/llm-shared-nfs/nfs/yelkurdi/conda/miniconda3/envs/llmpt/lib/python3.8/site-packages/transformers/models/bloom/modeling_bloom.py", line 919, in forward
    transformer_outputs = self.transformer(
  File "/net/llm-shared-nfs/nfs/yelkurdi/conda/miniconda3/envs/llmpt/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1102, in _call_impl
    return forward_call(*input, **kwargs)
  File "/net/llm-shared-nfs/nfs/yelkurdi/conda/miniconda3/envs/llmpt/lib/python3.8/site-packages/transformers/models/bloom/modeling_bloom.py", line 806, in forward
    outputs = block(
  File "/net/llm-shared-nfs/nfs/yelkurdi/conda/miniconda3/envs/llmpt/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1102, in _call_impl
    return forward_call(*input, **kwargs)
  File "/net/llm-shared-nfs/nfs/yelkurdi/conda/miniconda3/envs/llmpt/lib/python3.8/site-packages/deepspeed/ops/transformer/inference/transformer_inference.py", line 831, in forward
    self.attention(input,
  File "/net/llm-shared-nfs/nfs/yelkurdi/conda/miniconda3/envs/llmpt/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1102, in _call_impl
    return forward_call(*input, **kwargs)
  File "/net/llm-shared-nfs/nfs/yelkurdi/conda/miniconda3/envs/llmpt/lib/python3.8/site-packages/deepspeed/ops/transformer/inference/transformer_inference.py", line 543, in forward
    output = DeepSpeedSelfAttentionFunction.apply(
  File "/net/llm-shared-nfs/nfs/yelkurdi/conda/miniconda3/envs/llmpt/lib/python3.8/site-packages/deepspeed/ops/transformer/inference/transformer_inference.py", line 466, in forward
    dist.all_reduce(output, group=mp_group)
  File "/net/llm-shared-nfs/nfs/yelkurdi/conda/miniconda3/envs/llmpt/lib/python3.8/site-packages/deepspeed/comm/comm.py", line 312, in all_reduce
    return cdb.all_reduce(tensor, op, group, async_op)
  File "/net/llm-shared-nfs/nfs/yelkurdi/conda/miniconda3/envs/llmpt/lib/python3.8/site-packages/deepspeed/comm/torch.py", line 49, in all_reduce
    return torch.distributed.all_reduce(tensor=tensor,
  File "/net/llm-shared-nfs/nfs/yelkurdi/conda/miniconda3/envs/llmpt/lib/python3.8/site-packages/torch/distributed/distributed_c10d.py", line 1287, in all_reduce
    work = group.allreduce([tensor], opts)
RuntimeError: NCCL error in: /opt/conda/conda-bld/pytorch_1639180588308/work/torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:957, unhandled cuda error, NCCL version 21.0.3
ncclUnhandledCudaError: Call to CUDA function failed.
Traceback (most recent call last):
  File "scripts/inference/bloom-ds-inference.py", line 257, in <module>
    _ = generate()
  File "scripts/inference/bloom-ds-inference.py", line 244, in generate
    outputs = model.generate(**input_tokens, **generate_kwargs)
  File "/net/llm-shared-nfs/nfs/yelkurdi/conda/miniconda3/envs/llmpt/lib/python3.8/site-packages/torch/autograd/grad_mode.py", line 28, in decorate_context
    return func(*args, **kwargs)
  File "/net/llm-shared-nfs/nfs/yelkurdi/conda/miniconda3/envs/llmpt/lib/python3.8/site-packages/transformers/generation_utils.py", line 1288, in generate
    return self.greedy_search(
  File "/net/llm-shared-nfs/nfs/yelkurdi/conda/miniconda3/envs/llmpt/lib/python3.8/site-packages/transformers/generation_utils.py", line 1683, in greedy_search
    outputs = self(
  File "/net/llm-shared-nfs/nfs/yelkurdi/conda/miniconda3/envs/llmpt/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1102, in _call_impl
    return forward_call(*input, **kwargs)
  File "/net/llm-shared-nfs/nfs/yelkurdi/conda/miniconda3/envs/llmpt/lib/python3.8/site-packages/deepspeed/inference/engine.py", line 508, in forward
    outputs = self.model_orig_fwd(*inputs, **kwargs)
  File "/net/llm-shared-nfs/nfs/yelkurdi/conda/miniconda3/envs/llmpt/lib/python3.8/site-packages/transformers/models/bloom/modeling_bloom.py", line 919, in forward
    transformer_outputs = self.transformer(
  File "/net/llm-shared-nfs/nfs/yelkurdi/conda/miniconda3/envs/llmpt/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1102, in _call_impl
    return forward_call(*input, **kwargs)
  File "/net/llm-shared-nfs/nfs/yelkurdi/conda/miniconda3/envs/llmpt/lib/python3.8/site-packages/transformers/models/bloom/modeling_bloom.py", line 806, in forward
    outputs = block(
  File "/net/llm-shared-nfs/nfs/yelkurdi/conda/miniconda3/envs/llmpt/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1102, in _call_impl
    return forward_call(*input, **kwargs)
  File "/net/llm-shared-nfs/nfs/yelkurdi/conda/miniconda3/envs/llmpt/lib/python3.8/site-packages/deepspeed/ops/transformer/inference/transformer_inference.py", line 831, in forward
    self.attention(input,
  File "/net/llm-shared-nfs/nfs/yelkurdi/conda/miniconda3/envs/llmpt/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1102, in _call_impl
    return forward_call(*input, **kwargs)
  File "/net/llm-shared-nfs/nfs/yelkurdi/conda/miniconda3/envs/llmpt/lib/python3.8/site-packages/deepspeed/ops/transformer/inference/transformer_inference.py", line 543, in forward
    output = DeepSpeedSelfAttentionFunction.apply(
  File "/net/llm-shared-nfs/nfs/yelkurdi/conda/miniconda3/envs/llmpt/lib/python3.8/site-packages/deepspeed/ops/transformer/inference/transformer_inference.py", line 466, in forward
    dist.all_reduce(output, group=mp_group)
  File "/net/llm-shared-nfs/nfs/yelkurdi/conda/miniconda3/envs/llmpt/lib/python3.8/site-packages/deepspeed/comm/comm.py", line 312, in all_reduce
    return cdb.all_reduce(tensor, op, group, async_op)
  File "/net/llm-shared-nfs/nfs/yelkurdi/conda/miniconda3/envs/llmpt/lib/python3.8/site-packages/deepspeed/comm/torch.py", line 49, in all_reduce
    return torch.distributed.all_reduce(tensor=tensor,
  File "/net/llm-shared-nfs/nfs/yelkurdi/conda/miniconda3/envs/llmpt/lib/python3.8/site-packages/torch/distributed/distributed_c10d.py", line 1287, in all_reduce
    work = group.allreduce([tensor], opts)
RuntimeError: NCCL error in: /opt/conda/conda-bld/pytorch_1639180588308/work/torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:957, unhandled cuda error, NCCL version 21.0.3
ncclUnhandledCudaError: Call to CUDA function failed.
Traceback (most recent call last):
  File "scripts/inference/bloom-ds-inference.py", line 257, in <module>
    _ = generate()
  File "scripts/inference/bloom-ds-inference.py", line 244, in generate
    outputs = model.generate(**input_tokens, **generate_kwargs)
  File "/net/llm-shared-nfs/nfs/yelkurdi/conda/miniconda3/envs/llmpt/lib/python3.8/site-packages/torch/autograd/grad_mode.py", line 28, in decorate_context
    return func(*args, **kwargs)
  File "/net/llm-shared-nfs/nfs/yelkurdi/conda/miniconda3/envs/llmpt/lib/python3.8/site-packages/transformers/generation_utils.py", line 1288, in generate
    return self.greedy_search(
  File "/net/llm-shared-nfs/nfs/yelkurdi/conda/miniconda3/envs/llmpt/lib/python3.8/site-packages/transformers/generation_utils.py", line 1683, in greedy_search
    outputs = self(
  File "/net/llm-shared-nfs/nfs/yelkurdi/conda/miniconda3/envs/llmpt/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1102, in _call_impl
    return forward_call(*input, **kwargs)
  File "/net/llm-shared-nfs/nfs/yelkurdi/conda/miniconda3/envs/llmpt/lib/python3.8/site-packages/deepspeed/inference/engine.py", line 508, in forward
    outputs = self.model_orig_fwd(*inputs, **kwargs)
  File "/net/llm-shared-nfs/nfs/yelkurdi/conda/miniconda3/envs/llmpt/lib/python3.8/site-packages/transformers/models/bloom/modeling_bloom.py", line 919, in forward
    transformer_outputs = self.transformer(
  File "/net/llm-shared-nfs/nfs/yelkurdi/conda/miniconda3/envs/llmpt/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1102, in _call_impl
    return forward_call(*input, **kwargs)
  File "/net/llm-shared-nfs/nfs/yelkurdi/conda/miniconda3/envs/llmpt/lib/python3.8/site-packages/transformers/models/bloom/modeling_bloom.py", line 806, in forward
    outputs = block(
  File "/net/llm-shared-nfs/nfs/yelkurdi/conda/miniconda3/envs/llmpt/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1102, in _call_impl
    return forward_call(*input, **kwargs)
  File "/net/llm-shared-nfs/nfs/yelkurdi/conda/miniconda3/envs/llmpt/lib/python3.8/site-packages/deepspeed/ops/transformer/inference/transformer_inference.py", line 831, in forward
    self.attention(input,
  File "/net/llm-shared-nfs/nfs/yelkurdi/conda/miniconda3/envs/llmpt/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1102, in _call_impl
    return forward_call(*input, **kwargs)
  File "/net/llm-shared-nfs/nfs/yelkurdi/conda/miniconda3/envs/llmpt/lib/python3.8/site-packages/deepspeed/ops/transformer/inference/transformer_inference.py", line 543, in forward
    output = DeepSpeedSelfAttentionFunction.apply(
  File "/net/llm-shared-nfs/nfs/yelkurdi/conda/miniconda3/envs/llmpt/lib/python3.8/site-packages/deepspeed/ops/transformer/inference/transformer_inference.py", line 466, in forward
    dist.all_reduce(output, group=mp_group)
  File "/net/llm-shared-nfs/nfs/yelkurdi/conda/miniconda3/envs/llmpt/lib/python3.8/site-packages/deepspeed/comm/comm.py", line 312, in all_reduce
    return cdb.all_reduce(tensor, op, group, async_op)
  File "/net/llm-shared-nfs/nfs/yelkurdi/conda/miniconda3/envs/llmpt/lib/python3.8/site-packages/deepspeed/comm/torch.py", line 49, in all_reduce
    return torch.distributed.all_reduce(tensor=tensor,
  File "/net/llm-shared-nfs/nfs/yelkurdi/conda/miniconda3/envs/llmpt/lib/python3.8/site-packages/torch/distributed/distributed_c10d.py", line 1287, in all_reduce
    work = group.allreduce([tensor], opts)
RuntimeError: NCCL error in: /opt/conda/conda-bld/pytorch_1639180588308/work/torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:957, unhandled cuda error, NCCL version 21.0.3
ncclUnhandledCudaError: Call to CUDA function failed.
Traceback (most recent call last):
  File "scripts/inference/bloom-ds-inference.py", line 257, in <module>
    _ = generate()
  File "scripts/inference/bloom-ds-inference.py", line 244, in generate
    outputs = model.generate(**input_tokens, **generate_kwargs)
  File "/net/llm-shared-nfs/nfs/yelkurdi/conda/miniconda3/envs/llmpt/lib/python3.8/site-packages/torch/autograd/grad_mode.py", line 28, in decorate_context
    return func(*args, **kwargs)
  File "/net/llm-shared-nfs/nfs/yelkurdi/conda/miniconda3/envs/llmpt/lib/python3.8/site-packages/transformers/generation_utils.py", line 1288, in generate
    return self.greedy_search(
  File "/net/llm-shared-nfs/nfs/yelkurdi/conda/miniconda3/envs/llmpt/lib/python3.8/site-packages/transformers/generation_utils.py", line 1683, in greedy_search
    outputs = self(
  File "/net/llm-shared-nfs/nfs/yelkurdi/conda/miniconda3/envs/llmpt/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1102, in _call_impl
    return forward_call(*input, **kwargs)
  File "/net/llm-shared-nfs/nfs/yelkurdi/conda/miniconda3/envs/llmpt/lib/python3.8/site-packages/deepspeed/inference/engine.py", line 508, in forward
    outputs = self.model_orig_fwd(*inputs, **kwargs)
  File "/net/llm-shared-nfs/nfs/yelkurdi/conda/miniconda3/envs/llmpt/lib/python3.8/site-packages/transformers/models/bloom/modeling_bloom.py", line 919, in forward
    transformer_outputs = self.transformer(
  File "/net/llm-shared-nfs/nfs/yelkurdi/conda/miniconda3/envs/llmpt/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1102, in _call_impl
    return forward_call(*input, **kwargs)
  File "/net/llm-shared-nfs/nfs/yelkurdi/conda/miniconda3/envs/llmpt/lib/python3.8/site-packages/transformers/models/bloom/modeling_bloom.py", line 806, in forward
    outputs = block(
  File "/net/llm-shared-nfs/nfs/yelkurdi/conda/miniconda3/envs/llmpt/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1102, in _call_impl
    return forward_call(*input, **kwargs)
  File "/net/llm-shared-nfs/nfs/yelkurdi/conda/miniconda3/envs/llmpt/lib/python3.8/site-packages/deepspeed/ops/transformer/inference/transformer_inference.py", line 831, in forward
    self.attention(input,
  File "/net/llm-shared-nfs/nfs/yelkurdi/conda/miniconda3/envs/llmpt/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1102, in _call_impl
    return forward_call(*input, **kwargs)
  File "/net/llm-shared-nfs/nfs/yelkurdi/conda/miniconda3/envs/llmpt/lib/python3.8/site-packages/deepspeed/ops/transformer/inference/transformer_inference.py", line 543, in forward
    output = DeepSpeedSelfAttentionFunction.apply(
  File "/net/llm-shared-nfs/nfs/yelkurdi/conda/miniconda3/envs/llmpt/lib/python3.8/site-packages/deepspeed/ops/transformer/inference/transformer_inference.py", line 466, in forward
    dist.all_reduce(output, group=mp_group)
  File "/net/llm-shared-nfs/nfs/yelkurdi/conda/miniconda3/envs/llmpt/lib/python3.8/site-packages/deepspeed/comm/comm.py", line 312, in all_reduce
    return cdb.all_reduce(tensor, op, group, async_op)
  File "/net/llm-shared-nfs/nfs/yelkurdi/conda/miniconda3/envs/llmpt/lib/python3.8/site-packages/deepspeed/comm/torch.py", line 49, in all_reduce
    return torch.distributed.all_reduce(tensor=tensor,
  File "/net/llm-shared-nfs/nfs/yelkurdi/conda/miniconda3/envs/llmpt/lib/python3.8/site-packages/torch/distributed/distributed_c10d.py", line 1287, in all_reduce
    work = group.allreduce([tensor], opts)
RuntimeError: NCCL error in: /opt/conda/conda-bld/pytorch_1639180588308/work/torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:957, unhandled cuda error, NCCL version 21.0.3
ncclUnhandledCudaError: Call to CUDA function failed.
[2022-07-25 08:35:50,274] [INFO] [launch.py:178:sigkill_handler] Killing subprocess 915820
[2022-07-25 08:35:50,274] [INFO] [launch.py:178:sigkill_handler] Killing subprocess 915821
[2022-07-25 08:35:50,275] [INFO] [launch.py:178:sigkill_handler] Killing subprocess 915822
[2022-07-25 08:35:50,275] [INFO] [launch.py:178:sigkill_handler] Killing subprocess 915823
[2022-07-25 08:35:50,275] [INFO] [launch.py:178:sigkill_handler] Killing subprocess 915824
[2022-07-25 08:35:50,275] [INFO] [launch.py:178:sigkill_handler] Killing subprocess 915825
[2022-07-25 08:35:50,275] [INFO] [launch.py:178:sigkill_handler] Killing subprocess 915826
[2022-07-25 08:35:50,275] [INFO] [launch.py:178:sigkill_handler] Killing subprocess 915827
[2022-07-25 08:35:50,275] [ERROR] [launch.py:184:sigkill_handler] ['/net/llm-shared-nfs/nfs/yelkurdi/conda/miniconda3/envs/llmpt/bin/python', '-u', 'scripts/inference/bloom-ds-inference.py', '--local_rank=7', '--name', 'bigscience/bloom', '--benchmark'] exits with return code = -6
stas00 commented 2 years ago

Thank you for the full traceback, @xuyifanbupt - since we have 3 different unrelated implementations in this PR, could you please create a new issue for the one you reported, as it's with Accelerate and most of this thread is debugging the ds-inference one? Please tag @stas00 and @sgugger on it.

Please repaste your script and the full traceback and the rest of the notes you shared. Thank you!

RezaYazdaniAminabadi commented 2 years ago

Hi @mayank31398

You should be using the master branch of both DeepSpeed and HuggingFace. Just note that with pip install, you may not get the latest versions. I have seen the same NCCL error, which I am debugging in another scenario where the batch size is very large, but I am very surprised that you get it with batch size 1!

Thanks, Reza

mayank31398 commented 2 years ago

Hi @mayank31398

You should be using the master branch of both DeepSpeed and HuggingFace. Just note that with pip install, you may not get the latest versions. I have seen the same NCCL error, which I am debugging in another scenario where the batch size is very large, but I am very surprised that you get it with batch size 1!

Thanks, Reza

I am still getting this issue even when I have DeepSpeed and HF installed from the master branch. I am not sure what is going wrong here.

I tried with NCCL_DEBUG=INFO

llm-test-cluster-9:1281342:1283501 [1] include/alloc.h:50 NCCL WARN Cuda failure 'an illegal memory access was encountered'
llm-test-cluster-9:1281342:1283501 [1] NCCL INFO channel.cc:20 -> 1
llm-test-cluster-9:1281342:1283501 [1] NCCL INFO init.cc:373 -> 1
llm-test-cluster-9:1281342:1283501 [1] NCCL INFO init.cc:774 -> 1
llm-test-cluster-9:1281342:1283501 [1] NCCL INFO init.cc:904 -> 1
llm-test-cluster-9:1281342:1283501 [1] NCCL INFO group.cc:72 -> 1 [Async thread]

I see this just before the error traceback

@RezaYazdaniAminabadi yes, this is with batch size 1. Is this something to be fixed on my end? I am not sure, but this seems like a bug in DeepSpeed.
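
A minimal standalone all-reduce check can help tell whether plain NCCL communication works on these GPUs at all, independently of the DeepSpeed inference kernels. The sketch below is only illustrative (the script name and tensor sizes are made up; it is not one of this PR's scripts) and is meant to be launched the same way, e.g. deepspeed --num_gpus 8 nccl_sanity_check.py:

# nccl_sanity_check.py - illustrative standalone check, not part of this PR
import argparse
import os

import torch
import torch.distributed as dist

# the deepspeed launcher passes --local_rank and exports RANK / WORLD_SIZE /
# MASTER_ADDR / MASTER_PORT, so the default env:// init works here
parser = argparse.ArgumentParser()
parser.add_argument("--local_rank", type=int, default=int(os.environ.get("LOCAL_RANK", 0)))
args = parser.parse_args()

torch.cuda.set_device(args.local_rank)
dist.init_process_group(backend="nccl")

# each rank contributes its rank id; the all-reduce sums them across the group
x = torch.full((1024,), float(dist.get_rank()), dtype=torch.float16, device="cuda")
dist.all_reduce(x)
torch.cuda.synchronize()

if dist.get_rank() == 0:
    expected = sum(range(dist.get_world_size()))
    print(f"all_reduce ok: x[0]={x[0].item()} (expected {expected})")

dist.destroy_process_group()

If this also fails, the problem is in the CUDA/NCCL setup itself rather than in the kernels this PR exercises.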

mayank31398 commented 2 years ago

Also, @RezaYazdaniAminabadi, can you point out the branch which might contain the fix for this? Inference with HF is quite slow and DS is not working correctly. My team and I are working on some RL-based approaches using the output of LLMs, which need inference to be quick, so I want to monitor this issue closely if possible.

stas00 commented 2 years ago

https://github.com/microsoft/DeepSpeed/pull/2132

mayank31398 commented 2 years ago

Can we merge this? ❤️

stas00 commented 2 years ago

I don't think this is going to go here, since it has nothing to do with Meg-DS. I was just using it as a dev PR so that the Deepspeed team could push directly into it.

Once things are stable - and there are still some issues to resolve on the server side - we will merge this into transformers, where it really belongs.

mayank31398 commented 2 years ago

I don't think this is going to go here, since it has nothing to do with Meg-DS. I was just using it as a dev PR so that the Deepspeed team could push directly into it.

Once things are stable - and there are still some issues to resolve on the server side - we will merge this into transformers, where it really belongs.

There is already an inference script in the main branch under the same directory. Not sure if it works, so I thought it would be better to have this branch merged into the main branch.

mayank31398 commented 2 years ago

@stas00, I have currently deployed BLOOM in server mode using accelerate with batch size = 1. I see a slowdown of the application over time, and GPU memory fills up. Why does your code call torch.cuda.empty_cache() during benchmarking? Is it necessary during inference?

The scripts can be found here: https://github.com/bigscience-workshop/Megatron-DeepSpeed/pull/325

This was when I initially deployed:

Screen Shot 2022-08-04 at 10 51 48 PM

and this is after a long period of usage:

Screen Shot 2022-08-04 at 10 51 19 PM
stas00 commented 2 years ago

I see a slowdown of the application over time, and GPU memory fills up.

Perhaps there is a memory leak somewhere? Let's ask @sgugger, as he developed accelerate

Why does your code call torch.cuda.empty_cache() during benchmarking? Is it necessary during inference?

Not at all - you don't want to do that in production most of the time, unless there is a special situation where you want to control memory freeing. I was just using it to see how much real memory was used: since pytorch tends to cache memory, nvidia-smi doesn't tell you the real story without it.
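
Roughly, the measurement idea looks like this (an illustrative sketch, not code from the benchmark script): torch.cuda.memory_allocated() tracks live tensors, torch.cuda.memory_reserved() is close to what nvidia-smi reports, and empty_cache() only hands the cached-but-unused blocks back to the driver.

import torch

def report(tag):
    # live tensor memory vs. memory held by pytorch's caching allocator
    allocated = torch.cuda.memory_allocated() / 2**30
    reserved = torch.cuda.memory_reserved() / 2**30
    print(f"{tag}: allocated={allocated:.2f}GiB reserved={reserved:.2f}GiB")

x = torch.empty(2**28, device="cuda")  # ~1GiB of fp32
report("after alloc")
del x
report("after del")          # allocated drops, reserved (what nvidia-smi sees) does not
torch.cuda.empty_cache()
report("after empty_cache")  # now reserved drops too - useful for measuring, not needed in production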

stas00 commented 2 years ago

There is already an inference script in the main branch under the same directory. Not sure if it works, so I thought it would be better to have this branch merged into the main branch.

It was my bad, I pushed the initial version into master as I thought it was done, but then opened this PR, so that initial version is very outdated.

mayank31398 commented 2 years ago

I would like to add generation server scripts from https://github.com/bigscience-workshop/Megatron-DeepSpeed/pull/325 to this branch

mayank31398 commented 2 years ago

@RezaYazdaniAminabadi @stas00 I am still seeing an illegal memory access error with batch size = 2, and UnhandledCudaError: Call to CUDA function failed. with batch size = 4.

mayank31398 commented 2 years ago

@jeffra This might be the error. It is not in DS-MII though; it is in the bloom-ds-inference script here, but it looks similar.

!!!! kernel execution error. (m: 5376, n: 2, k: 14336, error: 13)
Traceback (most recent call last):
  File "scripts/inference/bloom-ds-inference.py", line 257, in <module>
    _ = generate()
  File "scripts/inference/bloom-ds-inference.py", line 244, in generate
    outputs = model.generate(**input_tokens, **generate_kwargs)
  File "/net/llm-shared-nfs/nfs/yelkurdi/conda/miniconda3/envs/bloom/lib/python3.8/site-packages/torch/autograd/grad_mode.py", line 27, in decorate_context
    return func(*args, **kwargs)
  File "/net/llm-shared-nfs/nfs/yelkurdi/conda/miniconda3/envs/bloom/lib/python3.8/site-packages/transformers/generation_utils.py", line 1288, in generate
    return self.greedy_search(
  File "/net/llm-shared-nfs/nfs/yelkurdi/conda/miniconda3/envs/bloom/lib/python3.8/site-packages/transformers/generation_utils.py", line 1683, in greedy_search
    outputs = self(
  File "/net/llm-shared-nfs/nfs/yelkurdi/conda/miniconda3/envs/bloom/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1110, in _call_impl
    return forward_call(*input, **kwargs)
  File "/net/llm-shared-nfs/nfs/mayank/DeepSpeed/deepspeed/inference/engine.py", line 521, in forward
    outputs = self.model_orig_fwd(*inputs, **kwargs)
  File "/net/llm-shared-nfs/nfs/yelkurdi/conda/miniconda3/envs/bloom/lib/python3.8/site-packages/transformers/models/bloom/modeling_bloom.py", line 919, in forward
    transformer_outputs = self.transformer(
  File "/net/llm-shared-nfs/nfs/yelkurdi/conda/miniconda3/envs/bloom/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1110, in _call_impl
    return forward_call(*input, **kwargs)
  File "/net/llm-shared-nfs/nfs/yelkurdi/conda/miniconda3/envs/bloom/lib/python3.8/site-packages/transformers/models/bloom/modeling_bloom.py", line 806, in forward
    outputs = block(
  File "/net/llm-shared-nfs/nfs/yelkurdi/conda/miniconda3/envs/bloom/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1110, in _call_impl
    return forward_call(*input, **kwargs)
  File "/net/llm-shared-nfs/nfs/mayank/DeepSpeed/deepspeed/ops/transformer/inference/transformer_inference.py", line 829, in forward
    self.attention(input,
  File "/net/llm-shared-nfs/nfs/yelkurdi/conda/miniconda3/envs/bloom/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1110, in _call_impl
    return forward_call(*input, **kwargs)
  File "/net/llm-shared-nfs/nfs/mayank/DeepSpeed/deepspeed/ops/transformer/inference/transformer_inference.py", line 541, in forward
    output = DeepSpeedSelfAttentionFunction.apply(
  File "/net/llm-shared-nfs/nfs/mayank/DeepSpeed/deepspeed/ops/transformer/inference/transformer_inference.py", line 461, in forward
    output, key_layer, value_layer, context_layer, inp_norm = selfAttention_fp()
  File "/net/llm-shared-nfs/nfs/mayank/DeepSpeed/deepspeed/ops/transformer/inference/transformer_inference.py", line 425, in selfAttention_fp
    context_layer, key_layer, value_layer = compute_attention(qkv_out[0] if isinstance(qkv_out, list) else qkv_out, input_mask)
  File "/net/llm-shared-nfs/nfs/mayank/DeepSpeed/deepspeed/ops/transformer/inference/transformer_inference.py", line 373, in compute_attention
    context_layer, presents = backup_attention(qkv_out, layer_past, alibi, input_mask, norm_factor)
  File "/net/llm-shared-nfs/nfs/mayank/DeepSpeed/deepspeed/ops/transformer/inference/transformer_inference.py", line 194, in backup_attention
    alibi = alibi.to(torch.cuda.current_device())
RuntimeError: CUDA error: an illegal memory access was encountered
CUDA kernel errors might be asynchronously reported at some other API call,so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
mayank31398 commented 2 years ago

@stas00, I see the memory configuration is [0, 51, 51, 51, 51, 51, 51, 51]. I tried using [45, 45, 45, 45, 45, 45, 45, 45] (a symmetrical config) and I didn't see any change in the generation throughput. Can you confirm?

mayank31398 commented 2 years ago

Also, @stas00, https://github.com/bigscience-workshop/Megatron-DeepSpeed/pull/325 uses a lot of code that is written in your scripts. If https://github.com/bigscience-workshop/Megatron-DeepSpeed/pull/325 can be merged into this branch, that would be helpful. Let me know if you feel that is right or not.

I have also done a bit of code refactoring so these duplicate methods can be reused across scripts.

stas00 commented 2 years ago

@stas00, I see the memory configuration is [0, 51, 51, 51, 51, 51, 51, 51]. I tried using [45, 45, 45, 45, 45, 45, 45, 45] (a symmetrical config) and I didn't see any change in the generation throughput. Can you confirm?

The only reason to keep the first GPU unallocated with model weights is to allow for a much higher batch size.

If you don't need the higher batch size, you don't need to do that.
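
For reference, this is roughly how such a map is expressed with Accelerate's big-model loading; the sizes and dtype below are illustrative, not the exact values used by the script:

import torch
from transformers import AutoModelForCausalLM

n_gpus = torch.cuda.device_count()

# keep GPU 0 free of weights so it has headroom for activations and a larger batch;
# a symmetric map like {i: "45GiB" for i in range(n_gpus)} also works if you don't need that
max_memory = {0: "0GiB", **{i: "51GiB" for i in range(1, n_gpus)}}

model = AutoModelForCausalLM.from_pretrained(
    "bigscience/bloom",
    device_map="auto",
    max_memory=max_memory,
    torch_dtype=torch.bfloat16,
)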

stas00 commented 2 years ago

wrt merging - let me merge this PR for now.

then rebase your PR and refactor, then we will see how to proceed.

then later we will move the whole thing into transformers.

I gave you write access to the repo, so it'll be easier for you to contribute. Just please don't push directly without a PR, and wait for at least one approval from other members before merging.