bigscience-workshop / Megatron-DeepSpeed

Ongoing research training transformer language models at scale, including: BERT & GPT-2

Multi-node inference with Bloom: Unhandled CUDA error in ProcessGroupNCCL.cpp (called from all_reduce in torch) #318

Open asaparov opened 2 years ago

asaparov commented 2 years ago

I am trying to get multi-node inference working with 4 nodes, each with 4x RTX 8000 GPUs (48 GB per GPU), using:

deepspeed --hostfile=$hostfile Megatron-DeepSpeed/scripts/inference/bloom-ds-inference.py --name bigscience/bloom
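For reference, the hostfile passed via --hostfile uses the standard DeepSpeed format of one line per node with a slots count; a minimal sketch for this 4-node, 4-GPU setup (using the node names that appear in the log below) would be:

gr061 slots=4
gr062 slots=4
gr063 slots=4
gr064 slots=4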

The script finishes loading all the checkpoints and begins inference but then quickly runs into the following error:

...
gr061: loading checkpoint (68)
gr061: loading checkpoint (69)
gr061: loading checkpoint (70)
gr063: [2022-07-20 19:03:10,723] [INFO] [engine.py:144:__init__] Place model to device: 0
gr061: loading checkpoint (71)
gr061: [2022-07-20 19:03:11,443] [INFO] [engine.py:144:__init__] Place model to device: 0
gr061: *** Starting to generate 100 tokens with bs=1
gr061: Generate args {'max_new_tokens': 100, 'do_sample': False}
gr064: [2022-07-20 19:03:12,551] [INFO] [engine.py:144:__init__] Place model to device: 3
gr061: [2022-07-20 19:03:13,294] [INFO] [engine.py:144:__init__] Place model to device: 3
gr062: [2022-07-20 19:03:14,244] [INFO] [engine.py:144:__init__] Place model to device: 2
gr062: [2022-07-20 19:03:14,406] [INFO] [engine.py:144:__init__] Place model to device: 0
gr063: [2022-07-20 19:03:14,791] [INFO] [engine.py:144:__init__] Place model to device: 2
gr064: [2022-07-20 19:03:15,444] [INFO] [engine.py:144:__init__] Place model to device: 2
gr061: [2022-07-20 19:03:15,542] [INFO] [engine.py:144:__init__] Place model to device: 2
gr061: [2022-07-20 19:03:15,618] [INFO] [engine.py:144:__init__] Place model to device: 1
gr062: [2022-07-20 19:03:16,179] [INFO] [engine.py:144:__init__] Place model to device: 3
gr062: [2022-07-20 19:03:16,513] [INFO] [engine.py:144:__init__] Place model to device: 1
gr064: [2022-07-20 19:03:16,777] [INFO] [engine.py:144:__init__] Place model to device: 0
gr064: [2022-07-20 19:03:17,541] [INFO] [engine.py:144:__init__] Place model to device: 1
gr063: [2022-07-20 19:03:18,336] [INFO] [engine.py:144:__init__] Place model to device: 3
gr063: [2022-07-20 19:03:18,547] [INFO] [engine.py:144:__init__] Place model to device: 1
gr064: Traceback (most recent call last):
gr064:   File "/scratch/as17582/Megatron-DeepSpeed/scripts/inference/bloom-ds-inference.py", line 257, in <module>
gr064:     _ = generate()
gr064:   File "/scratch/as17582/Megatron-DeepSpeed/scripts/inference/bloom-ds-inference.py", line 244, in generate
gr064:     outputs = model.generate(**input_tokens, **generate_kwargs)
gr064:   File "/ext3/miniconda3/lib/python3.9/site-packages/torch/autograd/grad_mode.py", line 27, in decorate_context
gr064: Traceback (most recent call last):
gr064:   File "/scratch/as17582/Megatron-DeepSpeed/scripts/inference/bloom-ds-inference.py", line 257, in <module>
gr064:     _ = generate()
gr064:   File "/scratch/as17582/Megatron-DeepSpeed/scripts/inference/bloom-ds-inference.py", line 244, in generate
gr064:     outputs = model.generate(**input_tokens, **generate_kwargs)
gr064:   File "/ext3/miniconda3/lib/python3.9/site-packages/torch/autograd/grad_mode.py", line 27, in decorate_context
gr064: Traceback (most recent call last):
gr064:   File "/scratch/as17582/Megatron-DeepSpeed/scripts/inference/bloom-ds-inference.py", line 257, in <module>
gr064:     _ = generate()
gr064:   File "/scratch/as17582/Megatron-DeepSpeed/scripts/inference/bloom-ds-inference.py", line 244, in generate
gr064:     outputs = model.generate(**input_tokens, **generate_kwargs)
gr064:   File "/ext3/miniconda3/lib/python3.9/site-packages/torch/autograd/grad_mode.py", line 27, in decorate_context
gr064:     return func(*args, **kwargs)
gr064:       File "/ext3/miniconda3/lib/python3.9/site-packages/transformers/generation_utils.py", line 1288, in generate
gr064: return func(*args, **kwargs)
gr064:   File "/ext3/miniconda3/lib/python3.9/site-packages/transformers/generation_utils.py", line 1288, in generate
gr064:     return func(*args, **kwargs)
gr064:   File "/ext3/miniconda3/lib/python3.9/site-packages/transformers/generation_utils.py", line 1288, in generate
gr064:     return func(*args, **kwargs)
gr064:   File "/ext3/miniconda3/lib/python3.9/site-packages/transformers/generation_utils.py", line 1288, in generate
gr064:   File "/ext3/miniconda3/lib/python3.9/site-packages/transformers/generation_utils.py", line 1683, in greedy_search
gr064:   File "/ext3/miniconda3/lib/python3.9/site-packages/transformers/generation_utils.py", line 1683, in greedy_search
gr064:   File "/ext3/miniconda3/lib/python3.9/site-packages/transformers/generation_utils.py", line 1683, in greedy_search
gr064:   File "/ext3/miniconda3/lib/python3.9/site-packages/transformers/generation_utils.py", line 1683, in greedy_search
gr064:     outputs = self(
gr064:   File "/ext3/miniconda3/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl
gr064:     outputs = self(
gr064:     outputs = self(
gr064:       File "/ext3/miniconda3/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl
gr064: outputs = self(
gr064:   File "/ext3/miniconda3/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl
gr064:   File "/ext3/miniconda3/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl
gr064:     return forward_call(*input, **kwargs)
gr064:   File "/scratch/as17582/deepspeed/deepspeed/inference/engine.py", line 505, in forward
gr064:     return forward_call(*input, **kwargs)
gr064:       File "/scratch/as17582/deepspeed/deepspeed/inference/engine.py", line 505, in forward
gr064:     return forward_call(*input, **kwargs)return forward_call(*input, **kwargs)
gr064:
gr064:   File "/scratch/as17582/deepspeed/deepspeed/inference/engine.py", line 505, in forward
gr064:   File "/scratch/as17582/deepspeed/deepspeed/inference/engine.py", line 505, in forward
gr064:         outputs = self.model_orig_fwd(*inputs, **kwargs)outputs = self.model_orig_fwd(*inputs, **kwargs)
gr064:
gr064:   File "/ext3/miniconda3/lib/python3.9/site-packages/transformers/models/bloom/modeling_bloom.py", line 919, in forward
gr064:   File "/ext3/miniconda3/lib/python3.9/site-packages/transformers/models/bloom/modeling_bloom.py", line 919, in forward
gr064:     outputs = self.model_orig_fwd(*inputs, **kwargs)
gr064:   File "/ext3/miniconda3/lib/python3.9/site-packages/transformers/models/bloom/modeling_bloom.py", line 919, in forward
gr064:     outputs = self.model_orig_fwd(*inputs, **kwargs)
gr064:   File "/ext3/miniconda3/lib/python3.9/site-packages/transformers/models/bloom/modeling_bloom.py", line 919, in forward
gr064:                 transformer_outputs = self.transformer(transformer_outputs = self.transformer(transformer_outputs = self.transformer(transformer_outputs = self.transformer(
gr064:
gr064:
gr064:
gr064:   File "/ext3/miniconda3/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl
gr064:   File "/ext3/miniconda3/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl
gr064:   File "/ext3/miniconda3/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl
gr064:   File "/ext3/miniconda3/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl
gr064:     return forward_call(*input, **kwargs)
gr064:   File "/ext3/miniconda3/lib/python3.9/site-packages/transformers/models/bloom/modeling_bloom.py", line 806, in forward
gr064:         return forward_call(*input, **kwargs)return forward_call(*input, **kwargs)
gr064:
gr064:   File "/ext3/miniconda3/lib/python3.9/site-packages/transformers/models/bloom/modeling_bloom.py", line 806, in forward
gr064:       File "/ext3/miniconda3/lib/python3.9/site-packages/transformers/models/bloom/modeling_bloom.py", line 806, in forward
gr064: return forward_call(*input, **kwargs)
gr064:   File "/ext3/miniconda3/lib/python3.9/site-packages/transformers/models/bloom/modeling_bloom.py", line 806, in forward
gr064:     outputs = block(
gr064:   File "/ext3/miniconda3/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl
gr064:     outputs = block(
gr064: outputs = block(
gr064:   File "/ext3/miniconda3/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl
gr064:   File "/ext3/miniconda3/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl
gr064:     outputs = block(
gr064:   File "/ext3/miniconda3/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl
gr064:     return forward_call(*input, **kwargs)
gr064:   File "/scratch/as17582/deepspeed/deepspeed/ops/transformer/inference/transformer_inference.py", line 828, in forward
gr064:     return forward_call(*input, **kwargs)
gr064:     return forward_call(*input, **kwargs)
gr064:   File "/scratch/as17582/deepspeed/deepspeed/ops/transformer/inference/transformer_inference.py", line 828, in forward
gr064:   File "/scratch/as17582/deepspeed/deepspeed/ops/transformer/inference/transformer_inference.py", line 828, in forward
gr064:     return forward_call(*input, **kwargs)
gr064:   File "/scratch/as17582/deepspeed/deepspeed/ops/transformer/inference/transformer_inference.py", line 828, in forward
gr064:     self.attention(input,
gr064:   File "/ext3/miniconda3/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl
gr064:     self.attention(input,
gr064:       File "/ext3/miniconda3/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl
gr064: self.attention(input,
gr064:   File "/ext3/miniconda3/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl
gr064:     self.attention(input,
gr064:   File "/ext3/miniconda3/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl
gr064:     return forward_call(*input, **kwargs)
gr064:   File "/scratch/as17582/deepspeed/deepspeed/ops/transformer/inference/transformer_inference.py", line 541, in forward
gr064:     return forward_call(*input, **kwargs)
gr064:   File "/scratch/as17582/deepspeed/deepspeed/ops/transformer/inference/transformer_inference.py", line 541, in forward
gr064:     return forward_call(*input, **kwargs)
gr064:   File "/scratch/as17582/deepspeed/deepspeed/ops/transformer/inference/transformer_inference.py", line 541, in forward
gr064:     return forward_call(*input, **kwargs)
gr064:   File "/scratch/as17582/deepspeed/deepspeed/ops/transformer/inference/transformer_inference.py", line 541, in forward
gr064:     output = DeepSpeedSelfAttentionFunction.apply(
gr064:   File "/scratch/as17582/deepspeed/deepspeed/ops/transformer/inference/transformer_inference.py", line 464, in forward
gr064:     output = DeepSpeedSelfAttentionFunction.apply(
gr064:   File "/scratch/as17582/deepspeed/deepspeed/ops/transformer/inference/transformer_inference.py", line 464, in forward
gr064:     output = DeepSpeedSelfAttentionFunction.apply(
gr064:   File "/scratch/as17582/deepspeed/deepspeed/ops/transformer/inference/transformer_inference.py", line 464, in forward
gr064:     output = DeepSpeedSelfAttentionFunction.apply(
gr064:   File "/scratch/as17582/deepspeed/deepspeed/ops/transformer/inference/transformer_inference.py", line 464, in forward
gr064:     dist.all_reduce(output, group=mp_group)
gr064:   File "/scratch/as17582/deepspeed/deepspeed/comm/comm.py", line 312, in all_reduce
gr064:     dist.all_reduce(output, group=mp_group)
gr064:   File "/scratch/as17582/deepspeed/deepspeed/comm/comm.py", line 312, in all_reduce
gr064:     dist.all_reduce(output, group=mp_group)
gr064:   File "/scratch/as17582/deepspeed/deepspeed/comm/comm.py", line 312, in all_reduce
gr064:     dist.all_reduce(output, group=mp_group)
gr064:   File "/scratch/as17582/deepspeed/deepspeed/comm/comm.py", line 312, in all_reduce
gr064:     return cdb.all_reduce(tensor, op, group, async_op)
gr064:   File "/scratch/as17582/deepspeed/deepspeed/comm/torch.py", line 48, in all_reduce
gr064:     return cdb.all_reduce(tensor, op, group, async_op)
gr064:   File "/scratch/as17582/deepspeed/deepspeed/comm/torch.py", line 48, in all_reduce
gr064:     return torch.distributed.all_reduce(tensor=tensor,
gr064:   File "/ext3/miniconda3/lib/python3.9/site-packages/torch/distributed/distributed_c10d.py", line 1322, in all_reduce
gr064:     return torch.distributed.all_reduce(tensor=tensor,
gr064:   File "/ext3/miniconda3/lib/python3.9/site-packages/torch/distributed/distributed_c10d.py", line 1322, in all_reduce
gr064:     return cdb.all_reduce(tensor, op, group, async_op)
gr064:   File "/scratch/as17582/deepspeed/deepspeed/comm/torch.py", line 48, in all_reduce
gr064:     return torch.distributed.all_reduce(tensor=tensor,
gr064:     return cdb.all_reduce(tensor, op, group, async_op)  File "/ext3/miniconda3/lib/python3.9/site-packages/torch/distributed/distributed_c10d.py", line 1322, in all_reduce
gr064:
gr064:   File "/scratch/as17582/deepspeed/deepspeed/comm/torch.py", line 48, in all_reduce
gr064:     return torch.distributed.all_reduce(tensor=tensor,
gr064:   File "/ext3/miniconda3/lib/python3.9/site-packages/torch/distributed/distributed_c10d.py", line 1322, in all_reduce
gr064:     work = group.allreduce([tensor], opts)
gr064: work = group.allreduce([tensor], opts)
gr064:     work = group.allreduce([tensor], opts)
gr064: RuntimeErrorRuntimeError: :     NCCL error in: /opt/conda/conda-bld/pytorch_1656352657443/work/torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1191, unhandled cuda error, NCCL version 2.10.3
gr064: ncclUnhandledCudaError: Call to CUDA function failed.NCCL error in: /opt/conda/conda-bld/pytorch_1656352657443/work/torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1191, unhandled cuda error, NCCL version 2.10.3
gr064: ncclUnhandledCudaError: Call to CUDA function failed.RuntimeError
gr064: work = group.allreduce([tensor], opts)
gr064: :
gr064: NCCL error in: /opt/conda/conda-bld/pytorch_1656352657443/work/torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1191, unhandled cuda error, NCCL version 2.10.3
gr064: ncclUnhandledCudaError: Call to CUDA function failed.
gr064: RuntimeError: NCCL error in: /opt/conda/conda-bld/pytorch_1656352657443/work/torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1191, unhandled cuda error, NCCL version 2.10.3
gr064: ncclUnhandledCudaError: Call to CUDA function failed.
gr064: terminate called after throwing an instance of 'c10::CUDAError'
gr064:   what():  CUDA error: an illegal memory access was encountered
gr064: CUDA kernel errors might be asynchronously reported at some other API call,so the stacktrace below might be incorrect.
gr064: For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
gr064: Exception raised from create_event_internal at /opt/conda/conda-bld/pytorch_1656352657443/work/c10/cuda/CUDACachingAllocator.cpp:1387 (most recent call first):
gr064: frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7fae5f70b477 in /ext3/miniconda3/lib/python3.9/site-packages/torch/lib/libc10.so)
gr064: frame #1: <unknown function> + 0x1d4a3 (0x7fae8ccfc4a3 in /ext3/miniconda3/lib/python3.9/site-packages/torch/lib/libc10_cuda.so)
gr064: frame #2: c10::cuda::CUDACachingAllocator::raw_delete(void*) + 0x237 (0x7fae8cd02417 in /ext3/miniconda3/lib/python3.9/site-packages/torch/lib/libc10_cuda.so)
gr064: frame #3: <unknown function> + 0x458c68 (0x7fae9f4f0c68 in /ext3/miniconda3/lib/python3.9/site-packages/torch/lib/libtorch_python.so)
gr064: frame #4: c10::TensorImpl::release_resources() + 0x175 (0x7fae5f6eed95 in /ext3/miniconda3/lib/python3.9/site-packages/torch/lib/libc10.so)
gr064: frame #5: <unknown function> + 0x34db35 (0x7fae9f3e5b35 in /ext3/miniconda3/lib/python3.9/site-packages/torch/lib/libtorch_python.so)
gr064: frame #6: <unknown function> + 0x681fc8 (0x7fae9f719fc8 in /ext3/miniconda3/lib/python3.9/site-packages/torch/lib/libtorch_python.so)
gr064: frame #7: THPVariable_subclass_dealloc(_object*) + 0x2b5 (0x7fae9f71a2c5 in /ext3/miniconda3/lib/python3.9/site-packages/torch/lib/libtorch_python.so)
gr064: frame #8: <unknown function> + 0x127e28 (0x55ccd72e1e28 in /ext3/miniconda3/bin/python3.9)
gr064: frame #9: <unknown function> + 0x134ad8 (0x55ccd72eead8 in /ext3/miniconda3/bin/python3.9)
gr064: frame #10: <unknown function> + 0x1487ce (0x55ccd73027ce in /ext3/miniconda3/bin/python3.9)
gr064: frame #11: <unknown function> + 0x1487bb (0x55ccd73027bb in /ext3/miniconda3/bin/python3.9)
gr064: frame #12: <unknown function> + 0x1487bb (0x55ccd73027bb in /ext3/miniconda3/bin/python3.9)
gr064: frame #13: <unknown function> + 0x1487bb (0x55ccd73027bb in /ext3/miniconda3/bin/python3.9)
gr064: frame #14: <unknown function> + 0x1487bb (0x55ccd73027bb in /ext3/miniconda3/bin/python3.9)
gr064: frame #15: <unknown function> + 0x11c661 (0x55ccd72d6661 in /ext3/miniconda3/bin/python3.9)
gr064: frame #16: PyDict_SetItemString + 0x4a (0x55ccd72dc81a in /ext3/miniconda3/bin/python3.9)
gr064: frame #17: <unknown function> + 0x214aec (0x55ccd73ceaec in /ext3/miniconda3/bin/python3.9)
gr064: frame #18: Py_FinalizeEx + 0x186 (0x55ccd73cdf56 in /ext3/miniconda3/bin/python3.9)
gr064: frame #19: Py_RunMain + 0x112 (0x55ccd73c12b2 in /ext3/miniconda3/bin/python3.9)
gr064: frame #20: Py_BytesMain + 0x39 (0x55ccd7393b79 in /ext3/miniconda3/bin/python3.9)
gr064: frame #21: __libc_start_main + 0xf3 (0x7faee4a9a0b3 in /lib/x86_64-linux-gnu/libc.so.6)
gr064: frame #22: <unknown function> + 0x1d9a81 (0x55ccd7393a81 in /ext3/miniconda3/bin/python3.9)
gr064:
gr064: terminate called after throwing an instance of 'c10::CUDAError'
gr064:   what():  CUDA error: an illegal memory access was encountered
gr064: CUDA kernel errors might be asynchronously reported at some other API call,so the stacktrace below might be incorrect.
gr064: For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
gr064: Exception raised from create_event_internal at /opt/conda/conda-bld/pytorch_1656352657443/work/c10/cuda/CUDACachingAllocator.cpp:1387 (most recent call first):
gr064: frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7f183ee2a477 in /ext3/miniconda3/lib/python3.9/site-packages/torch/lib/libc10.so)
gr064: frame #1: <unknown function> + 0x1d4a3 (0x7f186c41b4a3 in /ext3/miniconda3/lib/python3.9/site-packages/torch/lib/libc10_cuda.so)
gr064: frame #2: c10::cuda::CUDACachingAllocator::raw_delete(void*) + 0x237 (0x7f186c421417 in /ext3/miniconda3/lib/python3.9/site-packages/torch/lib/libc10_cuda.so)
gr064: frame #3: <unknown function> + 0x458c68 (0x7f187ec0fc68 in /ext3/miniconda3/lib/python3.9/site-packages/torch/lib/libtorch_python.so)
gr064: frame #4: c10::TensorImpl::release_resources() + 0x175 (0x7f183ee0dd95 in /ext3/miniconda3/lib/python3.9/site-packages/torch/lib/libc10.so)
gr064: frame #5: <unknown function> + 0x34db35 (0x7f187eb04b35 in /ext3/miniconda3/lib/python3.9/site-packages/torch/lib/libtorch_python.so)
gr064: frame #6: <unknown function> + 0x681fc8 (0x7f187ee38fc8 in /ext3/miniconda3/lib/python3.9/site-packages/torch/lib/libtorch_python.so)
gr064: frame #7: THPVariable_subclass_dealloc(_object*) + 0x2b5 (0x7f187ee392c5 in /ext3/miniconda3/lib/python3.9/site-packages/torch/lib/libtorch_python.so)
gr064: frame #8: <unknown function> + 0x127e28 (0x5616533d4e28 in /ext3/miniconda3/bin/python3.9)
gr064: frame #9: <unknown function> + 0x134ad8 (0x5616533e1ad8 in /ext3/miniconda3/bin/python3.9)
gr064: frame #10: <unknown function> + 0x1487ce (0x5616533f57ce in /ext3/miniconda3/bin/python3.9)
gr064: frame #11: <unknown function> + 0x1487bb (0x5616533f57bb in /ext3/miniconda3/bin/python3.9)
gr064: frame #12: <unknown function> + 0x1487bb (0x5616533f57bb in /ext3/miniconda3/bin/python3.9)
gr064: frame #13: <unknown function> + 0x1487bb (0x5616533f57bb in /ext3/miniconda3/bin/python3.9)
gr064: frame #14: <unknown function> + 0x1487bb (0x5616533f57bb in /ext3/miniconda3/bin/python3.9)
gr064: frame #15: <unknown function> + 0x11c661 (0x5616533c9661 in /ext3/miniconda3/bin/python3.9)
gr064: frame #16: PyDict_SetItemString + 0x4a (0x5616533cf81a in /ext3/miniconda3/bin/python3.9)
gr064: frame #17: <unknown function> + 0x214aec (0x5616534c1aec in /ext3/miniconda3/bin/python3.9)
gr064: frame #18: Py_FinalizeEx + 0x186 (0x5616534c0f56 in /ext3/miniconda3/bin/python3.9)
gr064: frame #19: Py_RunMain + 0x112 (0x5616534b42b2 in /ext3/miniconda3/bin/python3.9)
gr064: frame #20: Py_BytesMain + 0x39 (0x561653486b79 in /ext3/miniconda3/bin/python3.9)
gr064: frame #21: __libc_start_main + 0xf3 (0x7f18c41b90b3 in /lib/x86_64-linux-gnu/libc.so.6)
gr064: frame #22: <unknown function> + 0x1d9a81 (0x561653486a81 in /ext3/miniconda3/bin/python3.9)
gr064:
gr064: terminate called after throwing an instance of 'c10::CUDAError'
gr064:   what():  CUDA error: an illegal memory access was encountered
gr064: CUDA kernel errors might be asynchronously reported at some other API call,so the stacktrace below might be incorrect.
gr064: For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
gr064: Exception raised from create_event_internal at /opt/conda/conda-bld/pytorch_1656352657443/work/c10/cuda/CUDACachingAllocator.cpp:1387 (most recent call first):
gr064: frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7fb213ab8477 in /ext3/miniconda3/lib/python3.9/site-packages/torch/lib/libc10.so)
gr064: frame #1: <unknown function> + 0x1d4a3 (0x7fb2410a94a3 in /ext3/miniconda3/lib/python3.9/site-packages/torch/lib/libc10_cuda.so)
gr064: frame #2: c10::cuda::CUDACachingAllocator::raw_delete(void*) + 0x237 (0x7fb2410af417 in /ext3/miniconda3/lib/python3.9/site-packages/torch/lib/libc10_cuda.so)
gr064: frame #3: <unknown function> + 0x458c68 (0x7fb25389dc68 in /ext3/miniconda3/lib/python3.9/site-packages/torch/lib/libtorch_python.so)
gr064: frame #4: c10::TensorImpl::release_resources() + 0x175 (0x7fb213a9bd95 in /ext3/miniconda3/lib/python3.9/site-packages/torch/lib/libc10.so)
gr064: frame #5: <unknown function> + 0x34db35 (0x7fb253792b35 in /ext3/miniconda3/lib/python3.9/site-packages/torch/lib/libtorch_python.so)
gr064: frame #6: <unknown function> + 0x681fc8 (0x7fb253ac6fc8 in /ext3/miniconda3/lib/python3.9/site-packages/torch/lib/libtorch_python.so)
gr064: frame #7: THPVariable_subclass_dealloc(_object*) + 0x2b5 (0x7fb253ac72c5 in /ext3/miniconda3/lib/python3.9/site-packages/torch/lib/libtorch_python.so)
gr064: frame #8: <unknown function> + 0x127e28 (0x5616125aee28 in /ext3/miniconda3/bin/python3.9)
gr064: frame #9: <unknown function> + 0x134ad8 (0x5616125bbad8 in /ext3/miniconda3/bin/python3.9)
gr064: frame #10: <unknown function> + 0x1487ce (0x5616125cf7ce in /ext3/miniconda3/bin/python3.9)
gr064: frame #11: <unknown function> + 0x1487bb (0x5616125cf7bb in /ext3/miniconda3/bin/python3.9)
gr064: frame #12: <unknown function> + 0x1487bb (0x5616125cf7bb in /ext3/miniconda3/bin/python3.9)
gr064: frame #13: <unknown function> + 0x1487bb (0x5616125cf7bb in /ext3/miniconda3/bin/python3.9)
gr064: frame #14: <unknown function> + 0x1487bb (0x5616125cf7bb in /ext3/miniconda3/bin/python3.9)
gr064: frame #15: <unknown function> + 0x11c661 (0x5616125a3661 in /ext3/miniconda3/bin/python3.9)
gr064: frame #16: PyDict_SetItemString + 0x4a (0x5616125a981a in /ext3/miniconda3/bin/python3.9)
gr064: frame #17: <unknown function> + 0x214aec (0x56161269baec in /ext3/miniconda3/bin/python3.9)
gr064: frame #18: Py_FinalizeEx + 0x186 (0x56161269af56 in /ext3/miniconda3/bin/python3.9)
gr064: frame #19: Py_RunMain + 0x112 (0x56161268e2b2 in /ext3/miniconda3/bin/python3.9)
gr064: frame #20: Py_BytesMain + 0x39 (0x561612660b79 in /ext3/miniconda3/bin/python3.9)
gr064: frame #21: __libc_start_main + 0xf3 (0x7fb298e470b3 in /lib/x86_64-linux-gnu/libc.so.6)
gr064: frame #22: <unknown function> + 0x1d9a81 (0x561612660a81 in /ext3/miniconda3/bin/python3.9)
gr064:
gr064: terminate called after throwing an instance of 'c10::CUDAError'
gr064:   what():  CUDA error: an illegal memory access was encountered
gr064: CUDA kernel errors might be asynchronously reported at some other API call,so the stacktrace below might be incorrect.
gr064: For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
gr064: Exception raised from create_event_internal at /opt/conda/conda-bld/pytorch_1656352657443/work/c10/cuda/CUDACachingAllocator.cpp:1387 (most recent call first):
gr064: frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7f8724e9e477 in /ext3/miniconda3/lib/python3.9/site-packages/torch/lib/libc10.so)
gr064: frame #1: <unknown function> + 0x1d4a3 (0x7f875248f4a3 in /ext3/miniconda3/lib/python3.9/site-packages/torch/lib/libc10_cuda.so)
gr064: frame #2: c10::cuda::CUDACachingAllocator::raw_delete(void*) + 0x237 (0x7f8752495417 in /ext3/miniconda3/lib/python3.9/site-packages/torch/lib/libc10_cuda.so)
gr064: frame #3: <unknown function> + 0x458c68 (0x7f8764c83c68 in /ext3/miniconda3/lib/python3.9/site-packages/torch/lib/libtorch_python.so)
gr064: frame #4: c10::TensorImpl::release_resources() + 0x175 (0x7f8724e81d95 in /ext3/miniconda3/lib/python3.9/site-packages/torch/lib/libc10.so)
gr064: frame #5: <unknown function> + 0x34db35 (0x7f8764b78b35 in /ext3/miniconda3/lib/python3.9/site-packages/torch/lib/libtorch_python.so)
gr064: frame #6: <unknown function> + 0x681fc8 (0x7f8764eacfc8 in /ext3/miniconda3/lib/python3.9/site-packages/torch/lib/libtorch_python.so)
gr064: frame #7: THPVariable_subclass_dealloc(_object*) + 0x2b5 (0x7f8764ead2c5 in /ext3/miniconda3/lib/python3.9/site-packages/torch/lib/libtorch_python.so)
gr064: frame #8: <unknown function> + 0x127e28 (0x5640a0321e28 in /ext3/miniconda3/bin/python3.9)
gr064: frame #9: <unknown function> + 0x134ad8 (0x5640a032ead8 in /ext3/miniconda3/bin/python3.9)
gr064: frame #10: <unknown function> + 0x1487ce (0x5640a03427ce in /ext3/miniconda3/bin/python3.9)
gr064: frame #11: <unknown function> + 0x1487bb (0x5640a03427bb in /ext3/miniconda3/bin/python3.9)
gr064: frame #12: <unknown function> + 0x1487bb (0x5640a03427bb in /ext3/miniconda3/bin/python3.9)
gr064: frame #13: <unknown function> + 0x1487bb (0x5640a03427bb in /ext3/miniconda3/bin/python3.9)
gr064: frame #14: <unknown function> + 0x1487bb (0x5640a03427bb in /ext3/miniconda3/bin/python3.9)
gr064: frame #15: <unknown function> + 0x11c661 (0x5640a0316661 in /ext3/miniconda3/bin/python3.9)
gr064: frame #16: PyDict_SetItemString + 0x4a (0x5640a031c81a in /ext3/miniconda3/bin/python3.9)
gr064: frame #17: <unknown function> + 0x214aec (0x5640a040eaec in /ext3/miniconda3/bin/python3.9)
gr064: frame #18: Py_FinalizeEx + 0x186 (0x5640a040df56 in /ext3/miniconda3/bin/python3.9)
gr064: frame #19: Py_RunMain + 0x112 (0x5640a04012b2 in /ext3/miniconda3/bin/python3.9)
gr064: frame #20: Py_BytesMain + 0x39 (0x5640a03d3b79 in /ext3/miniconda3/bin/python3.9)
gr064: frame #21: __libc_start_main + 0xf3 (0x7f87aa22d0b3 in /lib/x86_64-linux-gnu/libc.so.6)
gr064: frame #22: <unknown function> + 0x1d9a81 (0x5640a03d3a81 in /ext3/miniconda3/bin/python3.9)
gr064:
gr064: [2022-07-20 19:03:32,219] [INFO] [launch.py:178:sigkill_handler] Killing subprocess 1678791
gr064: [2022-07-20 19:03:32,220] [INFO] [launch.py:178:sigkill_handler] Killing subprocess 1678792
gr064: [2022-07-20 19:03:32,220] [INFO] [launch.py:178:sigkill_handler] Killing subprocess 1678793
gr064: [2022-07-20 19:03:32,220] [INFO] [launch.py:178:sigkill_handler] Killing subprocess 1678794
gr064: [2022-07-20 19:03:32,220] [ERROR] [launch.py:184:sigkill_handler] ['/ext3/miniconda3/bin/python3.9', '-u', 'Megatron-DeepSpeed/scripts/inference/bloom-ds-inference.py', '--local_rank=3', '--name', 'bigscience/bloom'] exits with return code = -6
pdsh@gr061: gr064: ssh exited with exit code 250
pdsh@gr061: gr062: ssh exited with exit code 250
pdsh@gr061: gr061: ssh exited with exit code 250

I've tried with CUDA 10.2 and 11.6 and there's no difference.

asaparov commented 2 years ago

@stas00

stas00 commented 2 years ago

Yeah, I get that too when I try to use too large a batch size. But if you're running my script, its default is bs=1, so that shouldn't really be a problem. I haven't tried it on your setup, but the issue is on the DS-Inference side.

@RezaYazdaniAminabadi, as you can see, both I and many others run into this issue - could we change the kernel code to be more defensive? It always fails at the same group.allreduce([tensor], opts) call.

RezaYazdaniAminabadi commented 2 years ago

Hi @stas00 ,

Thanks for tagging me here. I will definitely look into this and try to fix it soon.

Best, Reza

stas00 commented 2 years ago

@asaparov, please run the following 2 experiments:

  1. Same setup as yours, but add CUDA_LAUNCH_BLOCKING=1, as in:
CUDA_LAUNCH_BLOCKING=1 deepspeed --hostfile=$hostfile Megatron-DeepSpeed/scripts/inference/bloom-ds-inference.py --name bigscience/bloom

and let's see if it starts working.

  2. Does it fail in the same way if you use "bigscience/bloom-1b3"? This is just to check whether the issue is the model size rather than the setup/system. But don't use CUDA_LAUNCH_BLOCKING=1 this time. That is:
deepspeed --hostfile=$hostfile Megatron-DeepSpeed/scripts/inference/bloom-ds-inference.py --name bigscience/bloom-1b3

Thank you!

asaparov commented 2 years ago

@stas00 It seems to be working with CUDA_LAUNCH_BLOCKING=1!

I'll test with bigscience/bloom-1b3 next.

stas00 commented 2 years ago

Thank you for reporting back, @asaparov! You can use this workaround for now (it will be just a tad slower) until the underlying issue is resolved. The difficulty is in reproducing it.

@RezaYazdaniAminabadi, @asaparov's success with CUDA_LAUNCH_BLOCKING=1 points to some unsynchronized code in the kernels, as I proposed yesterday.

asaparov commented 2 years ago

@stas00 Actually I just tested both bigscience/bloom and bigscience/bloom-1b3 without CUDA_LAUNCH_BLOCKING=1 and they both work. This is probably because I pulled newer code from the bloom-inference branch of this repo (commit b76e516) and the code from the ds-inference/bloom-fix branch of DeepSpeed (commit f39c78f).

I had to fix a few bugs related to save_mp_checkpoint_path being set to False instead of None, but everything seems to work fine after that.

stas00 commented 2 years ago

I suspect that the bug is intermittent, as it pops up in various situations inconsistently. But if it works for you at the moment, that's great!

Yes, save_mp_checkpoint_path was just added and is still being fixed up.

It basically lets you set a path for a tp-sharded checkpoint, which will then be saved on init, and loading from it takes 1-2 min instead of 10-20 min. You may want to give it a try.

Once the checkpoint is created, you need to set parallelization="tp".

The two new changes are: first, the addition of save_mp_checkpoint_path to save the tp-sharded weights on init:

kwargs["save_mp_checkpoint_path"] = checkpoint_dir

#checkpoints_json=None
model = deepspeed.init_inference(model,
                                 mp_size=world_size,
                                 dtype=torch.half,
                                 checkpoint=checkpoints_json,
                                 **kwargs,
                                 )

and second, the addition of parallelization to the checkpoint JSON format:

checkpoint_type = "tp"
checkpoint_dir = "/home/nicolas_huggingface_co/src/Megatron-DeepSpeed/bloom-tp"

checkpoint_files = glob.glob(f"{checkpoint_dir}/*pt")
if len(checkpoint_files) == 0:
    # hf checkpoint
    checkpoint_files = get_checkpoint_files(model_name)
    checkpoint_type = "pp" # normal hf hub checkpoint

if rank == 0:
    print("Checkpoint files:", checkpoint_files)
    print("Checkpoint type:", checkpoint_type)

checkpoints_json = "checkpoints.json"
def write_checkponts_json():
    with io.open(checkpoints_json, 'w', encoding='utf-8') as f:
        data = {
            "type": "BLOOM-176B",
            "checkpoints": checkpoint_files,
            "version": 1.0,
            "parallelization": checkpoint_type,
        }
        # write the checkpoint metadata so deepspeed.init_inference can read it
        json.dump(data, f)
The two valid values are pp (a normal HF checkpoint) and tp (a tp-sharded checkpoint).
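For illustration, the checkpoints.json produced by the snippet above for a tp-sharded checkpoint would contain roughly the following (the .pt file names here are hypothetical):

{
    "type": "BLOOM-176B",
    "checkpoints": ["/path/to/bloom-tp/shard_0.pt", "/path/to/bloom-tp/shard_1.pt"],
    "version": 1.0,
    "parallelization": "tp"
}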

I will make it all configurable once the dust settles.

RezaYazdaniAminabadi commented 2 years ago

Hi @asaparov

It's great to see your issue is solved. As @stas00 mentioned, the part regarding the new checkpoint loading is coming soon too. @stas00, thanks for the full details here :)

Best, Reza

pai4451 commented 2 years ago

@stas00 Actually I just tested both bigscience/bloom and bigscience/bloom-1b3 without CUDA_LAUNCH_BLOCKING=1 and they both work. This is probably because I pulled newer code from the bloom-inference branch of this repo (commit b76e516) and the code from the ds-inference/bloom-fix branch of DeepSpeed (commit f39c78f).

I had to fix a few bugs related to save_mp_checkpoint_path being set to False instead of None, but everything seems to work fine after that.

@asaparov Can you share your code for BLOOM inference, or give me an idea of which inference repo you used and whether you made any code modifications? I have the same hardware setup as yours, but I can't get rid of the CUDA errors even when adding CUDA_LAUNCH_BLOCKING=1. I used the inference code on the bloom-inference branch and the DeepSpeed ds-inference/bloom-fix branch. Also, did you set the environment variable WORLD_SIZE?

asaparov commented 2 years ago

@pai4451 I didn't change any code from this repo at all. I followed the installation instructions in the readme. I invoke the inference script using: deepspeed --hostfile=$hostfile Megatron-DeepSpeed/scripts/inference/bloom-ds-inference.py --name bigscience/bloom

I'm running everything in a conda environment inside a Singularity container. The output of conda list is:

Singularity> conda list
# packages in environment at /ext3/miniconda3:
#
# Name                    Version                   Build  Channel
_libgcc_mutex             0.1                 conda_forge    conda-forge
_openmp_mutex             4.5                  2_kmp_llvm    conda-forge
absl-py                   1.2.0                    pypi_0    pypi
aiohttp                   3.8.1                    pypi_0    pypi
aiosignal                 1.2.0                    pypi_0    pypi
apex                      0.1                      pypi_0    pypi
appdirs                   1.4.4                    pypi_0    pypi
async-timeout             4.0.2                    pypi_0    pypi
attrs                     21.4.0                   pypi_0    pypi
black                     21.4b0                   pypi_0    pypi
blas                      2.115                       mkl    conda-forge
blas-devel                3.9.0            15_linux64_mkl    conda-forge
brotlipy                  0.7.0           py39hb9d737c_1004    conda-forge
bzip2                     1.0.8                h7f98852_4    conda-forge
ca-certificates           2022.6.15            ha878542_0    conda-forge
cachetools                5.2.0                    pypi_0    pypi
certifi                   2022.6.15        py39hf3d152e_0    conda-forge
cffi                      1.15.1           py39he91dace_0    conda-forge
charset-normalizer        2.1.0              pyhd8ed1ab_0    conda-forge
click                     8.1.3                    pypi_0    pypi
colorama                  0.4.5              pyhd8ed1ab_0    conda-forge
conda                     4.13.0           py39hf3d152e_1    conda-forge
conda-package-handling    1.8.1            py39hb9d737c_1    conda-forge
cryptography              37.0.4           py39hd97740a_0    conda-forge
cudatoolkit               11.6.0              hecad31d_10    conda-forge
datasets                  2.4.0                    pypi_0    pypi
deepspeed                 0.7.0+f39c78f9            dev_0    <develop>
dill                      0.3.5.1                  pypi_0    pypi
filelock                  3.7.1                    pypi_0    pypi
frozenlist                1.3.0                    pypi_0    pypi
fsspec                    2022.5.0                 pypi_0    pypi
google-auth               2.9.1                    pypi_0    pypi
google-auth-oauthlib      0.4.6                    pypi_0    pypi
grpcio                    1.47.0                   pypi_0    pypi
hjson                     3.0.2                    pypi_0    pypi
huggingface-hub           0.8.1                    pypi_0    pypi
idna                      3.3                pyhd8ed1ab_0    conda-forge
importlib-metadata        4.12.0                   pypi_0    pypi
isort                     5.10.1                   pypi_0    pypi
joblib                    1.1.0                    pypi_0    pypi
ld_impl_linux-64          2.36.1               hea4e1c9_2    conda-forge
libaio                    0.3.113              h5eee18b_0    <unknown>
libblas                   3.9.0            15_linux64_mkl    conda-forge
libcblas                  3.9.0            15_linux64_mkl    conda-forge
libffi                    3.4.2                h7f98852_5    conda-forge
libgcc-ng                 12.1.0              h8d9b700_16    conda-forge
libgfortran-ng            12.1.0              h69a702a_16    conda-forge
libgfortran5              12.1.0              hdcd56e2_16    conda-forge
libgomp                   12.1.0              h8d9b700_16    conda-forge
liblapack                 3.9.0            15_linux64_mkl    conda-forge
liblapacke                3.9.0            15_linux64_mkl    conda-forge
libnsl                    2.0.0                h7f98852_0    conda-forge
libstdcxx-ng              12.1.0              ha89aaad_16    conda-forge
libuuid                   2.32.1            h7f98852_1000    conda-forge
libzlib                   1.2.12               h166bdaf_2    conda-forge
llvm-openmp               14.0.4               he0ac6c6_0    conda-forge
markdown                  3.4.1                    pypi_0    pypi
markupsafe                2.1.1                    pypi_0    pypi
mkl                       2022.1.0           h84fe81f_915    conda-forge
mkl-devel                 2022.1.0           ha770c72_916    conda-forge
mkl-include               2022.1.0           h84fe81f_915    conda-forge
multidict                 6.0.2                    pypi_0    pypi
multiprocess              0.70.13                  pypi_0    pypi
mypy-extensions           0.4.3                    pypi_0    pypi
ncurses                   6.3                  h27087fc_1    conda-forge
ninja                     1.10.2.3                 pypi_0    pypi
nltk                      3.7                      pypi_0    pypi
numpy                     1.23.1                   pypi_0    pypi
oauthlib                  3.2.0                    pypi_0    pypi
openssl                   1.1.1q               h166bdaf_0    conda-forge
packaging                 21.3                     pypi_0    pypi
pandas                    1.4.3                    pypi_0    pypi
parameterized             0.8.1                    pypi_0    pypi
pathspec                  0.9.0                    pypi_0    pypi
pip                       22.2               pyhd8ed1ab_0    conda-forge
protobuf                  3.19.4                   pypi_0    pypi
psutil                    5.9.1                    pypi_0    pypi
py-cpuinfo                8.0.0                    pypi_0    pypi
pyarrow                   8.0.0                    pypi_0    pypi
pyasn1                    0.4.8                    pypi_0    pypi
pyasn1-modules            0.2.8                    pypi_0    pypi
pybind11                  2.10.0                   pypi_0    pypi
pycosat                   0.6.3           py39hb9d737c_1010    conda-forge
pycparser                 2.21               pyhd8ed1ab_0    conda-forge
pydantic                  1.9.1                    pypi_0    pypi
pyopenssl                 22.0.0             pyhd8ed1ab_0    conda-forge
pyparsing                 3.0.9                    pypi_0    pypi
pysocks                   1.7.1            py39hf3d152e_5    conda-forge
python                    3.9.13          h9a8a25e_0_cpython    conda-forge
python-dateutil           2.8.2                    pypi_0    pypi
python_abi                3.9                      2_cp39    conda-forge
pytorch                   1.12.0          py3.9_cuda11.6_cudnn8.3.2_0    pytorch
pytorch-mutex             1.0                        cuda    pytorch
pytz                      2022.1                   pypi_0    pypi
pyyaml                    6.0                      pypi_0    pypi
readline                  8.1.2                h0f457ee_0    conda-forge
regex                     2022.7.25                pypi_0    pypi
requests                  2.28.1             pyhd8ed1ab_0    conda-forge
requests-oauthlib         1.3.1                    pypi_0    pypi
responses                 0.18.0                   pypi_0    pypi
rsa                       4.9                      pypi_0    pypi
ruamel_yaml               0.15.80         py39hb9d737c_1007    conda-forge
setuptools                63.2.0           py39hf3d152e_0    conda-forge
six                       1.16.0             pyh6c4a22f_0    conda-forge
sqlite                    3.39.2               h4ff8645_0    conda-forge
tbb                       2021.5.0             h924138e_1    conda-forge
tensorboard               2.9.1                    pypi_0    pypi
tensorboard-data-server   0.6.1                    pypi_0    pypi
tensorboard-plugin-wit    1.8.1                    pypi_0    pypi
tk                        8.6.12               h27826a3_0    conda-forge
tokenizers                0.12.1                   pypi_0    pypi
toml                      0.10.2                   pypi_0    pypi
tqdm                      4.64.0             pyhd8ed1ab_0    conda-forge
transformers              4.20.1                   pypi_0    pypi
typing_extensions         4.3.0              pyha770c72_0    conda-forge
tzdata                    2022a                h191b570_0    conda-forge
urllib3                   1.26.11            pyhd8ed1ab_0    conda-forge
werkzeug                  2.2.0                    pypi_0    pypi
wheel                     0.37.1             pyhd8ed1ab_0    conda-forge
xxhash                    3.0.0                    pypi_0    pypi
xz                        5.2.5                h516909a_1    conda-forge
yaml                      0.2.5                h7f98852_2    conda-forge
yarl                      1.7.2                    pypi_0    pypi
zipp                      3.8.1                    pypi_0    pypi
zlib                      1.2.12               h166bdaf_2    conda-forge

For this repo and DeepSpeed, I'm using the commits mentioned above. I had a few errors from DeepSpeed complaining about save_mp_checkpoint_path, which I fixed with the following changes:

diff --git a/deepspeed/__init__.py b/deepspeed/__init__.py
index 655d7a96..50049a2a 100755
--- a/deepspeed/__init__.py
+++ b/deepspeed/__init__.py
@@ -239,7 +239,7 @@ def init_inference(model,
                    moe_type='standard',
                    args=None,
                    enable_cuda_graph=False,
-                   save_mp_checkpoint_path=False):
+                   save_mp_checkpoint_path=None):
     """Initialize the DeepSpeed InferenceEngine.

     Arguments:
diff --git a/deepspeed/inference/engine.py b/deepspeed/inference/engine.py
index b5841dab..f380cd21 100755
--- a/deepspeed/inference/engine.py
+++ b/deepspeed/inference/engine.py
@@ -50,7 +50,7 @@ class InferenceEngine(Module):
                  moe_type='standard',
                  config=None,
                  enable_cuda_graph=False,
-                 save_mp_checkpoint_path=False):
+                 save_mp_checkpoint_path=None):
         """
         Args:
             model: torch.nn.Module
@@ -322,7 +322,7 @@ class InferenceEngine(Module):
                                 moe_type='standard',
                                 training_mp_size=1,
                                 checkpoint_dir=None,
-                                save_mp_checkpoint_path=False):
+                                save_mp_checkpoint_path=None):
         checkpoint, ckpt_type = SDLoaderFactory.get_sd_loader_json(
             checkpoint_dir) if checkpoint_dir is not None else (None, None)
         replace_transformer_layer(client_module,

I also had to make a few other edits to deepspeed since I wanted each worker to run within the singularity container, and to prevent ssh from complaining about host key authentication (I'm running this on a cluster).

pai4451 commented 2 years ago

@asaparov Thanks for the details. I can finally run BLOOM inference with DeepSpeed on multiple nodes now. However, it only works for batch_size=1; when I increase the batch size, the error RuntimeError: CUDA error: an illegal memory access was encountered is thrown again. Do you have the same issue, or can you run inference with a batch size larger than 1 on your side? Thank you.

mayank31398 commented 2 years ago

Hmm, it's not working for me even within a single node with batch size = 1, on 8x A100 80GB. Same CUDA illegal memory access error.

pohunghuang-nctu commented 2 years ago

Hmm, it's not working for me even within a single node with batch size = 1, on 8x A100 80GB. Same CUDA illegal memory access error.

See if "NCCL WARN Call to ibv_reg_mr failed" appears in your log. In my case, we modified /etc/security/limits.conf to resolve it. You can find details here: https://docs.nvidia.com/deeplearning/nccl/user-guide/docs/troubleshooting.html
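For reference, the limits.conf change amounts to raising the locked-memory limit; a minimal sketch of the entries, based on the NCCL troubleshooting guide (adjust to your site's policy):

*    soft    memlock    unlimited
*    hard    memlock    unlimited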

But it also only works for batch size == 1.

mayank31398 commented 2 years ago

@pohunghuang-nctu There's nothing like that in my logs. This is the full trace:

[2022-07-26 11:41:08,472] [WARNING] [runner.py:159:fetch_hostfile] Unable to find hostfile, will proceed with training with local resources only.
[2022-07-26 11:41:11,508] [INFO] [runner.py:457:main] cmd = /net/llm-shared-nfs/nfs/yelkurdi/conda/miniconda3/envs/bloom/bin/python -u -m deepspeed.launcher.launch --world_info=eyJsb2NhbGhvc3QiOiBbMCwgMSwgMiwgMywgNCwgNSwgNiwgN119 --master_addr=127.0.0.1 --master_port=29500 scripts/inference/bloom-ds-inference.py --name bigscience/bloom --benchmark
[2022-07-26 11:41:12,431] [INFO] [launch.py:96:main] 0 NCCL_IB_DISABLE=1
[2022-07-26 11:41:12,431] [INFO] [launch.py:96:main] 0 NCCL_DEBUG=INFO
[2022-07-26 11:41:12,431] [INFO] [launch.py:103:main] WORLD INFO DICT: {'localhost': [0, 1, 2, 3, 4, 5, 6, 7]}
[2022-07-26 11:41:12,431] [INFO] [launch.py:109:main] nnodes=1, num_local_procs=8, node_rank=0
[2022-07-26 11:41:12,431] [INFO] [launch.py:122:main] global_rank_mapping=defaultdict(<class 'list'>, {'localhost': [0, 1, 2, 3, 4, 5, 6, 7]})
[2022-07-26 11:41:12,431] [INFO] [launch.py:123:main] dist_world_size=8
[2022-07-26 11:41:12,431] [INFO] [launch.py:125:main] Setting CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7
[2022-07-26 11:41:13,715] [INFO] [comm.py:423:init_distributed] Initializing TorchBackend in DeepSpeed with backend nccl
*** Loading the model bigscience/bloom
[2022-07-26 11:41:22,608] [INFO] [utils.py:827:see_memory_usage] pre-from-pretrained
[2022-07-26 11:41:22,608] [INFO] [utils.py:828:see_memory_usage] MA 0.0 GB         Max_MA 0.0 GB         CA 0.0 GB         Max_CA 0 GB
[2022-07-26 11:41:22,608] [INFO] [utils.py:836:see_memory_usage] CPU Virtual Memory:  used = 11.2 GB, percent = 0.9%
[2022-07-26 11:41:22,745] [INFO] [utils.py:827:see_memory_usage] post-from-pretrained
[2022-07-26 11:41:22,746] [INFO] [utils.py:828:see_memory_usage] MA 0.0 GB         Max_MA 0.0 GB         CA 0.0 GB         Max_CA 0 GB
[2022-07-26 11:41:22,746] [INFO] [utils.py:836:see_memory_usage] CPU Virtual Memory:  used = 11.21 GB, percent = 0.9%
[2022-07-26 11:41:22,795] [INFO] [utils.py:827:see_memory_usage] post-init-ds-zero-init
[2022-07-26 11:41:22,795] [INFO] [utils.py:828:see_memory_usage] MA 0.0 GB         Max_MA 0.0 GB         CA 0.0 GB         Max_CA 0 GB
[2022-07-26 11:41:22,796] [INFO] [utils.py:836:see_memory_usage] CPU Virtual Memory:  used = 11.27 GB, percent = 0.9%
llm-test-cluster-9:1281341:1281341 [0] NCCL INFO Bootstrap : Using eth0:10.241.128.4<0>
llm-test-cluster-9:1281341:1281341 [0] NCCL INFO NET/Plugin : No plugin found (libnccl-net.so), using internal implementation
llm-test-cluster-9:1281341:1281341 [0] NCCL INFO NCCL_IB_DISABLE set by environment to 1.
llm-test-cluster-9:1281341:1281341 [0] NCCL INFO NET/Socket : Using [0]eth0:10.241.128.4<0> [1]eth1:10.241.129.13<0>
llm-test-cluster-9:1281341:1281341 [0] NCCL INFO Using network Socket
NCCL version 2.10.3+cuda11.6
llm-test-cluster-9:1281345:1281345 [4] NCCL INFO Bootstrap : Using eth0:10.241.128.4<0>
llm-test-cluster-9:1281343:1281343 [2] NCCL INFO Bootstrap : Using eth0:10.241.128.4<0>
llm-test-cluster-9:1281342:1281342 [1] NCCL INFO Bootstrap : Using eth0:10.241.128.4<0>
llm-test-cluster-9:1281344:1281344 [3] NCCL INFO Bootstrap : Using eth0:10.241.128.4<0>
llm-test-cluster-9:1281347:1281347 [6] NCCL INFO Bootstrap : Using eth0:10.241.128.4<0>
llm-test-cluster-9:1281346:1281346 [5] NCCL INFO Bootstrap : Using eth0:10.241.128.4<0>
llm-test-cluster-9:1281348:1281348 [7] NCCL INFO Bootstrap : Using eth0:10.241.128.4<0>
llm-test-cluster-9:1281347:1281347 [6] NCCL INFO NET/Plugin : No plugin found (libnccl-net.so), using internal implementation
llm-test-cluster-9:1281342:1281342 [1] NCCL INFO NET/Plugin : No plugin found (libnccl-net.so), using internal implementation
llm-test-cluster-9:1281346:1281346 [5] NCCL INFO NET/Plugin : No plugin found (libnccl-net.so), using internal implementation
llm-test-cluster-9:1281348:1281348 [7] NCCL INFO NET/Plugin : No plugin found (libnccl-net.so), using internal implementation
llm-test-cluster-9:1281345:1281345 [4] NCCL INFO NET/Plugin : No plugin found (libnccl-net.so), using internal implementation
llm-test-cluster-9:1281344:1281344 [3] NCCL INFO NET/Plugin : No plugin found (libnccl-net.so), using internal implementation
llm-test-cluster-9:1281343:1281343 [2] NCCL INFO NET/Plugin : No plugin found (libnccl-net.so), using internal implementation
llm-test-cluster-9:1281342:1281342 [1] NCCL INFO NCCL_IB_DISABLE set by environment to 1.
llm-test-cluster-9:1281346:1281346 [5] NCCL INFO NCCL_IB_DISABLE set by environment to 1.
llm-test-cluster-9:1281347:1281347 [6] NCCL INFO NCCL_IB_DISABLE set by environment to 1.
llm-test-cluster-9:1281345:1281345 [4] NCCL INFO NCCL_IB_DISABLE set by environment to 1.
llm-test-cluster-9:1281344:1281344 [3] NCCL INFO NCCL_IB_DISABLE set by environment to 1.
llm-test-cluster-9:1281348:1281348 [7] NCCL INFO NCCL_IB_DISABLE set by environment to 1.
llm-test-cluster-9:1281343:1281343 [2] NCCL INFO NCCL_IB_DISABLE set by environment to 1.
llm-test-cluster-9:1281346:1281346 [5] NCCL INFO NET/Socket : Using [0]eth0:10.241.128.4<0> [1]eth1:10.241.129.13<0>
llm-test-cluster-9:1281346:1281346 [5] NCCL INFO Using network Socket
llm-test-cluster-9:1281344:1281344 [3] NCCL INFO NET/Socket : Using [0]eth0:10.241.128.4<0> [1]eth1:10.241.129.13<0>
llm-test-cluster-9:1281348:1281348 [7] NCCL INFO NET/Socket : Using [0]eth0:10.241.128.4<0> [1]eth1:10.241.129.13<0>
llm-test-cluster-9:1281347:1281347 [6] NCCL INFO NET/Socket : Using [0]eth0:10.241.128.4<0> [1]eth1:10.241.129.13<0>
llm-test-cluster-9:1281342:1281342 [1] NCCL INFO NET/Socket : Using [0]eth0:10.241.128.4<0> [1]eth1:10.241.129.13<0>
llm-test-cluster-9:1281343:1281343 [2] NCCL INFO NET/Socket : Using [0]eth0:10.241.128.4<0> [1]eth1:10.241.129.13<0>
llm-test-cluster-9:1281345:1281345 [4] NCCL INFO NET/Socket : Using [0]eth0:10.241.128.4<0> [1]eth1:10.241.129.13<0>
llm-test-cluster-9:1281347:1281347 [6] NCCL INFO Using network Socket
llm-test-cluster-9:1281342:1281342 [1] NCCL INFO Using network Socket
llm-test-cluster-9:1281344:1281344 [3] NCCL INFO Using network Socket
llm-test-cluster-9:1281348:1281348 [7] NCCL INFO Using network Socket
llm-test-cluster-9:1281343:1281343 [2] NCCL INFO Using network Socket
llm-test-cluster-9:1281345:1281345 [4] NCCL INFO Using network Socket
llm-test-cluster-9:1281344:1281405 [3] NCCL INFO Trees [0] 4/-1/-1->3->2 [1] 4/-1/-1->3->2 [2] 4/-1/-1->3->2 [3] 4/-1/-1->3->2 [4] 4/-1/-1->3->2 [5] 4/-1/-1->3->2 [6] 4/-1/-1->3->2 [7] 4/-1/-1->3->2 [8] 4/-1/-1->3->2 [9] 4/-1/-1->3->2 [10] 4/-1/-1->3->2 [11] 4/-1/-1->3->2 [12] 4/-1/-1->3->2 [13] 4/-1/-1->3->2 [14] 4/-1/-1->3->2 [15] 4/-1/-1->3->2 [16] 4/-1/-1->3->2 [17] 4/-1/-1->3->2 [18] 4/-1/-1->3->2 [19] 4/-1/-1->3->2 [20] 4/-1/-1->3->2 [21] 4/-1/-1->3->2 [22] 4/-1/-1->3->2 [23] 4/-1/-1->3->2
llm-test-cluster-9:1281345:1281410 [4] NCCL INFO Trees [0] 5/-1/-1->4->3 [1] 5/-1/-1->4->3 [2] 5/-1/-1->4->3 [3] 5/-1/-1->4->3 [4] 5/-1/-1->4->3 [5] 5/-1/-1->4->3 [6] 5/-1/-1->4->3 [7] 5/-1/-1->4->3 [8] 5/-1/-1->4->3 [9] 5/-1/-1->4->3 [10] 5/-1/-1->4->3 [11] 5/-1/-1->4->3 [12] 5/-1/-1->4->3 [13] 5/-1/-1->4->3 [14] 5/-1/-1->4->3 [15] 5/-1/-1->4->3 [16] 5/-1/-1->4->3 [17] 5/-1/-1->4->3 [18] 5/-1/-1->4->3 [19] 5/-1/-1->4->3 [20] 5/-1/-1->4->3 [21] 5/-1/-1->4->3 [22] 5/-1/-1->4->3 [23] 5/-1/-1->4->3
llm-test-cluster-9:1281346:1281404 [5] NCCL INFO Trees [0] 6/-1/-1->5->4 [1] 6/-1/-1->5->4 [2] 6/-1/-1->5->4 [3] 6/-1/-1->5->4 [4] 6/-1/-1->5->4 [5] 6/-1/-1->5->4 [6] 6/-1/-1->5->4 [7] 6/-1/-1->5->4 [8] 6/-1/-1->5->4 [9] 6/-1/-1->5->4 [10] 6/-1/-1->5->4 [11] 6/-1/-1->5->4 [12] 6/-1/-1->5->4 [13] 6/-1/-1->5->4 [14] 6/-1/-1->5->4 [15] 6/-1/-1->5->4 [16] 6/-1/-1->5->4 [17] 6/-1/-1->5->4 [18] 6/-1/-1->5->4 [19] 6/-1/-1->5->4 [20] 6/-1/-1->5->4 [21] 6/-1/-1->5->4 [22] 6/-1/-1->5->4 [23] 6/-1/-1->5->4
llm-test-cluster-9:1281347:1281407 [6] NCCL INFO Trees [0] 7/-1/-1->6->5 [1] 7/-1/-1->6->5 [2] 7/-1/-1->6->5 [3] 7/-1/-1->6->5 [4] 7/-1/-1->6->5 [5] 7/-1/-1->6->5 [6] 7/-1/-1->6->5 [7] 7/-1/-1->6->5 [8] 7/-1/-1->6->5 [9] 7/-1/-1->6->5 [10] 7/-1/-1->6->5 [11] 7/-1/-1->6->5 [12] 7/-1/-1->6->5 [13] 7/-1/-1->6->5 [14] 7/-1/-1->6->5 [15] 7/-1/-1->6->5 [16] 7/-1/-1->6->5 [17] 7/-1/-1->6->5 [18] 7/-1/-1->6->5 [19] 7/-1/-1->6->5 [20] 7/-1/-1->6->5 [21] 7/-1/-1->6->5 [22] 7/-1/-1->6->5 [23] 7/-1/-1->6->5
llm-test-cluster-9:1281348:1281409 [7] NCCL INFO Trees [0] -1/-1/-1->7->6 [1] -1/-1/-1->7->6 [2] -1/-1/-1->7->6 [3] -1/-1/-1->7->6 [4] -1/-1/-1->7->6 [5] -1/-1/-1->7->6 [6] -1/-1/-1->7->6 [7] -1/-1/-1->7->6 [8] -1/-1/-1->7->6 [9] -1/-1/-1->7->6 [10] -1/-1/-1->7->6 [11] -1/-1/-1->7->6 [12] -1/-1/-1->7->6 [13] -1/-1/-1->7->6 [14] -1/-1/-1->7->6 [15] -1/-1/-1->7->6 [16] -1/-1/-1->7->6 [17] -1/-1/-1->7->6 [18] -1/-1/-1->7->6 [19] -1/-1/-1->7->6 [20] -1/-1/-1->7->6 [21] -1/-1/-1->7->6 [22] -1/-1/-1->7->6 [23] -1/-1/-1->7->6
llm-test-cluster-9:1281343:1281406 [2] NCCL INFO Trees [0] 3/-1/-1->2->1 [1] 3/-1/-1->2->1 [2] 3/-1/-1->2->1 [3] 3/-1/-1->2->1 [4] 3/-1/-1->2->1 [5] 3/-1/-1->2->1 [6] 3/-1/-1->2->1 [7] 3/-1/-1->2->1 [8] 3/-1/-1->2->1 [9] 3/-1/-1->2->1 [10] 3/-1/-1->2->1 [11] 3/-1/-1->2->1 [12] 3/-1/-1->2->1 [13] 3/-1/-1->2->1 [14] 3/-1/-1->2->1 [15] 3/-1/-1->2->1 [16] 3/-1/-1->2->1 [17] 3/-1/-1->2->1 [18] 3/-1/-1->2->1 [19] 3/-1/-1->2->1 [20] 3/-1/-1->2->1 [21] 3/-1/-1->2->1 [22] 3/-1/-1->2->1 [23] 3/-1/-1->2->1
llm-test-cluster-9:1281342:1281408 [1] NCCL INFO Trees [0] 2/-1/-1->1->0 [1] 2/-1/-1->1->0 [2] 2/-1/-1->1->0 [3] 2/-1/-1->1->0 [4] 2/-1/-1->1->0 [5] 2/-1/-1->1->0 [6] 2/-1/-1->1->0 [7] 2/-1/-1->1->0 [8] 2/-1/-1->1->0 [9] 2/-1/-1->1->0 [10] 2/-1/-1->1->0 [11] 2/-1/-1->1->0 [12] 2/-1/-1->1->0 [13] 2/-1/-1->1->0 [14] 2/-1/-1->1->0 [15] 2/-1/-1->1->0 [16] 2/-1/-1->1->0 [17] 2/-1/-1->1->0 [18] 2/-1/-1->1->0 [19] 2/-1/-1->1->0 [20] 2/-1/-1->1->0 [21] 2/-1/-1->1->0 [22] 2/-1/-1->1->0 [23] 2/-1/-1->1->0
llm-test-cluster-9:1281341:1281403 [0] NCCL INFO Channel 00/24 :    0   1   2   3   4   5   6   7
llm-test-cluster-9:1281341:1281403 [0] NCCL INFO Channel 01/24 :    0   1   2   3   4   5   6   7
llm-test-cluster-9:1281341:1281403 [0] NCCL INFO Channel 02/24 :    0   1   2   3   4   5   6   7
llm-test-cluster-9:1281341:1281403 [0] NCCL INFO Channel 03/24 :    0   1   2   3   4   5   6   7
llm-test-cluster-9:1281341:1281403 [0] NCCL INFO Channel 04/24 :    0   1   2   3   4   5   6   7
llm-test-cluster-9:1281341:1281403 [0] NCCL INFO Channel 05/24 :    0   1   2   3   4   5   6   7
llm-test-cluster-9:1281341:1281403 [0] NCCL INFO Channel 06/24 :    0   1   2   3   4   5   6   7
llm-test-cluster-9:1281341:1281403 [0] NCCL INFO Channel 07/24 :    0   1   2   3   4   5   6   7
llm-test-cluster-9:1281341:1281403 [0] NCCL INFO Channel 08/24 :    0   1   2   3   4   5   6   7
llm-test-cluster-9:1281341:1281403 [0] NCCL INFO Channel 09/24 :    0   1   2   3   4   5   6   7
llm-test-cluster-9:1281341:1281403 [0] NCCL INFO Channel 10/24 :    0   1   2   3   4   5   6   7
llm-test-cluster-9:1281341:1281403 [0] NCCL INFO Channel 11/24 :    0   1   2   3   4   5   6   7
llm-test-cluster-9:1281341:1281403 [0] NCCL INFO Channel 12/24 :    0   1   2   3   4   5   6   7
llm-test-cluster-9:1281341:1281403 [0] NCCL INFO Channel 13/24 :    0   1   2   3   4   5   6   7
llm-test-cluster-9:1281341:1281403 [0] NCCL INFO Channel 14/24 :    0   1   2   3   4   5   6   7
llm-test-cluster-9:1281341:1281403 [0] NCCL INFO Channel 15/24 :    0   1   2   3   4   5   6   7
llm-test-cluster-9:1281341:1281403 [0] NCCL INFO Channel 16/24 :    0   1   2   3   4   5   6   7
llm-test-cluster-9:1281341:1281403 [0] NCCL INFO Channel 17/24 :    0   1   2   3   4   5   6   7
llm-test-cluster-9:1281341:1281403 [0] NCCL INFO Channel 18/24 :    0   1   2   3   4   5   6   7
llm-test-cluster-9:1281341:1281403 [0] NCCL INFO Channel 19/24 :    0   1   2   3   4   5   6   7
llm-test-cluster-9:1281341:1281403 [0] NCCL INFO Channel 20/24 :    0   1   2   3   4   5   6   7
llm-test-cluster-9:1281341:1281403 [0] NCCL INFO Channel 21/24 :    0   1   2   3   4   5   6   7
llm-test-cluster-9:1281341:1281403 [0] NCCL INFO Channel 22/24 :    0   1   2   3   4   5   6   7
llm-test-cluster-9:1281341:1281403 [0] NCCL INFO Channel 23/24 :    0   1   2   3   4   5   6   7
llm-test-cluster-9:1281341:1281403 [0] NCCL INFO Trees [0] 1/-1/-1->0->-1 [1] 1/-1/-1->0->-1 [2] 1/-1/-1->0->-1 [3] 1/-1/-1->0->-1 [4] 1/-1/-1->0->-1 [5] 1/-1/-1->0->-1 [6] 1/-1/-1->0->-1 [7] 1/-1/-1->0->-1 [8] 1/-1/-1->0->-1 [9] 1/-1/-1->0->-1 [10] 1/-1/-1->0->-1 [11] 1/-1/-1->0->-1 [12] 1/-1/-1->0->-1 [13] 1/-1/-1->0->-1 [14] 1/-1/-1->0->-1 [15] 1/-1/-1->0->-1 [16] 1/-1/-1->0->-1 [17] 1/-1/-1->0->-1 [18] 1/-1/-1->0->-1 [19] 1/-1/-1->0->-1 [20] 1/-1/-1->0->-1 [21] 1/-1/-1->0->-1 [22] 1/-1/-1->0->-1 [23] 1/-1/-1->0->-1
llm-test-cluster-9:1281344:1281405 [3] NCCL INFO Channel 00 : 3[40a0] -> 4[40b0] via P2P/IPC/read
llm-test-cluster-9:1281344:1281405 [3] NCCL INFO Channel 01 : 3[40a0] -> 4[40b0] via P2P/IPC/read
llm-test-cluster-9:1281344:1281405 [3] NCCL INFO Channel 02 : 3[40a0] -> 4[40b0] via P2P/IPC/read
llm-test-cluster-9:1281344:1281405 [3] NCCL INFO Channel 03 : 3[40a0] -> 4[40b0] via P2P/IPC/read
llm-test-cluster-9:1281344:1281405 [3] NCCL INFO Channel 04 : 3[40a0] -> 4[40b0] via P2P/IPC/read
llm-test-cluster-9:1281344:1281405 [3] NCCL INFO Channel 05 : 3[40a0] -> 4[40b0] via P2P/IPC/read
llm-test-cluster-9:1281348:1281409 [7] NCCL INFO Channel 00 : 7[40e0] -> 0[4070] via P2P/IPC/read
llm-test-cluster-9:1281344:1281405 [3] NCCL INFO Channel 06 : 3[40a0] -> 4[40b0] via P2P/IPC/read
llm-test-cluster-9:1281348:1281409 [7] NCCL INFO Channel 01 : 7[40e0] -> 0[4070] via P2P/IPC/read
llm-test-cluster-9:1281346:1281404 [5] NCCL INFO Channel 00 : 5[40c0] -> 6[40d0] via P2P/IPC/read
llm-test-cluster-9:1281344:1281405 [3] NCCL INFO Channel 07 : 3[40a0] -> 4[40b0] via P2P/IPC/read
llm-test-cluster-9:1281342:1281408 [1] NCCL INFO Channel 00 : 1[4080] -> 2[4090] via P2P/IPC/read
llm-test-cluster-9:1281348:1281409 [7] NCCL INFO Channel 02 : 7[40e0] -> 0[4070] via P2P/IPC/read
llm-test-cluster-9:1281346:1281404 [5] NCCL INFO Channel 01 : 5[40c0] -> 6[40d0] via P2P/IPC/read
llm-test-cluster-9:1281344:1281405 [3] NCCL INFO Channel 08 : 3[40a0] -> 4[40b0] via P2P/IPC/read
llm-test-cluster-9:1281342:1281408 [1] NCCL INFO Channel 01 : 1[4080] -> 2[4090] via P2P/IPC/read
llm-test-cluster-9:1281348:1281409 [7] NCCL INFO Channel 03 : 7[40e0] -> 0[4070] via P2P/IPC/read
llm-test-cluster-9:1281346:1281404 [5] NCCL INFO Channel 02 : 5[40c0] -> 6[40d0] via P2P/IPC/read
llm-test-cluster-9:1281344:1281405 [3] NCCL INFO Channel 09 : 3[40a0] -> 4[40b0] via P2P/IPC/read
llm-test-cluster-9:1281342:1281408 [1] NCCL INFO Channel 02 : 1[4080] -> 2[4090] via P2P/IPC/read
llm-test-cluster-9:1281348:1281409 [7] NCCL INFO Channel 04 : 7[40e0] -> 0[4070] via P2P/IPC/read
llm-test-cluster-9:1281346:1281404 [5] NCCL INFO Channel 03 : 5[40c0] -> 6[40d0] via P2P/IPC/read
llm-test-cluster-9:1281344:1281405 [3] NCCL INFO Channel 10 : 3[40a0] -> 4[40b0] via P2P/IPC/read
llm-test-cluster-9:1281342:1281408 [1] NCCL INFO Channel 03 : 1[4080] -> 2[4090] via P2P/IPC/read
llm-test-cluster-9:1281348:1281409 [7] NCCL INFO Channel 05 : 7[40e0] -> 0[4070] via P2P/IPC/read
llm-test-cluster-9:1281346:1281404 [5] NCCL INFO Channel 04 : 5[40c0] -> 6[40d0] via P2P/IPC/read
llm-test-cluster-9:1281344:1281405 [3] NCCL INFO Channel 11 : 3[40a0] -> 4[40b0] via P2P/IPC/read
llm-test-cluster-9:1281342:1281408 [1] NCCL INFO Channel 04 : 1[4080] -> 2[4090] via P2P/IPC/read
llm-test-cluster-9:1281348:1281409 [7] NCCL INFO Channel 06 : 7[40e0] -> 0[4070] via P2P/IPC/read
llm-test-cluster-9:1281346:1281404 [5] NCCL INFO Channel 05 : 5[40c0] -> 6[40d0] via P2P/IPC/read
llm-test-cluster-9:1281344:1281405 [3] NCCL INFO Channel 12 : 3[40a0] -> 4[40b0] via P2P/IPC/read
llm-test-cluster-9:1281342:1281408 [1] NCCL INFO Channel 05 : 1[4080] -> 2[4090] via P2P/IPC/read
llm-test-cluster-9:1281347:1281407 [6] NCCL INFO Channel 00 : 6[40d0] -> 7[40e0] via P2P/IPC/read
llm-test-cluster-9:1281348:1281409 [7] NCCL INFO Channel 07 : 7[40e0] -> 0[4070] via P2P/IPC/read
llm-test-cluster-9:1281346:1281404 [5] NCCL INFO Channel 06 : 5[40c0] -> 6[40d0] via P2P/IPC/read
llm-test-cluster-9:1281344:1281405 [3] NCCL INFO Channel 13 : 3[40a0] -> 4[40b0] via P2P/IPC/read
llm-test-cluster-9:1281342:1281408 [1] NCCL INFO Channel 06 : 1[4080] -> 2[4090] via P2P/IPC/read
llm-test-cluster-9:1281347:1281407 [6] NCCL INFO Channel 01 : 6[40d0] -> 7[40e0] via P2P/IPC/read
llm-test-cluster-9:1281348:1281409 [7] NCCL INFO Channel 08 : 7[40e0] -> 0[4070] via P2P/IPC/read
llm-test-cluster-9:1281346:1281404 [5] NCCL INFO Channel 07 : 5[40c0] -> 6[40d0] via P2P/IPC/read
llm-test-cluster-9:1281344:1281405 [3] NCCL INFO Channel 14 : 3[40a0] -> 4[40b0] via P2P/IPC/read
llm-test-cluster-9:1281342:1281408 [1] NCCL INFO Channel 07 : 1[4080] -> 2[4090] via P2P/IPC/read
llm-test-cluster-9:1281347:1281407 [6] NCCL INFO Channel 02 : 6[40d0] -> 7[40e0] via P2P/IPC/read
llm-test-cluster-9:1281348:1281409 [7] NCCL INFO Channel 09 : 7[40e0] -> 0[4070] via P2P/IPC/read
llm-test-cluster-9:1281346:1281404 [5] NCCL INFO Channel 08 : 5[40c0] -> 6[40d0] via P2P/IPC/read
llm-test-cluster-9:1281344:1281405 [3] NCCL INFO Channel 15 : 3[40a0] -> 4[40b0] via P2P/IPC/read
llm-test-cluster-9:1281341:1281403 [0] NCCL INFO Channel 00 : 0[4070] -> 1[4080] via P2P/IPC/read
llm-test-cluster-9:1281342:1281408 [1] NCCL INFO Channel 08 : 1[4080] -> 2[4090] via P2P/IPC/read
llm-test-cluster-9:1281347:1281407 [6] NCCL INFO Channel 03 : 6[40d0] -> 7[40e0] via P2P/IPC/read
llm-test-cluster-9:1281348:1281409 [7] NCCL INFO Channel 10 : 7[40e0] -> 0[4070] via P2P/IPC/read
llm-test-cluster-9:1281346:1281404 [5] NCCL INFO Channel 09 : 5[40c0] -> 6[40d0] via P2P/IPC/read
llm-test-cluster-9:1281344:1281405 [3] NCCL INFO Channel 16 : 3[40a0] -> 4[40b0] via P2P/IPC/read
llm-test-cluster-9:1281341:1281403 [0] NCCL INFO Channel 01 : 0[4070] -> 1[4080] via P2P/IPC/read
llm-test-cluster-9:1281342:1281408 [1] NCCL INFO Channel 09 : 1[4080] -> 2[4090] via P2P/IPC/read
llm-test-cluster-9:1281347:1281407 [6] NCCL INFO Channel 04 : 6[40d0] -> 7[40e0] via P2P/IPC/read
llm-test-cluster-9:1281348:1281409 [7] NCCL INFO Channel 11 : 7[40e0] -> 0[4070] via P2P/IPC/read
llm-test-cluster-9:1281346:1281404 [5] NCCL INFO Channel 10 : 5[40c0] -> 6[40d0] via P2P/IPC/read
llm-test-cluster-9:1281344:1281405 [3] NCCL INFO Channel 17 : 3[40a0] -> 4[40b0] via P2P/IPC/read
llm-test-cluster-9:1281342:1281408 [1] NCCL INFO Channel 10 : 1[4080] -> 2[4090] via P2P/IPC/read
llm-test-cluster-9:1281341:1281403 [0] NCCL INFO Channel 02 : 0[4070] -> 1[4080] via P2P/IPC/read
llm-test-cluster-9:1281347:1281407 [6] NCCL INFO Channel 05 : 6[40d0] -> 7[40e0] via P2P/IPC/read
llm-test-cluster-9:1281343:1281406 [2] NCCL INFO Channel 00 : 2[4090] -> 3[40a0] via P2P/IPC/read
llm-test-cluster-9:1281348:1281409 [7] NCCL INFO Channel 12 : 7[40e0] -> 0[4070] via P2P/IPC/read
llm-test-cluster-9:1281346:1281404 [5] NCCL INFO Channel 11 : 5[40c0] -> 6[40d0] via P2P/IPC/read
llm-test-cluster-9:1281344:1281405 [3] NCCL INFO Channel 18 : 3[40a0] -> 4[40b0] via P2P/IPC/read
llm-test-cluster-9:1281342:1281408 [1] NCCL INFO Channel 11 : 1[4080] -> 2[4090] via P2P/IPC/read
llm-test-cluster-9:1281341:1281403 [0] NCCL INFO Channel 03 : 0[4070] -> 1[4080] via P2P/IPC/read
llm-test-cluster-9:1281347:1281407 [6] NCCL INFO Channel 06 : 6[40d0] -> 7[40e0] via P2P/IPC/read
llm-test-cluster-9:1281343:1281406 [2] NCCL INFO Channel 01 : 2[4090] -> 3[40a0] via P2P/IPC/read
llm-test-cluster-9:1281348:1281409 [7] NCCL INFO Channel 13 : 7[40e0] -> 0[4070] via P2P/IPC/read
llm-test-cluster-9:1281346:1281404 [5] NCCL INFO Channel 12 : 5[40c0] -> 6[40d0] via P2P/IPC/read
llm-test-cluster-9:1281344:1281405 [3] NCCL INFO Channel 19 : 3[40a0] -> 4[40b0] via P2P/IPC/read
llm-test-cluster-9:1281345:1281410 [4] NCCL INFO Channel 00 : 4[40b0] -> 5[40c0] via P2P/IPC/read
llm-test-cluster-9:1281342:1281408 [1] NCCL INFO Channel 12 : 1[4080] -> 2[4090] via P2P/IPC/read
llm-test-cluster-9:1281341:1281403 [0] NCCL INFO Channel 04 : 0[4070] -> 1[4080] via P2P/IPC/read
llm-test-cluster-9:1281347:1281407 [6] NCCL INFO Channel 07 : 6[40d0] -> 7[40e0] via P2P/IPC/read
llm-test-cluster-9:1281343:1281406 [2] NCCL INFO Channel 02 : 2[4090] -> 3[40a0] via P2P/IPC/read
llm-test-cluster-9:1281348:1281409 [7] NCCL INFO Channel 14 : 7[40e0] -> 0[4070] via P2P/IPC/read
llm-test-cluster-9:1281346:1281404 [5] NCCL INFO Channel 13 : 5[40c0] -> 6[40d0] via P2P/IPC/read
llm-test-cluster-9:1281344:1281405 [3] NCCL INFO Channel 20 : 3[40a0] -> 4[40b0] via P2P/IPC/read
llm-test-cluster-9:1281345:1281410 [4] NCCL INFO Channel 01 : 4[40b0] -> 5[40c0] via P2P/IPC/read
llm-test-cluster-9:1281342:1281408 [1] NCCL INFO Channel 13 : 1[4080] -> 2[4090] via P2P/IPC/read
llm-test-cluster-9:1281341:1281403 [0] NCCL INFO Channel 05 : 0[4070] -> 1[4080] via P2P/IPC/read
llm-test-cluster-9:1281347:1281407 [6] NCCL INFO Channel 08 : 6[40d0] -> 7[40e0] via P2P/IPC/read
llm-test-cluster-9:1281343:1281406 [2] NCCL INFO Channel 03 : 2[4090] -> 3[40a0] via P2P/IPC/read
llm-test-cluster-9:1281348:1281409 [7] NCCL INFO Channel 15 : 7[40e0] -> 0[4070] via P2P/IPC/read
llm-test-cluster-9:1281346:1281404 [5] NCCL INFO Channel 14 : 5[40c0] -> 6[40d0] via P2P/IPC/read
llm-test-cluster-9:1281344:1281405 [3] NCCL INFO Channel 21 : 3[40a0] -> 4[40b0] via P2P/IPC/read
llm-test-cluster-9:1281345:1281410 [4] NCCL INFO Channel 02 : 4[40b0] -> 5[40c0] via P2P/IPC/read
llm-test-cluster-9:1281342:1281408 [1] NCCL INFO Channel 14 : 1[4080] -> 2[4090] via P2P/IPC/read
llm-test-cluster-9:1281341:1281403 [0] NCCL INFO Channel 06 : 0[4070] -> 1[4080] via P2P/IPC/read
llm-test-cluster-9:1281347:1281407 [6] NCCL INFO Channel 09 : 6[40d0] -> 7[40e0] via P2P/IPC/read
llm-test-cluster-9:1281343:1281406 [2] NCCL INFO Channel 04 : 2[4090] -> 3[40a0] via P2P/IPC/read
llm-test-cluster-9:1281348:1281409 [7] NCCL INFO Channel 16 : 7[40e0] -> 0[4070] via P2P/IPC/read
llm-test-cluster-9:1281346:1281404 [5] NCCL INFO Channel 15 : 5[40c0] -> 6[40d0] via P2P/IPC/read
llm-test-cluster-9:1281344:1281405 [3] NCCL INFO Channel 22 : 3[40a0] -> 4[40b0] via P2P/IPC/read
llm-test-cluster-9:1281345:1281410 [4] NCCL INFO Channel 03 : 4[40b0] -> 5[40c0] via P2P/IPC/read
llm-test-cluster-9:1281342:1281408 [1] NCCL INFO Channel 15 : 1[4080] -> 2[4090] via P2P/IPC/read
llm-test-cluster-9:1281341:1281403 [0] NCCL INFO Channel 07 : 0[4070] -> 1[4080] via P2P/IPC/read
llm-test-cluster-9:1281347:1281407 [6] NCCL INFO Channel 10 : 6[40d0] -> 7[40e0] via P2P/IPC/read
llm-test-cluster-9:1281343:1281406 [2] NCCL INFO Channel 05 : 2[4090] -> 3[40a0] via P2P/IPC/read
llm-test-cluster-9:1281348:1281409 [7] NCCL INFO Channel 17 : 7[40e0] -> 0[4070] via P2P/IPC/read
llm-test-cluster-9:1281346:1281404 [5] NCCL INFO Channel 16 : 5[40c0] -> 6[40d0] via P2P/IPC/read
llm-test-cluster-9:1281344:1281405 [3] NCCL INFO Channel 23 : 3[40a0] -> 4[40b0] via P2P/IPC/read
llm-test-cluster-9:1281345:1281410 [4] NCCL INFO Channel 04 : 4[40b0] -> 5[40c0] via P2P/IPC/read
llm-test-cluster-9:1281342:1281408 [1] NCCL INFO Channel 16 : 1[4080] -> 2[4090] via P2P/IPC/read
llm-test-cluster-9:1281341:1281403 [0] NCCL INFO Channel 08 : 0[4070] -> 1[4080] via P2P/IPC/read
llm-test-cluster-9:1281347:1281407 [6] NCCL INFO Channel 11 : 6[40d0] -> 7[40e0] via P2P/IPC/read
llm-test-cluster-9:1281343:1281406 [2] NCCL INFO Channel 06 : 2[4090] -> 3[40a0] via P2P/IPC/read
llm-test-cluster-9:1281348:1281409 [7] NCCL INFO Channel 18 : 7[40e0] -> 0[4070] via P2P/IPC/read
llm-test-cluster-9:1281346:1281404 [5] NCCL INFO Channel 17 : 5[40c0] -> 6[40d0] via P2P/IPC/read
llm-test-cluster-9:1281345:1281410 [4] NCCL INFO Channel 05 : 4[40b0] -> 5[40c0] via P2P/IPC/read
llm-test-cluster-9:1281342:1281408 [1] NCCL INFO Channel 17 : 1[4080] -> 2[4090] via P2P/IPC/read
llm-test-cluster-9:1281341:1281403 [0] NCCL INFO Channel 09 : 0[4070] -> 1[4080] via P2P/IPC/read
llm-test-cluster-9:1281347:1281407 [6] NCCL INFO Channel 12 : 6[40d0] -> 7[40e0] via P2P/IPC/read
llm-test-cluster-9:1281343:1281406 [2] NCCL INFO Channel 07 : 2[4090] -> 3[40a0] via P2P/IPC/read
llm-test-cluster-9:1281348:1281409 [7] NCCL INFO Channel 19 : 7[40e0] -> 0[4070] via P2P/IPC/read
llm-test-cluster-9:1281346:1281404 [5] NCCL INFO Channel 18 : 5[40c0] -> 6[40d0] via P2P/IPC/read
llm-test-cluster-9:1281345:1281410 [4] NCCL INFO Channel 06 : 4[40b0] -> 5[40c0] via P2P/IPC/read
llm-test-cluster-9:1281342:1281408 [1] NCCL INFO Channel 18 : 1[4080] -> 2[4090] via P2P/IPC/read
llm-test-cluster-9:1281341:1281403 [0] NCCL INFO Channel 10 : 0[4070] -> 1[4080] via P2P/IPC/read
llm-test-cluster-9:1281347:1281407 [6] NCCL INFO Channel 13 : 6[40d0] -> 7[40e0] via P2P/IPC/read
llm-test-cluster-9:1281343:1281406 [2] NCCL INFO Channel 08 : 2[4090] -> 3[40a0] via P2P/IPC/read
llm-test-cluster-9:1281348:1281409 [7] NCCL INFO Channel 20 : 7[40e0] -> 0[4070] via P2P/IPC/read
llm-test-cluster-9:1281346:1281404 [5] NCCL INFO Channel 19 : 5[40c0] -> 6[40d0] via P2P/IPC/read
llm-test-cluster-9:1281345:1281410 [4] NCCL INFO Channel 07 : 4[40b0] -> 5[40c0] via P2P/IPC/read
llm-test-cluster-9:1281342:1281408 [1] NCCL INFO Channel 19 : 1[4080] -> 2[4090] via P2P/IPC/read
llm-test-cluster-9:1281347:1281407 [6] NCCL INFO Channel 14 : 6[40d0] -> 7[40e0] via P2P/IPC/read
llm-test-cluster-9:1281341:1281403 [0] NCCL INFO Channel 11 : 0[4070] -> 1[4080] via P2P/IPC/read
llm-test-cluster-9:1281343:1281406 [2] NCCL INFO Channel 09 : 2[4090] -> 3[40a0] via P2P/IPC/read
llm-test-cluster-9:1281348:1281409 [7] NCCL INFO Channel 21 : 7[40e0] -> 0[4070] via P2P/IPC/read
llm-test-cluster-9:1281346:1281404 [5] NCCL INFO Channel 20 : 5[40c0] -> 6[40d0] via P2P/IPC/read
llm-test-cluster-9:1281345:1281410 [4] NCCL INFO Channel 08 : 4[40b0] -> 5[40c0] via P2P/IPC/read
llm-test-cluster-9:1281342:1281408 [1] NCCL INFO Channel 20 : 1[4080] -> 2[4090] via P2P/IPC/read
llm-test-cluster-9:1281347:1281407 [6] NCCL INFO Channel 15 : 6[40d0] -> 7[40e0] via P2P/IPC/read
llm-test-cluster-9:1281341:1281403 [0] NCCL INFO Channel 12 : 0[4070] -> 1[4080] via P2P/IPC/read
llm-test-cluster-9:1281343:1281406 [2] NCCL INFO Channel 10 : 2[4090] -> 3[40a0] via P2P/IPC/read
llm-test-cluster-9:1281348:1281409 [7] NCCL INFO Channel 22 : 7[40e0] -> 0[4070] via P2P/IPC/read
llm-test-cluster-9:1281346:1281404 [5] NCCL INFO Channel 21 : 5[40c0] -> 6[40d0] via P2P/IPC/read
llm-test-cluster-9:1281345:1281410 [4] NCCL INFO Channel 09 : 4[40b0] -> 5[40c0] via P2P/IPC/read
llm-test-cluster-9:1281342:1281408 [1] NCCL INFO Channel 21 : 1[4080] -> 2[4090] via P2P/IPC/read
llm-test-cluster-9:1281347:1281407 [6] NCCL INFO Channel 16 : 6[40d0] -> 7[40e0] via P2P/IPC/read
llm-test-cluster-9:1281341:1281403 [0] NCCL INFO Channel 13 : 0[4070] -> 1[4080] via P2P/IPC/read
llm-test-cluster-9:1281343:1281406 [2] NCCL INFO Channel 11 : 2[4090] -> 3[40a0] via P2P/IPC/read
llm-test-cluster-9:1281348:1281409 [7] NCCL INFO Channel 23 : 7[40e0] -> 0[4070] via P2P/IPC/read
llm-test-cluster-9:1281346:1281404 [5] NCCL INFO Channel 22 : 5[40c0] -> 6[40d0] via P2P/IPC/read
llm-test-cluster-9:1281345:1281410 [4] NCCL INFO Channel 10 : 4[40b0] -> 5[40c0] via P2P/IPC/read
llm-test-cluster-9:1281342:1281408 [1] NCCL INFO Channel 22 : 1[4080] -> 2[4090] via P2P/IPC/read
llm-test-cluster-9:1281347:1281407 [6] NCCL INFO Channel 17 : 6[40d0] -> 7[40e0] via P2P/IPC/read
llm-test-cluster-9:1281341:1281403 [0] NCCL INFO Channel 14 : 0[4070] -> 1[4080] via P2P/IPC/read
llm-test-cluster-9:1281343:1281406 [2] NCCL INFO Channel 12 : 2[4090] -> 3[40a0] via P2P/IPC/read
llm-test-cluster-9:1281346:1281404 [5] NCCL INFO Channel 23 : 5[40c0] -> 6[40d0] via P2P/IPC/read
llm-test-cluster-9:1281345:1281410 [4] NCCL INFO Channel 11 : 4[40b0] -> 5[40c0] via P2P/IPC/read
llm-test-cluster-9:1281342:1281408 [1] NCCL INFO Channel 23 : 1[4080] -> 2[4090] via P2P/IPC/read
llm-test-cluster-9:1281347:1281407 [6] NCCL INFO Channel 18 : 6[40d0] -> 7[40e0] via P2P/IPC/read
llm-test-cluster-9:1281341:1281403 [0] NCCL INFO Channel 15 : 0[4070] -> 1[4080] via P2P/IPC/read
llm-test-cluster-9:1281343:1281406 [2] NCCL INFO Channel 13 : 2[4090] -> 3[40a0] via P2P/IPC/read
llm-test-cluster-9:1281345:1281410 [4] NCCL INFO Channel 12 : 4[40b0] -> 5[40c0] via P2P/IPC/read
llm-test-cluster-9:1281347:1281407 [6] NCCL INFO Channel 19 : 6[40d0] -> 7[40e0] via P2P/IPC/read
llm-test-cluster-9:1281343:1281406 [2] NCCL INFO Channel 14 : 2[4090] -> 3[40a0] via P2P/IPC/read
llm-test-cluster-9:1281341:1281403 [0] NCCL INFO Channel 16 : 0[4070] -> 1[4080] via P2P/IPC/read
llm-test-cluster-9:1281345:1281410 [4] NCCL INFO Channel 13 : 4[40b0] -> 5[40c0] via P2P/IPC/read
llm-test-cluster-9:1281343:1281406 [2] NCCL INFO Channel 15 : 2[4090] -> 3[40a0] via P2P/IPC/read
llm-test-cluster-9:1281347:1281407 [6] NCCL INFO Channel 20 : 6[40d0] -> 7[40e0] via P2P/IPC/read
llm-test-cluster-9:1281341:1281403 [0] NCCL INFO Channel 17 : 0[4070] -> 1[4080] via P2P/IPC/read
llm-test-cluster-9:1281345:1281410 [4] NCCL INFO Channel 14 : 4[40b0] -> 5[40c0] via P2P/IPC/read
llm-test-cluster-9:1281343:1281406 [2] NCCL INFO Channel 16 : 2[4090] -> 3[40a0] via P2P/IPC/read
llm-test-cluster-9:1281347:1281407 [6] NCCL INFO Channel 21 : 6[40d0] -> 7[40e0] via P2P/IPC/read
llm-test-cluster-9:1281341:1281403 [0] NCCL INFO Channel 18 : 0[4070] -> 1[4080] via P2P/IPC/read
llm-test-cluster-9:1281345:1281410 [4] NCCL INFO Channel 15 : 4[40b0] -> 5[40c0] via P2P/IPC/read
llm-test-cluster-9:1281343:1281406 [2] NCCL INFO Channel 17 : 2[4090] -> 3[40a0] via P2P/IPC/read
llm-test-cluster-9:1281347:1281407 [6] NCCL INFO Channel 22 : 6[40d0] -> 7[40e0] via P2P/IPC/read
llm-test-cluster-9:1281341:1281403 [0] NCCL INFO Channel 19 : 0[4070] -> 1[4080] via P2P/IPC/read
llm-test-cluster-9:1281345:1281410 [4] NCCL INFO Channel 16 : 4[40b0] -> 5[40c0] via P2P/IPC/read
llm-test-cluster-9:1281343:1281406 [2] NCCL INFO Channel 18 : 2[4090] -> 3[40a0] via P2P/IPC/read
llm-test-cluster-9:1281347:1281407 [6] NCCL INFO Channel 23 : 6[40d0] -> 7[40e0] via P2P/IPC/read
llm-test-cluster-9:1281341:1281403 [0] NCCL INFO Channel 20 : 0[4070] -> 1[4080] via P2P/IPC/read
llm-test-cluster-9:1281345:1281410 [4] NCCL INFO Channel 17 : 4[40b0] -> 5[40c0] via P2P/IPC/read
llm-test-cluster-9:1281343:1281406 [2] NCCL INFO Channel 19 : 2[4090] -> 3[40a0] via P2P/IPC/read
llm-test-cluster-9:1281341:1281403 [0] NCCL INFO Channel 21 : 0[4070] -> 1[4080] via P2P/IPC/read
llm-test-cluster-9:1281345:1281410 [4] NCCL INFO Channel 18 : 4[40b0] -> 5[40c0] via P2P/IPC/read
llm-test-cluster-9:1281343:1281406 [2] NCCL INFO Channel 20 : 2[4090] -> 3[40a0] via P2P/IPC/read
llm-test-cluster-9:1281341:1281403 [0] NCCL INFO Channel 22 : 0[4070] -> 1[4080] via P2P/IPC/read
llm-test-cluster-9:1281345:1281410 [4] NCCL INFO Channel 19 : 4[40b0] -> 5[40c0] via P2P/IPC/read
llm-test-cluster-9:1281343:1281406 [2] NCCL INFO Channel 21 : 2[4090] -> 3[40a0] via P2P/IPC/read
llm-test-cluster-9:1281341:1281403 [0] NCCL INFO Channel 23 : 0[4070] -> 1[4080] via P2P/IPC/read
llm-test-cluster-9:1281343:1281406 [2] NCCL INFO Channel 22 : 2[4090] -> 3[40a0] via P2P/IPC/read
llm-test-cluster-9:1281345:1281410 [4] NCCL INFO Channel 20 : 4[40b0] -> 5[40c0] via P2P/IPC/read
llm-test-cluster-9:1281343:1281406 [2] NCCL INFO Channel 23 : 2[4090] -> 3[40a0] via P2P/IPC/read
llm-test-cluster-9:1281345:1281410 [4] NCCL INFO Channel 21 : 4[40b0] -> 5[40c0] via P2P/IPC/read
llm-test-cluster-9:1281345:1281410 [4] NCCL INFO Channel 22 : 4[40b0] -> 5[40c0] via P2P/IPC/read
llm-test-cluster-9:1281345:1281410 [4] NCCL INFO Channel 23 : 4[40b0] -> 5[40c0] via P2P/IPC/read
llm-test-cluster-9:1281347:1281407 [6] NCCL INFO Connected all rings
llm-test-cluster-9:1281341:1281403 [0] NCCL INFO Connected all rings
llm-test-cluster-9:1281342:1281408 [1] NCCL INFO Connected all rings
llm-test-cluster-9:1281343:1281406 [2] NCCL INFO Connected all rings
llm-test-cluster-9:1281348:1281409 [7] NCCL INFO Connected all rings
llm-test-cluster-9:1281348:1281409 [7] NCCL INFO Channel 00 : 7[40e0] -> 6[40d0] via P2P/IPC/read
llm-test-cluster-9:1281345:1281410 [4] NCCL INFO Connected all rings
llm-test-cluster-9:1281346:1281404 [5] NCCL INFO Connected all rings
llm-test-cluster-9:1281348:1281409 [7] NCCL INFO Channel 01 : 7[40e0] -> 6[40d0] via P2P/IPC/read
llm-test-cluster-9:1281344:1281405 [3] NCCL INFO Connected all rings
llm-test-cluster-9:1281348:1281409 [7] NCCL INFO Channel 02 : 7[40e0] -> 6[40d0] via P2P/IPC/read
llm-test-cluster-9:1281348:1281409 [7] NCCL INFO Channel 03 : 7[40e0] -> 6[40d0] via P2P/IPC/read
llm-test-cluster-9:1281348:1281409 [7] NCCL INFO Channel 04 : 7[40e0] -> 6[40d0] via P2P/IPC/read
llm-test-cluster-9:1281348:1281409 [7] NCCL INFO Channel 05 : 7[40e0] -> 6[40d0] via P2P/IPC/read
llm-test-cluster-9:1281348:1281409 [7] NCCL INFO Channel 06 : 7[40e0] -> 6[40d0] via P2P/IPC/read
llm-test-cluster-9:1281348:1281409 [7] NCCL INFO Channel 07 : 7[40e0] -> 6[40d0] via P2P/IPC/read
llm-test-cluster-9:1281348:1281409 [7] NCCL INFO Channel 08 : 7[40e0] -> 6[40d0] via P2P/IPC/read
llm-test-cluster-9:1281348:1281409 [7] NCCL INFO Channel 09 : 7[40e0] -> 6[40d0] via P2P/IPC/read
llm-test-cluster-9:1281348:1281409 [7] NCCL INFO Channel 10 : 7[40e0] -> 6[40d0] via P2P/IPC/read
llm-test-cluster-9:1281348:1281409 [7] NCCL INFO Channel 11 : 7[40e0] -> 6[40d0] via P2P/IPC/read
llm-test-cluster-9:1281348:1281409 [7] NCCL INFO Channel 12 : 7[40e0] -> 6[40d0] via P2P/IPC/read
llm-test-cluster-9:1281348:1281409 [7] NCCL INFO Channel 13 : 7[40e0] -> 6[40d0] via P2P/IPC/read
llm-test-cluster-9:1281348:1281409 [7] NCCL INFO Channel 14 : 7[40e0] -> 6[40d0] via P2P/IPC/read
llm-test-cluster-9:1281348:1281409 [7] NCCL INFO Channel 15 : 7[40e0] -> 6[40d0] via P2P/IPC/read
llm-test-cluster-9:1281348:1281409 [7] NCCL INFO Channel 16 : 7[40e0] -> 6[40d0] via P2P/IPC/read
llm-test-cluster-9:1281348:1281409 [7] NCCL INFO Channel 17 : 7[40e0] -> 6[40d0] via P2P/IPC/read
llm-test-cluster-9:1281348:1281409 [7] NCCL INFO Channel 18 : 7[40e0] -> 6[40d0] via P2P/IPC/read
llm-test-cluster-9:1281347:1281407 [6] NCCL INFO Channel 00 : 6[40d0] -> 5[40c0] via P2P/IPC/read
llm-test-cluster-9:1281348:1281409 [7] NCCL INFO Channel 19 : 7[40e0] -> 6[40d0] via P2P/IPC/read
llm-test-cluster-9:1281347:1281407 [6] NCCL INFO Channel 01 : 6[40d0] -> 5[40c0] via P2P/IPC/read
llm-test-cluster-9:1281348:1281409 [7] NCCL INFO Channel 20 : 7[40e0] -> 6[40d0] via P2P/IPC/read
llm-test-cluster-9:1281347:1281407 [6] NCCL INFO Channel 02 : 6[40d0] -> 5[40c0] via P2P/IPC/read
llm-test-cluster-9:1281348:1281409 [7] NCCL INFO Channel 21 : 7[40e0] -> 6[40d0] via P2P/IPC/read
llm-test-cluster-9:1281347:1281407 [6] NCCL INFO Channel 03 : 6[40d0] -> 5[40c0] via P2P/IPC/read
llm-test-cluster-9:1281348:1281409 [7] NCCL INFO Channel 22 : 7[40e0] -> 6[40d0] via P2P/IPC/read
llm-test-cluster-9:1281347:1281407 [6] NCCL INFO Channel 04 : 6[40d0] -> 5[40c0] via P2P/IPC/read
llm-test-cluster-9:1281348:1281409 [7] NCCL INFO Channel 23 : 7[40e0] -> 6[40d0] via P2P/IPC/read
llm-test-cluster-9:1281342:1281408 [1] NCCL INFO Channel 00 : 1[4080] -> 0[4070] via P2P/IPC/read
llm-test-cluster-9:1281343:1281406 [2] NCCL INFO Channel 00 : 2[4090] -> 1[4080] via P2P/IPC/read
llm-test-cluster-9:1281347:1281407 [6] NCCL INFO Channel 05 : 6[40d0] -> 5[40c0] via P2P/IPC/read
llm-test-cluster-9:1281342:1281408 [1] NCCL INFO Channel 01 : 1[4080] -> 0[4070] via P2P/IPC/read
llm-test-cluster-9:1281343:1281406 [2] NCCL INFO Channel 01 : 2[4090] -> 1[4080] via P2P/IPC/read
llm-test-cluster-9:1281347:1281407 [6] NCCL INFO Channel 06 : 6[40d0] -> 5[40c0] via P2P/IPC/read
llm-test-cluster-9:1281342:1281408 [1] NCCL INFO Channel 02 : 1[4080] -> 0[4070] via P2P/IPC/read
llm-test-cluster-9:1281345:1281410 [4] NCCL INFO Channel 00 : 4[40b0] -> 3[40a0] via P2P/IPC/read
llm-test-cluster-9:1281343:1281406 [2] NCCL INFO Channel 02 : 2[4090] -> 1[4080] via P2P/IPC/read
llm-test-cluster-9:1281347:1281407 [6] NCCL INFO Channel 07 : 6[40d0] -> 5[40c0] via P2P/IPC/read
llm-test-cluster-9:1281344:1281405 [3] NCCL INFO Channel 00 : 3[40a0] -> 2[4090] via P2P/IPC/read
llm-test-cluster-9:1281342:1281408 [1] NCCL INFO Channel 03 : 1[4080] -> 0[4070] via P2P/IPC/read
llm-test-cluster-9:1281343:1281406 [2] NCCL INFO Channel 03 : 2[4090] -> 1[4080] via P2P/IPC/read
llm-test-cluster-9:1281345:1281410 [4] NCCL INFO Channel 01 : 4[40b0] -> 3[40a0] via P2P/IPC/read
llm-test-cluster-9:1281347:1281407 [6] NCCL INFO Channel 08 : 6[40d0] -> 5[40c0] via P2P/IPC/read
llm-test-cluster-9:1281342:1281408 [1] NCCL INFO Channel 04 : 1[4080] -> 0[4070] via P2P/IPC/read
llm-test-cluster-9:1281344:1281405 [3] NCCL INFO Channel 01 : 3[40a0] -> 2[4090] via P2P/IPC/read
llm-test-cluster-9:1281343:1281406 [2] NCCL INFO Channel 04 : 2[4090] -> 1[4080] via P2P/IPC/read
llm-test-cluster-9:1281345:1281410 [4] NCCL INFO Channel 02 : 4[40b0] -> 3[40a0] via P2P/IPC/read
llm-test-cluster-9:1281347:1281407 [6] NCCL INFO Channel 09 : 6[40d0] -> 5[40c0] via P2P/IPC/read
llm-test-cluster-9:1281342:1281408 [1] NCCL INFO Channel 05 : 1[4080] -> 0[4070] via P2P/IPC/read
llm-test-cluster-9:1281344:1281405 [3] NCCL INFO Channel 02 : 3[40a0] -> 2[4090] via P2P/IPC/read
llm-test-cluster-9:1281346:1281404 [5] NCCL INFO Channel 00 : 5[40c0] -> 4[40b0] via P2P/IPC/read
llm-test-cluster-9:1281345:1281410 [4] NCCL INFO Channel 03 : 4[40b0] -> 3[40a0] via P2P/IPC/read
llm-test-cluster-9:1281343:1281406 [2] NCCL INFO Channel 05 : 2[4090] -> 1[4080] via P2P/IPC/read
llm-test-cluster-9:1281347:1281407 [6] NCCL INFO Channel 10 : 6[40d0] -> 5[40c0] via P2P/IPC/read
llm-test-cluster-9:1281342:1281408 [1] NCCL INFO Channel 06 : 1[4080] -> 0[4070] via P2P/IPC/read
llm-test-cluster-9:1281344:1281405 [3] NCCL INFO Channel 03 : 3[40a0] -> 2[4090] via P2P/IPC/read
llm-test-cluster-9:1281346:1281404 [5] NCCL INFO Channel 01 : 5[40c0] -> 4[40b0] via P2P/IPC/read
llm-test-cluster-9:1281343:1281406 [2] NCCL INFO Channel 06 : 2[4090] -> 1[4080] via P2P/IPC/read
llm-test-cluster-9:1281345:1281410 [4] NCCL INFO Channel 04 : 4[40b0] -> 3[40a0] via P2P/IPC/read
llm-test-cluster-9:1281347:1281407 [6] NCCL INFO Channel 11 : 6[40d0] -> 5[40c0] via P2P/IPC/read
llm-test-cluster-9:1281342:1281408 [1] NCCL INFO Channel 07 : 1[4080] -> 0[4070] via P2P/IPC/read
llm-test-cluster-9:1281344:1281405 [3] NCCL INFO Channel 04 : 3[40a0] -> 2[4090] via P2P/IPC/read
llm-test-cluster-9:1281346:1281404 [5] NCCL INFO Channel 02 : 5[40c0] -> 4[40b0] via P2P/IPC/read
llm-test-cluster-9:1281343:1281406 [2] NCCL INFO Channel 07 : 2[4090] -> 1[4080] via P2P/IPC/read
llm-test-cluster-9:1281345:1281410 [4] NCCL INFO Channel 05 : 4[40b0] -> 3[40a0] via P2P/IPC/read
llm-test-cluster-9:1281347:1281407 [6] NCCL INFO Channel 12 : 6[40d0] -> 5[40c0] via P2P/IPC/read
llm-test-cluster-9:1281342:1281408 [1] NCCL INFO Channel 08 : 1[4080] -> 0[4070] via P2P/IPC/read
llm-test-cluster-9:1281344:1281405 [3] NCCL INFO Channel 05 : 3[40a0] -> 2[4090] via P2P/IPC/read
llm-test-cluster-9:1281346:1281404 [5] NCCL INFO Channel 03 : 5[40c0] -> 4[40b0] via P2P/IPC/read
llm-test-cluster-9:1281345:1281410 [4] NCCL INFO Channel 06 : 4[40b0] -> 3[40a0] via P2P/IPC/read
llm-test-cluster-9:1281343:1281406 [2] NCCL INFO Channel 08 : 2[4090] -> 1[4080] via P2P/IPC/read
llm-test-cluster-9:1281347:1281407 [6] NCCL INFO Channel 13 : 6[40d0] -> 5[40c0] via P2P/IPC/read
llm-test-cluster-9:1281342:1281408 [1] NCCL INFO Channel 09 : 1[4080] -> 0[4070] via P2P/IPC/read
llm-test-cluster-9:1281344:1281405 [3] NCCL INFO Channel 06 : 3[40a0] -> 2[4090] via P2P/IPC/read
llm-test-cluster-9:1281346:1281404 [5] NCCL INFO Channel 04 : 5[40c0] -> 4[40b0] via P2P/IPC/read
llm-test-cluster-9:1281345:1281410 [4] NCCL INFO Channel 07 : 4[40b0] -> 3[40a0] via P2P/IPC/read
llm-test-cluster-9:1281343:1281406 [2] NCCL INFO Channel 09 : 2[4090] -> 1[4080] via P2P/IPC/read
llm-test-cluster-9:1281347:1281407 [6] NCCL INFO Channel 14 : 6[40d0] -> 5[40c0] via P2P/IPC/read
llm-test-cluster-9:1281342:1281408 [1] NCCL INFO Channel 10 : 1[4080] -> 0[4070] via P2P/IPC/read
llm-test-cluster-9:1281344:1281405 [3] NCCL INFO Channel 07 : 3[40a0] -> 2[4090] via P2P/IPC/read
llm-test-cluster-9:1281346:1281404 [5] NCCL INFO Channel 05 : 5[40c0] -> 4[40b0] via P2P/IPC/read
llm-test-cluster-9:1281345:1281410 [4] NCCL INFO Channel 08 : 4[40b0] -> 3[40a0] via P2P/IPC/read
llm-test-cluster-9:1281347:1281407 [6] NCCL INFO Channel 15 : 6[40d0] -> 5[40c0] via P2P/IPC/read
llm-test-cluster-9:1281343:1281406 [2] NCCL INFO Channel 10 : 2[4090] -> 1[4080] via P2P/IPC/read
llm-test-cluster-9:1281342:1281408 [1] NCCL INFO Channel 11 : 1[4080] -> 0[4070] via P2P/IPC/read
llm-test-cluster-9:1281344:1281405 [3] NCCL INFO Channel 08 : 3[40a0] -> 2[4090] via P2P/IPC/read
llm-test-cluster-9:1281346:1281404 [5] NCCL INFO Channel 06 : 5[40c0] -> 4[40b0] via P2P/IPC/read
llm-test-cluster-9:1281347:1281407 [6] NCCL INFO Channel 16 : 6[40d0] -> 5[40c0] via P2P/IPC/read
llm-test-cluster-9:1281345:1281410 [4] NCCL INFO Channel 09 : 4[40b0] -> 3[40a0] via P2P/IPC/read
llm-test-cluster-9:1281342:1281408 [1] NCCL INFO Channel 12 : 1[4080] -> 0[4070] via P2P/IPC/read
llm-test-cluster-9:1281343:1281406 [2] NCCL INFO Channel 11 : 2[4090] -> 1[4080] via P2P/IPC/read
llm-test-cluster-9:1281344:1281405 [3] NCCL INFO Channel 09 : 3[40a0] -> 2[4090] via P2P/IPC/read
llm-test-cluster-9:1281346:1281404 [5] NCCL INFO Channel 07 : 5[40c0] -> 4[40b0] via P2P/IPC/read
llm-test-cluster-9:1281347:1281407 [6] NCCL INFO Channel 17 : 6[40d0] -> 5[40c0] via P2P/IPC/read
llm-test-cluster-9:1281342:1281408 [1] NCCL INFO Channel 13 : 1[4080] -> 0[4070] via P2P/IPC/read
llm-test-cluster-9:1281345:1281410 [4] NCCL INFO Channel 10 : 4[40b0] -> 3[40a0] via P2P/IPC/read
llm-test-cluster-9:1281343:1281406 [2] NCCL INFO Channel 12 : 2[4090] -> 1[4080] via P2P/IPC/read
llm-test-cluster-9:1281344:1281405 [3] NCCL INFO Channel 10 : 3[40a0] -> 2[4090] via P2P/IPC/read
llm-test-cluster-9:1281346:1281404 [5] NCCL INFO Channel 08 : 5[40c0] -> 4[40b0] via P2P/IPC/read
llm-test-cluster-9:1281342:1281408 [1] NCCL INFO Channel 14 : 1[4080] -> 0[4070] via P2P/IPC/read
llm-test-cluster-9:1281347:1281407 [6] NCCL INFO Channel 18 : 6[40d0] -> 5[40c0] via P2P/IPC/read
llm-test-cluster-9:1281343:1281406 [2] NCCL INFO Channel 13 : 2[4090] -> 1[4080] via P2P/IPC/read
llm-test-cluster-9:1281345:1281410 [4] NCCL INFO Channel 11 : 4[40b0] -> 3[40a0] via P2P/IPC/read
llm-test-cluster-9:1281344:1281405 [3] NCCL INFO Channel 11 : 3[40a0] -> 2[4090] via P2P/IPC/read
llm-test-cluster-9:1281346:1281404 [5] NCCL INFO Channel 09 : 5[40c0] -> 4[40b0] via P2P/IPC/read
llm-test-cluster-9:1281347:1281407 [6] NCCL INFO Channel 19 : 6[40d0] -> 5[40c0] via P2P/IPC/read
llm-test-cluster-9:1281342:1281408 [1] NCCL INFO Channel 15 : 1[4080] -> 0[4070] via P2P/IPC/read
llm-test-cluster-9:1281343:1281406 [2] NCCL INFO Channel 14 : 2[4090] -> 1[4080] via P2P/IPC/read
llm-test-cluster-9:1281345:1281410 [4] NCCL INFO Channel 12 : 4[40b0] -> 3[40a0] via P2P/IPC/read
llm-test-cluster-9:1281344:1281405 [3] NCCL INFO Channel 12 : 3[40a0] -> 2[4090] via P2P/IPC/read
llm-test-cluster-9:1281346:1281404 [5] NCCL INFO Channel 10 : 5[40c0] -> 4[40b0] via P2P/IPC/read
llm-test-cluster-9:1281342:1281408 [1] NCCL INFO Channel 16 : 1[4080] -> 0[4070] via P2P/IPC/read
llm-test-cluster-9:1281347:1281407 [6] NCCL INFO Channel 20 : 6[40d0] -> 5[40c0] via P2P/IPC/read
llm-test-cluster-9:1281343:1281406 [2] NCCL INFO Channel 15 : 2[4090] -> 1[4080] via P2P/IPC/read
llm-test-cluster-9:1281345:1281410 [4] NCCL INFO Channel 13 : 4[40b0] -> 3[40a0] via P2P/IPC/read
llm-test-cluster-9:1281344:1281405 [3] NCCL INFO Channel 13 : 3[40a0] -> 2[4090] via P2P/IPC/read
llm-test-cluster-9:1281346:1281404 [5] NCCL INFO Channel 11 : 5[40c0] -> 4[40b0] via P2P/IPC/read
llm-test-cluster-9:1281342:1281408 [1] NCCL INFO Channel 17 : 1[4080] -> 0[4070] via P2P/IPC/read
llm-test-cluster-9:1281347:1281407 [6] NCCL INFO Channel 21 : 6[40d0] -> 5[40c0] via P2P/IPC/read
llm-test-cluster-9:1281343:1281406 [2] NCCL INFO Channel 16 : 2[4090] -> 1[4080] via P2P/IPC/read
llm-test-cluster-9:1281345:1281410 [4] NCCL INFO Channel 14 : 4[40b0] -> 3[40a0] via P2P/IPC/read
llm-test-cluster-9:1281344:1281405 [3] NCCL INFO Channel 14 : 3[40a0] -> 2[4090] via P2P/IPC/read
llm-test-cluster-9:1281346:1281404 [5] NCCL INFO Channel 12 : 5[40c0] -> 4[40b0] via P2P/IPC/read
llm-test-cluster-9:1281342:1281408 [1] NCCL INFO Channel 18 : 1[4080] -> 0[4070] via P2P/IPC/read
llm-test-cluster-9:1281347:1281407 [6] NCCL INFO Channel 22 : 6[40d0] -> 5[40c0] via P2P/IPC/read
llm-test-cluster-9:1281343:1281406 [2] NCCL INFO Channel 17 : 2[4090] -> 1[4080] via P2P/IPC/read
llm-test-cluster-9:1281345:1281410 [4] NCCL INFO Channel 15 : 4[40b0] -> 3[40a0] via P2P/IPC/read
llm-test-cluster-9:1281344:1281405 [3] NCCL INFO Channel 15 : 3[40a0] -> 2[4090] via P2P/IPC/read
llm-test-cluster-9:1281346:1281404 [5] NCCL INFO Channel 13 : 5[40c0] -> 4[40b0] via P2P/IPC/read
llm-test-cluster-9:1281342:1281408 [1] NCCL INFO Channel 19 : 1[4080] -> 0[4070] via P2P/IPC/read
llm-test-cluster-9:1281347:1281407 [6] NCCL INFO Channel 23 : 6[40d0] -> 5[40c0] via P2P/IPC/read
llm-test-cluster-9:1281343:1281406 [2] NCCL INFO Channel 18 : 2[4090] -> 1[4080] via P2P/IPC/read
llm-test-cluster-9:1281344:1281405 [3] NCCL INFO Channel 16 : 3[40a0] -> 2[4090] via P2P/IPC/read
llm-test-cluster-9:1281345:1281410 [4] NCCL INFO Channel 16 : 4[40b0] -> 3[40a0] via P2P/IPC/read
llm-test-cluster-9:1281346:1281404 [5] NCCL INFO Channel 14 : 5[40c0] -> 4[40b0] via P2P/IPC/read
llm-test-cluster-9:1281342:1281408 [1] NCCL INFO Channel 20 : 1[4080] -> 0[4070] via P2P/IPC/read
llm-test-cluster-9:1281343:1281406 [2] NCCL INFO Channel 19 : 2[4090] -> 1[4080] via P2P/IPC/read
llm-test-cluster-9:1281344:1281405 [3] NCCL INFO Channel 17 : 3[40a0] -> 2[4090] via P2P/IPC/read
llm-test-cluster-9:1281345:1281410 [4] NCCL INFO Channel 17 : 4[40b0] -> 3[40a0] via P2P/IPC/read
llm-test-cluster-9:1281346:1281404 [5] NCCL INFO Channel 15 : 5[40c0] -> 4[40b0] via P2P/IPC/read
llm-test-cluster-9:1281342:1281408 [1] NCCL INFO Channel 21 : 1[4080] -> 0[4070] via P2P/IPC/read
llm-test-cluster-9:1281343:1281406 [2] NCCL INFO Channel 20 : 2[4090] -> 1[4080] via P2P/IPC/read
llm-test-cluster-9:1281344:1281405 [3] NCCL INFO Channel 18 : 3[40a0] -> 2[4090] via P2P/IPC/read
llm-test-cluster-9:1281345:1281410 [4] NCCL INFO Channel 18 : 4[40b0] -> 3[40a0] via P2P/IPC/read
llm-test-cluster-9:1281346:1281404 [5] NCCL INFO Channel 16 : 5[40c0] -> 4[40b0] via P2P/IPC/read
llm-test-cluster-9:1281342:1281408 [1] NCCL INFO Channel 22 : 1[4080] -> 0[4070] via P2P/IPC/read
llm-test-cluster-9:1281343:1281406 [2] NCCL INFO Channel 21 : 2[4090] -> 1[4080] via P2P/IPC/read
llm-test-cluster-9:1281344:1281405 [3] NCCL INFO Channel 19 : 3[40a0] -> 2[4090] via P2P/IPC/read
llm-test-cluster-9:1281345:1281410 [4] NCCL INFO Channel 19 : 4[40b0] -> 3[40a0] via P2P/IPC/read
llm-test-cluster-9:1281346:1281404 [5] NCCL INFO Channel 17 : 5[40c0] -> 4[40b0] via P2P/IPC/read
llm-test-cluster-9:1281342:1281408 [1] NCCL INFO Channel 23 : 1[4080] -> 0[4070] via P2P/IPC/read
llm-test-cluster-9:1281343:1281406 [2] NCCL INFO Channel 22 : 2[4090] -> 1[4080] via P2P/IPC/read
llm-test-cluster-9:1281344:1281405 [3] NCCL INFO Channel 20 : 3[40a0] -> 2[4090] via P2P/IPC/read
llm-test-cluster-9:1281346:1281404 [5] NCCL INFO Channel 18 : 5[40c0] -> 4[40b0] via P2P/IPC/read
llm-test-cluster-9:1281345:1281410 [4] NCCL INFO Channel 20 : 4[40b0] -> 3[40a0] via P2P/IPC/read
llm-test-cluster-9:1281343:1281406 [2] NCCL INFO Channel 23 : 2[4090] -> 1[4080] via P2P/IPC/read
llm-test-cluster-9:1281344:1281405 [3] NCCL INFO Channel 21 : 3[40a0] -> 2[4090] via P2P/IPC/read
llm-test-cluster-9:1281346:1281404 [5] NCCL INFO Channel 19 : 5[40c0] -> 4[40b0] via P2P/IPC/read
llm-test-cluster-9:1281345:1281410 [4] NCCL INFO Channel 21 : 4[40b0] -> 3[40a0] via P2P/IPC/read
llm-test-cluster-9:1281344:1281405 [3] NCCL INFO Channel 22 : 3[40a0] -> 2[4090] via P2P/IPC/read
llm-test-cluster-9:1281346:1281404 [5] NCCL INFO Channel 20 : 5[40c0] -> 4[40b0] via P2P/IPC/read
llm-test-cluster-9:1281345:1281410 [4] NCCL INFO Channel 22 : 4[40b0] -> 3[40a0] via P2P/IPC/read
llm-test-cluster-9:1281344:1281405 [3] NCCL INFO Channel 23 : 3[40a0] -> 2[4090] via P2P/IPC/read
llm-test-cluster-9:1281346:1281404 [5] NCCL INFO Channel 21 : 5[40c0] -> 4[40b0] via P2P/IPC/read
llm-test-cluster-9:1281345:1281410 [4] NCCL INFO Channel 23 : 4[40b0] -> 3[40a0] via P2P/IPC/read
llm-test-cluster-9:1281346:1281404 [5] NCCL INFO Channel 22 : 5[40c0] -> 4[40b0] via P2P/IPC/read
llm-test-cluster-9:1281346:1281404 [5] NCCL INFO Channel 23 : 5[40c0] -> 4[40b0] via P2P/IPC/read
llm-test-cluster-9:1281348:1281409 [7] NCCL INFO Connected all trees
llm-test-cluster-9:1281348:1281409 [7] NCCL INFO threadThresholds 8/8/64 | 64/8/64 | 8/8/512
llm-test-cluster-9:1281348:1281409 [7] NCCL INFO 24 coll channels, 32 p2p channels, 32 p2p channels per peer
llm-test-cluster-9:1281341:1281403 [0] NCCL INFO Connected all trees
llm-test-cluster-9:1281341:1281403 [0] NCCL INFO threadThresholds 8/8/64 | 64/8/64 | 8/8/512
llm-test-cluster-9:1281341:1281403 [0] NCCL INFO 24 coll channels, 32 p2p channels, 32 p2p channels per peer
llm-test-cluster-9:1281342:1281408 [1] NCCL INFO Connected all trees
llm-test-cluster-9:1281342:1281408 [1] NCCL INFO threadThresholds 8/8/64 | 64/8/64 | 8/8/512
llm-test-cluster-9:1281343:1281406 [2] NCCL INFO Connected all trees
llm-test-cluster-9:1281343:1281406 [2] NCCL INFO threadThresholds 8/8/64 | 64/8/64 | 8/8/512
llm-test-cluster-9:1281344:1281405 [3] NCCL INFO Connected all trees
llm-test-cluster-9:1281344:1281405 [3] NCCL INFO threadThresholds 8/8/64 | 64/8/64 | 8/8/512
llm-test-cluster-9:1281342:1281408 [1] NCCL INFO 24 coll channels, 32 p2p channels, 32 p2p channels per peer
llm-test-cluster-9:1281346:1281404 [5] NCCL INFO Connected all trees
llm-test-cluster-9:1281346:1281404 [5] NCCL INFO threadThresholds 8/8/64 | 64/8/64 | 8/8/512
llm-test-cluster-9:1281347:1281407 [6] NCCL INFO Connected all trees
llm-test-cluster-9:1281347:1281407 [6] NCCL INFO threadThresholds 8/8/64 | 64/8/64 | 8/8/512
llm-test-cluster-9:1281345:1281410 [4] NCCL INFO Connected all trees
llm-test-cluster-9:1281345:1281410 [4] NCCL INFO threadThresholds 8/8/64 | 64/8/64 | 8/8/512
llm-test-cluster-9:1281343:1281406 [2] NCCL INFO 24 coll channels, 32 p2p channels, 32 p2p channels per peer
llm-test-cluster-9:1281344:1281405 [3] NCCL INFO 24 coll channels, 32 p2p channels, 32 p2p channels per peer
llm-test-cluster-9:1281345:1281410 [4] NCCL INFO 24 coll channels, 32 p2p channels, 32 p2p channels per peer
llm-test-cluster-9:1281347:1281407 [6] NCCL INFO 24 coll channels, 32 p2p channels, 32 p2p channels per peer
llm-test-cluster-9:1281346:1281404 [5] NCCL INFO 24 coll channels, 32 p2p channels, 32 p2p channels per peer
llm-test-cluster-9:1281342:1281408 [1] NCCL INFO comm 0x7f6890002fb0 rank 1 nranks 8 cudaDev 1 busId 4080 - Init COMPLETE
llm-test-cluster-9:1281345:1281410 [4] NCCL INFO comm 0x7fbcc4002fb0 rank 4 nranks 8 cudaDev 4 busId 40b0 - Init COMPLETE
llm-test-cluster-9:1281343:1281406 [2] NCCL INFO comm 0x7f0b9c002fb0 rank 2 nranks 8 cudaDev 2 busId 4090 - Init COMPLETE
llm-test-cluster-9:1281347:1281407 [6] NCCL INFO comm 0x7f09a0002fb0 rank 6 nranks 8 cudaDev 6 busId 40d0 - Init COMPLETE
llm-test-cluster-9:1281344:1281405 [3] NCCL INFO comm 0x7f61d0002fb0 rank 3 nranks 8 cudaDev 3 busId 40a0 - Init COMPLETE
llm-test-cluster-9:1281341:1281403 [0] NCCL INFO comm 0x7fbd04002fb0 rank 0 nranks 8 cudaDev 0 busId 4070 - Init COMPLETE
llm-test-cluster-9:1281341:1281341 [0] NCCL INFO Launch mode Parallel
llm-test-cluster-9:1281346:1281404 [5] NCCL INFO comm 0x7f03dc002fb0 rank 5 nranks 8 cudaDev 5 busId 40c0 - Init COMPLETE
llm-test-cluster-9:1281348:1281409 [7] NCCL INFO comm 0x7f1000002fb0 rank 7 nranks 8 cudaDev 7 busId 40e0 - Init COMPLETE
[2022-07-26 11:41:29,495] [INFO] [utils.py:827:see_memory_usage] pre-ds-inference-init
[2022-07-26 11:41:29,495] [INFO] [utils.py:828:see_memory_usage] MA 0.0 GB         Max_MA 0.0 GB         CA 0.0 GB         Max_CA 0 GB
[2022-07-26 11:41:29,496] [INFO] [utils.py:836:see_memory_usage] CPU Virtual Memory:  used = 19.92 GB, percent = 1.6%
[2022-07-26 11:41:29,496] [INFO] [logging.py:69:log_dist] [Rank 0] DeepSpeed info: version=0.7.0+b6305d0e, git-hash=b6305d0e, git-branch=master
[2022-07-26 11:41:29,496] [INFO] [logging.py:69:log_dist] [Rank 0] quantize_bits = 8 mlp_extra_grouping = False, quantize_groups = 1
Using /net/llm-shared-nfs/nfs/mayank/.cache/torch_extensions/py38_cu116 as PyTorch extensions root...
Using /net/llm-shared-nfs/nfs/mayank/.cache/torch_extensions/py38_cu116 as PyTorch extensions root...
Using /net/llm-shared-nfs/nfs/mayank/.cache/torch_extensions/py38_cu116 as PyTorch extensions root...
Using /net/llm-shared-nfs/nfs/mayank/.cache/torch_extensions/py38_cu116 as PyTorch extensions root...
Using /net/llm-shared-nfs/nfs/mayank/.cache/torch_extensions/py38_cu116 as PyTorch extensions root...
Using /net/llm-shared-nfs/nfs/mayank/.cache/torch_extensions/py38_cu116 as PyTorch extensions root...
Using /net/llm-shared-nfs/nfs/mayank/.cache/torch_extensions/py38_cu116 as PyTorch extensions root...
Using /net/llm-shared-nfs/nfs/mayank/.cache/torch_extensions/py38_cu116 as PyTorch extensions root...
Detected CUDA files, patching ldflags
Emitting ninja build file /net/llm-shared-nfs/nfs/mayank/.cache/torch_extensions/py38_cu116/transformer_inference/build.ninja...
Building extension module transformer_inference...
Allowing ninja to set a default number of workers... (overridable by setting the environment variable MAX_JOBS=N)
ninja: no work to do.
Loading extension module transformer_inference...
Time to load transformer_inference op: 0.25245213508605957 seconds
[2022-07-26 11:41:30,151] [INFO] [logging.py:69:log_dist] [Rank 0] DeepSpeed-Inference config: {'layer_id': 0, 'hidden_size': 14336, 'intermediate_size': 57344, 'heads': 112, 'num_hidden_layers': -1, 'fp16': True, 'pre_layer_norm': True, 'local_rank': -1, 'stochastic_mode': False, 'epsilon': 1e-05, 'mp_size': 8, 'q_int8': False, 'scale_attention': True, 'triangular_masking': True, 'local_attention': False, 'window_size': 1, 'rotary_dim': -1, 'rotate_half': False, 'rotate_every_two': True, 'return_tuple': True, 'mlp_after_attn': True, 'specialized_mode': False, 'training_mp_size': 1, 'bigscience_bloom': True}
Loading extension module transformer_inference...
Time to load transformer_inference op: 0.2497098445892334 seconds
Loading extension module transformer_inference...
Loading extension module transformer_inference...
Time to load transformer_inference op: 0.2436366081237793 seconds
Loading extension module transformer_inference...
Time to load transformer_inference op: 0.24797964096069336 seconds
Time to load transformer_inference op: 0.24489784240722656 seconds
Loading extension module transformer_inference...
Time to load transformer_inference op: 0.2467021942138672 seconds
Loading extension module transformer_inference...
Loading extension module transformer_inference...
Time to load transformer_inference op: 0.24748826026916504 seconds
Time to load transformer_inference op: 0.24941658973693848 seconds
Loading 72 checkpoint shards:   0%|          | 0/72 [11:08<?, ?it/s]
[2022-07-26 11:52:39,789] [INFO] [engine.py:145:__init__] Place model to device: 6
Loading 72 checkpoint shards:   0%|          | 0/72 [11:09<?, ?it/s]
[2022-07-26 11:52:39,989] [INFO] [engine.py:145:__init__] Place model to device: 1
Loading 72 checkpoint shards:   0%|          | 0/72 [11:10<?, ?it/s]
[2022-07-26 11:52:41,127] [INFO] [engine.py:145:__init__] Place model to device: 3
Loading 72 checkpoint shards:   0%|          | 0/72 [11:14<?, ?it/s]
[2022-07-26 11:52:45,432] [INFO] [engine.py:145:__init__] Place model to device: 5
Loading 72 checkpoint shards:   0%|          | 0/72 [11:22<?, ?it/s]
[2022-07-26 11:52:53,353] [INFO] [engine.py:145:__init__] Place model to device: 7
Loading 72 checkpoint shards:   0%|          | 0/72 [11:24<?, ?it/s]
[2022-07-26 11:52:55,107] [INFO] [engine.py:145:__init__] Place model to device: 2
Loading 72 checkpoint shards: 100%|██████████| 72/72 [11:24<00:00,  9.51s/it]
[2022-07-26 11:52:55,582] [INFO] [engine.py:145:__init__] Place model to device: 0
[2022-07-26 11:52:55,707] [INFO] [utils.py:827:see_memory_usage] post-ds-inference-init
[2022-07-26 11:52:55,708] [INFO] [utils.py:828:see_memory_usage] MA 47.04 GB         Max_MA 47.24 GB         CA 47.04 GB         Max_CA 47 GB
[2022-07-26 11:52:55,709] [INFO] [utils.py:836:see_memory_usage] CPU Virtual Memory:  used = 25.77 GB, percent = 2.0%
*** Starting to generate 100 tokens with bs=1
Generate args {'max_new_tokens': 100, 'do_sample': False}
Loading 72 checkpoint shards:   0%|          | 0/72 [11:25<?, ?it/s]
[2022-07-26 11:52:56,613] [INFO] [engine.py:145:__init__] Place model to device: 4
llm-test-cluster-9:1281342:1283501 [1] NCCL INFO Trees [0] 2/-1/-1->1->0 [1] 2/-1/-1->1->0 [2] 2/-1/-1->1->0 [3] 2/-1/-1->1->0 [4] 2/-1/-1->1->0 [5] 2/-1/-1->1->0 [6] 2/-1/-1->1->0 [7] 2/-1/-1->1->0 [8] 2/-1/-1->1->0 [9] 2/-1/-1->1->0 [10] 2/-1/-1->1->0 [11] 2/-1/-1->1->0 [12] 2/-1/-1->1->0 [13] 2/-1/-1->1->0 [14] 2/-1/-1->1->0 [15] 2/-1/-1->1->0 [16] 2/-1/-1->1->0 [17] 2/-1/-1->1->0 [18] 2/-1/-1->1->0 [19] 2/-1/-1->1->0 [20] 2/-1/-1->1->0 [21] 2/-1/-1->1->0 [22] 2/-1/-1->1->0 [23] 2/-1/-1->1->0
llm-test-cluster-9:1281341:1283500 [0] NCCL INFO Channel 00/24 :    0   1   2   3   4   5   6   7
llm-test-cluster-9:1281341:1283500 [0] NCCL INFO Channel 01/24 :    0   1   2   3   4   5   6   7
llm-test-cluster-9:1281341:1283500 [0] NCCL INFO Channel 02/24 :    0   1   2   3   4   5   6   7
llm-test-cluster-9:1281341:1283500 [0] NCCL INFO Channel 03/24 :    0   1   2   3   4   5   6   7
llm-test-cluster-9:1281341:1283500 [0] NCCL INFO Channel 04/24 :    0   1   2   3   4   5   6   7
llm-test-cluster-9:1281341:1283500 [0] NCCL INFO Channel 05/24 :    0   1   2   3   4   5   6   7
llm-test-cluster-9:1281341:1283500 [0] NCCL INFO Channel 06/24 :    0   1   2   3   4   5   6   7
llm-test-cluster-9:1281341:1283500 [0] NCCL INFO Channel 07/24 :    0   1   2   3   4   5   6   7
llm-test-cluster-9:1281341:1283500 [0] NCCL INFO Channel 08/24 :    0   1   2   3   4   5   6   7
llm-test-cluster-9:1281341:1283500 [0] NCCL INFO Channel 09/24 :    0   1   2   3   4   5   6   7
llm-test-cluster-9:1281343:1283503 [2] NCCL INFO Trees [0] 3/-1/-1->2->1 [1] 3/-1/-1->2->1 [2] 3/-1/-1->2->1 [3] 3/-1/-1->2->1 [4] 3/-1/-1->2->1 [5] 3/-1/-1->2->1 [6] 3/-1/-1->2->1 [7] 3/-1/-1->2->1 [8] 3/-1/-1->2->1 [9] 3/-1/-1->2->1 [10] 3/-1/-1->2->1 [11] 3/-1/-1->2->1 [12] 3/-1/-1->2->1 [13] 3/-1/-1->2->1 [14] 3/-1/-1->2->1 [15] 3/-1/-1->2->1 [16] 3/-1/-1->2->1 [17] 3/-1/-1->2->1 [18] 3/-1/-1->2->1 [19] 3/-1/-1->2->1 [20] 3/-1/-1->2->1 [21] 3/-1/-1->2->1 [22] 3/-1/-1->2->1 [23] 3/-1/-1->2->1
llm-test-cluster-9:1281341:1283500 [0] NCCL INFO Channel 10/24 :    0   1   2   3   4   5   6   7
llm-test-cluster-9:1281341:1283500 [0] NCCL INFO Channel 11/24 :    0   1   2   3   4   5   6   7
llm-test-cluster-9:1281341:1283500 [0] NCCL INFO Channel 12/24 :    0   1   2   3   4   5   6   7
llm-test-cluster-9:1281341:1283500 [0] NCCL INFO Channel 13/24 :    0   1   2   3   4   5   6   7
llm-test-cluster-9:1281341:1283500 [0] NCCL INFO Channel 14/24 :    0   1   2   3   4   5   6   7
llm-test-cluster-9:1281341:1283500 [0] NCCL INFO Channel 15/24 :    0   1   2   3   4   5   6   7
llm-test-cluster-9:1281344:1283502 [3] NCCL INFO Trees [0] 4/-1/-1->3->2 [1] 4/-1/-1->3->2 [2] 4/-1/-1->3->2 [3] 4/-1/-1->3->2 [4] 4/-1/-1->3->2 [5] 4/-1/-1->3->2 [6] 4/-1/-1->3->2 [7] 4/-1/-1->3->2 [8] 4/-1/-1->3->2 [9] 4/-1/-1->3->2 [10] 4/-1/-1->3->2 [11] 4/-1/-1->3->2 [12] 4/-1/-1->3->2 [13] 4/-1/-1->3->2 [14] 4/-1/-1->3->2 [15] 4/-1/-1->3->2 [16] 4/-1/-1->3->2 [17] 4/-1/-1->3->2 [18] 4/-1/-1->3->2 [19] 4/-1/-1->3->2 [20] 4/-1/-1->3->2 [21] 4/-1/-1->3->2 [22] 4/-1/-1->3->2 [23] 4/-1/-1->3->2
llm-test-cluster-9:1281341:1283500 [0] NCCL INFO Channel 16/24 :    0   1   2   3   4   5   6   7
llm-test-cluster-9:1281341:1283500 [0] NCCL INFO Channel 17/24 :    0   1   2   3   4   5   6   7
llm-test-cluster-9:1281341:1283500 [0] NCCL INFO Channel 18/24 :    0   1   2   3   4   5   6   7
llm-test-cluster-9:1281341:1283500 [0] NCCL INFO Channel 19/24 :    0   1   2   3   4   5   6   7
llm-test-cluster-9:1281341:1283500 [0] NCCL INFO Channel 20/24 :    0   1   2   3   4   5   6   7
llm-test-cluster-9:1281341:1283500 [0] NCCL INFO Channel 21/24 :    0   1   2   3   4   5   6   7
llm-test-cluster-9:1281341:1283500 [0] NCCL INFO Channel 22/24 :    0   1   2   3   4   5   6   7
llm-test-cluster-9:1281341:1283500 [0] NCCL INFO Channel 23/24 :    0   1   2   3   4   5   6   7
llm-test-cluster-9:1281341:1283500 [0] NCCL INFO Trees [0] 1/-1/-1->0->-1 [1] 1/-1/-1->0->-1 [2] 1/-1/-1->0->-1 [3] 1/-1/-1->0->-1 [4] 1/-1/-1->0->-1 [5] 1/-1/-1->0->-1 [6] 1/-1/-1->0->-1 [7] 1/-1/-1->0->-1 [8] 1/-1/-1->0->-1 [9] 1/-1/-1->0->-1 [10] 1/-1/-1->0->-1 [11] 1/-1/-1->0->-1 [12] 1/-1/-1->0->-1 [13] 1/-1/-1->0->-1 [14] 1/-1/-1->0->-1 [15] 1/-1/-1->0->-1 [16] 1/-1/-1->0->-1 [17] 1/-1/-1->0->-1 [18] 1/-1/-1->0->-1 [19] 1/-1/-1->0->-1 [20] 1/-1/-1->0->-1 [21] 1/-1/-1->0->-1 [22] 1/-1/-1->0->-1 [23] 1/-1/-1->0->-1
llm-test-cluster-9:1281347:1283504 [6] NCCL INFO Trees [0] 7/-1/-1->6->5 [1] 7/-1/-1->6->5 [2] 7/-1/-1->6->5 [3] 7/-1/-1->6->5 [4] 7/-1/-1->6->5 [5] 7/-1/-1->6->5 [6] 7/-1/-1->6->5 [7] 7/-1/-1->6->5 [8] 7/-1/-1->6->5 [9] 7/-1/-1->6->5 [10] 7/-1/-1->6->5 [11] 7/-1/-1->6->5 [12] 7/-1/-1->6->5 [13] 7/-1/-1->6->5 [14] 7/-1/-1->6->5 [15] 7/-1/-1->6->5 [16] 7/-1/-1->6->5 [17] 7/-1/-1->6->5 [18] 7/-1/-1->6->5 [19] 7/-1/-1->6->5 [20] 7/-1/-1->6->5 [21] 7/-1/-1->6->5 [22] 7/-1/-1->6->5 [23] 7/-1/-1->6->5
llm-test-cluster-9:1281345:1283507 [4] NCCL INFO Trees [0] 5/-1/-1->4->3 [1] 5/-1/-1->4->3 [2] 5/-1/-1->4->3 [3] 5/-1/-1->4->3 [4] 5/-1/-1->4->3 [5] 5/-1/-1->4->3 [6] 5/-1/-1->4->3 [7] 5/-1/-1->4->3 [8] 5/-1/-1->4->3 [9] 5/-1/-1->4->3 [10] 5/-1/-1->4->3 [11] 5/-1/-1->4->3 [12] 5/-1/-1->4->3 [13] 5/-1/-1->4->3 [14] 5/-1/-1->4->3 [15] 5/-1/-1->4->3 [16] 5/-1/-1->4->3 [17] 5/-1/-1->4->3 [18] 5/-1/-1->4->3 [19] 5/-1/-1->4->3 [20] 5/-1/-1->4->3 [21] 5/-1/-1->4->3 [22] 5/-1/-1->4->3 [23] 5/-1/-1->4->3
llm-test-cluster-9:1281346:1283505 [5] NCCL INFO Trees [0] 6/-1/-1->5->4 [1] 6/-1/-1->5->4 [2] 6/-1/-1->5->4 [3] 6/-1/-1->5->4 [4] 6/-1/-1->5->4 [5] 6/-1/-1->5->4 [6] 6/-1/-1->5->4 [7] 6/-1/-1->5->4 [8] 6/-1/-1->5->4 [9] 6/-1/-1->5->4 [10] 6/-1/-1->5->4 [11] 6/-1/-1->5->4 [12] 6/-1/-1->5->4 [13] 6/-1/-1->5->4 [14] 6/-1/-1->5->4 [15] 6/-1/-1->5->4 [16] 6/-1/-1->5->4 [17] 6/-1/-1->5->4 [18] 6/-1/-1->5->4 [19] 6/-1/-1->5->4 [20] 6/-1/-1->5->4 [21] 6/-1/-1->5->4 [22] 6/-1/-1->5->4 [23] 6/-1/-1->5->4
llm-test-cluster-9:1281348:1283506 [7] NCCL INFO Trees [0] -1/-1/-1->7->6 [1] -1/-1/-1->7->6 [2] -1/-1/-1->7->6 [3] -1/-1/-1->7->6 [4] -1/-1/-1->7->6 [5] -1/-1/-1->7->6 [6] -1/-1/-1->7->6 [7] -1/-1/-1->7->6 [8] -1/-1/-1->7->6 [9] -1/-1/-1->7->6 [10] -1/-1/-1->7->6 [11] -1/-1/-1->7->6 [12] -1/-1/-1->7->6 [13] -1/-1/-1->7->6 [14] -1/-1/-1->7->6 [15] -1/-1/-1->7->6 [16] -1/-1/-1->7->6 [17] -1/-1/-1->7->6 [18] -1/-1/-1->7->6 [19] -1/-1/-1->7->6 [20] -1/-1/-1->7->6 [21] -1/-1/-1->7->6 [22] -1/-1/-1->7->6 [23] -1/-1/-1->7->6

llm-test-cluster-9:1281342:1283501 [1] include/alloc.h:50 NCCL WARN Cuda failure 'an illegal memory access was encountered'
llm-test-cluster-9:1281342:1283501 [1] NCCL INFO channel.cc:20 -> 1
llm-test-cluster-9:1281342:1283501 [1] NCCL INFO init.cc:373 -> 1
llm-test-cluster-9:1281342:1283501 [1] NCCL INFO init.cc:774 -> 1
llm-test-cluster-9:1281342:1283501 [1] NCCL INFO init.cc:904 -> 1
llm-test-cluster-9:1281342:1283501 [1] NCCL INFO group.cc:72 -> 1 [Async thread]
Traceback (most recent call last):
  File "scripts/inference/bloom-ds-inference.py", line 257, in <module>
    _ = generate()
  File "scripts/inference/bloom-ds-inference.py", line 244, in generate
    outputs = model.generate(**input_tokens, **generate_kwargs)
  File "/net/llm-shared-nfs/nfs/yelkurdi/conda/miniconda3/envs/bloom/lib/python3.8/site-packages/torch/autograd/grad_mode.py", line 27, in decorate_context
    return func(*args, **kwargs)
  File "/net/llm-shared-nfs/nfs/yelkurdi/conda/miniconda3/envs/bloom/lib/python3.8/site-packages/transformers/generation_utils.py", line 1288, in generate
    return self.greedy_search(
  File "/net/llm-shared-nfs/nfs/yelkurdi/conda/miniconda3/envs/bloom/lib/python3.8/site-packages/transformers/generation_utils.py", line 1683, in greedy_search
    outputs = self(
  File "/net/llm-shared-nfs/nfs/yelkurdi/conda/miniconda3/envs/bloom/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1110, in _call_impl
    return forward_call(*input, **kwargs)
  File "/net/llm-shared-nfs/nfs/mayank/DeepSpeed/deepspeed/inference/engine.py", line 508, in forward
    outputs = self.model_orig_fwd(*inputs, **kwargs)
  File "/net/llm-shared-nfs/nfs/yelkurdi/conda/miniconda3/envs/bloom/lib/python3.8/site-packages/transformers/models/bloom/modeling_bloom.py", line 919, in forward
    transformer_outputs = self.transformer(
  File "/net/llm-shared-nfs/nfs/yelkurdi/conda/miniconda3/envs/bloom/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1110, in _call_impl
    return forward_call(*input, **kwargs)
  File "/net/llm-shared-nfs/nfs/yelkurdi/conda/miniconda3/envs/bloom/lib/python3.8/site-packages/transformers/models/bloom/modeling_bloom.py", line 806, in forward
    outputs = block(
  File "/net/llm-shared-nfs/nfs/yelkurdi/conda/miniconda3/envs/bloom/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1110, in _call_impl
    return forward_call(*input, **kwargs)
  File "/net/llm-shared-nfs/nfs/mayank/DeepSpeed/deepspeed/ops/transformer/inference/transformer_inference.py", line 831, in forward
    self.attention(input,
  File "/net/llm-shared-nfs/nfs/yelkurdi/conda/miniconda3/envs/bloom/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1110, in _call_impl
    return forward_call(*input, **kwargs)
  File "/net/llm-shared-nfs/nfs/mayank/DeepSpeed/deepspeed/ops/transformer/inference/transformer_inference.py", line 543, in forward
    output = DeepSpeedSelfAttentionFunction.apply(
  File "/net/llm-shared-nfs/nfs/mayank/DeepSpeed/deepspeed/ops/transformer/inference/transformer_inference.py", line 466, in forward
    dist.all_reduce(output, group=mp_group)
  File "/net/llm-shared-nfs/nfs/mayank/DeepSpeed/deepspeed/comm/comm.py", line 312, in all_reduce
    return cdb.all_reduce(tensor, op, group, async_op)
  File "/net/llm-shared-nfs/nfs/mayank/DeepSpeed/deepspeed/comm/torch.py", line 49, in all_reduce
    return torch.distributed.all_reduce(tensor=tensor,
  File "/net/llm-shared-nfs/nfs/yelkurdi/conda/miniconda3/envs/bloom/lib/python3.8/site-packages/torch/distributed/distributed_c10d.py", line 1316, in all_reduce
    work = group.allreduce([tensor], opts)
RuntimeError: NCCL error in: /net/llm-shared-nfs/nfs/mayank/pytorch/torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1169, unhandled cuda error, NCCL version 21.0.3
ncclUnhandledCudaError: Call to CUDA function failed.

llm-test-cluster-9:1281344:1283502 [3] include/alloc.h:50 NCCL WARN Cuda failure 'an illegal memory access was encountered'
llm-test-cluster-9:1281344:1283502 [3] NCCL INFO channel.cc:20 -> 1
llm-test-cluster-9:1281344:1283502 [3] NCCL INFO init.cc:373 -> 1
llm-test-cluster-9:1281344:1283502 [3] NCCL INFO init.cc:774 -> 1
llm-test-cluster-9:1281344:1283502 [3] NCCL INFO init.cc:904 -> 1
llm-test-cluster-9:1281344:1283502 [3] NCCL INFO group.cc:72 -> 1 [Async thread]
Traceback (most recent call last):
  File "scripts/inference/bloom-ds-inference.py", line 257, in <module>
    _ = generate()
  File "scripts/inference/bloom-ds-inference.py", line 244, in generate
    outputs = model.generate(**input_tokens, **generate_kwargs)
  File "/net/llm-shared-nfs/nfs/yelkurdi/conda/miniconda3/envs/bloom/lib/python3.8/site-packages/torch/autograd/grad_mode.py", line 27, in decorate_context
    return func(*args, **kwargs)
  File "/net/llm-shared-nfs/nfs/yelkurdi/conda/miniconda3/envs/bloom/lib/python3.8/site-packages/transformers/generation_utils.py", line 1288, in generate
    return self.greedy_search(
  File "/net/llm-shared-nfs/nfs/yelkurdi/conda/miniconda3/envs/bloom/lib/python3.8/site-packages/transformers/generation_utils.py", line 1683, in greedy_search
    outputs = self(
  File "/net/llm-shared-nfs/nfs/yelkurdi/conda/miniconda3/envs/bloom/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1110, in _call_impl
    return forward_call(*input, **kwargs)
  File "/net/llm-shared-nfs/nfs/mayank/DeepSpeed/deepspeed/inference/engine.py", line 508, in forward
    outputs = self.model_orig_fwd(*inputs, **kwargs)
  File "/net/llm-shared-nfs/nfs/yelkurdi/conda/miniconda3/envs/bloom/lib/python3.8/site-packages/transformers/models/bloom/modeling_bloom.py", line 919, in forward
    transformer_outputs = self.transformer(
  File "/net/llm-shared-nfs/nfs/yelkurdi/conda/miniconda3/envs/bloom/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1110, in _call_impl
    return forward_call(*input, **kwargs)
  File "/net/llm-shared-nfs/nfs/yelkurdi/conda/miniconda3/envs/bloom/lib/python3.8/site-packages/transformers/models/bloom/modeling_bloom.py", line 806, in forward
    outputs = block(
  File "/net/llm-shared-nfs/nfs/yelkurdi/conda/miniconda3/envs/bloom/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1110, in _call_impl
    return forward_call(*input, **kwargs)
  File "/net/llm-shared-nfs/nfs/mayank/DeepSpeed/deepspeed/ops/transformer/inference/transformer_inference.py", line 831, in forward
    self.attention(input,
  File "/net/llm-shared-nfs/nfs/yelkurdi/conda/miniconda3/envs/bloom/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1110, in _call_impl
    return forward_call(*input, **kwargs)
  File "/net/llm-shared-nfs/nfs/mayank/DeepSpeed/deepspeed/ops/transformer/inference/transformer_inference.py", line 543, in forward
    output = DeepSpeedSelfAttentionFunction.apply(
  File "/net/llm-shared-nfs/nfs/mayank/DeepSpeed/deepspeed/ops/transformer/inference/transformer_inference.py", line 466, in forward
    dist.all_reduce(output, group=mp_group)
  File "/net/llm-shared-nfs/nfs/mayank/DeepSpeed/deepspeed/comm/comm.py", line 312, in all_reduce
    return cdb.all_reduce(tensor, op, group, async_op)
  File "/net/llm-shared-nfs/nfs/mayank/DeepSpeed/deepspeed/comm/torch.py", line 49, in all_reduce
    return torch.distributed.all_reduce(tensor=tensor,
  File "/net/llm-shared-nfs/nfs/yelkurdi/conda/miniconda3/envs/bloom/lib/python3.8/site-packages/torch/distributed/distributed_c10d.py", line 1316, in all_reduce
    work = group.allreduce([tensor], opts)
RuntimeError: NCCL error in: /net/llm-shared-nfs/nfs/mayank/pytorch/torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1169, unhandled cuda error, NCCL version 21.0.3
ncclUnhandledCudaError: Call to CUDA function failed.

llm-test-cluster-9:1281343:1283503 [2] include/alloc.h:50 NCCL WARN Cuda failure 'an illegal memory access was encountered'
llm-test-cluster-9:1281343:1283503 [2] NCCL INFO channel.cc:20 -> 1
llm-test-cluster-9:1281343:1283503 [2] NCCL INFO init.cc:373 -> 1
llm-test-cluster-9:1281343:1283503 [2] NCCL INFO init.cc:774 -> 1
llm-test-cluster-9:1281343:1283503 [2] NCCL INFO init.cc:904 -> 1
llm-test-cluster-9:1281343:1283503 [2] NCCL INFO group.cc:72 -> 1 [Async thread]
Traceback (most recent call last):
  File "scripts/inference/bloom-ds-inference.py", line 257, in <module>
    _ = generate()
  File "scripts/inference/bloom-ds-inference.py", line 244, in generate
    outputs = model.generate(**input_tokens, **generate_kwargs)
  File "/net/llm-shared-nfs/nfs/yelkurdi/conda/miniconda3/envs/bloom/lib/python3.8/site-packages/torch/autograd/grad_mode.py", line 27, in decorate_context
    return func(*args, **kwargs)
  File "/net/llm-shared-nfs/nfs/yelkurdi/conda/miniconda3/envs/bloom/lib/python3.8/site-packages/transformers/generation_utils.py", line 1288, in generate
    return self.greedy_search(
  File "/net/llm-shared-nfs/nfs/yelkurdi/conda/miniconda3/envs/bloom/lib/python3.8/site-packages/transformers/generation_utils.py", line 1683, in greedy_search
    outputs = self(
  File "/net/llm-shared-nfs/nfs/yelkurdi/conda/miniconda3/envs/bloom/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1110, in _call_impl
    return forward_call(*input, **kwargs)
  File "/net/llm-shared-nfs/nfs/mayank/DeepSpeed/deepspeed/inference/engine.py", line 508, in forward
    outputs = self.model_orig_fwd(*inputs, **kwargs)
  File "/net/llm-shared-nfs/nfs/yelkurdi/conda/miniconda3/envs/bloom/lib/python3.8/site-packages/transformers/models/bloom/modeling_bloom.py", line 919, in forward
    transformer_outputs = self.transformer(
  File "/net/llm-shared-nfs/nfs/yelkurdi/conda/miniconda3/envs/bloom/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1110, in _call_impl
    return forward_call(*input, **kwargs)
  File "/net/llm-shared-nfs/nfs/yelkurdi/conda/miniconda3/envs/bloom/lib/python3.8/site-packages/transformers/models/bloom/modeling_bloom.py", line 806, in forward
    outputs = block(
  File "/net/llm-shared-nfs/nfs/yelkurdi/conda/miniconda3/envs/bloom/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1110, in _call_impl
    return forward_call(*input, **kwargs)
  File "/net/llm-shared-nfs/nfs/mayank/DeepSpeed/deepspeed/ops/transformer/inference/transformer_inference.py", line 831, in forward
    self.attention(input,
  File "/net/llm-shared-nfs/nfs/yelkurdi/conda/miniconda3/envs/bloom/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1110, in _call_impl
    return forward_call(*input, **kwargs)
  File "/net/llm-shared-nfs/nfs/mayank/DeepSpeed/deepspeed/ops/transformer/inference/transformer_inference.py", line 543, in forward
    output = DeepSpeedSelfAttentionFunction.apply(
  File "/net/llm-shared-nfs/nfs/mayank/DeepSpeed/deepspeed/ops/transformer/inference/transformer_inference.py", line 466, in forward
    dist.all_reduce(output, group=mp_group)
  File "/net/llm-shared-nfs/nfs/mayank/DeepSpeed/deepspeed/comm/comm.py", line 312, in all_reduce
    return cdb.all_reduce(tensor, op, group, async_op)
  File "/net/llm-shared-nfs/nfs/mayank/DeepSpeed/deepspeed/comm/torch.py", line 49, in all_reduce
    return torch.distributed.all_reduce(tensor=tensor,
  File "/net/llm-shared-nfs/nfs/yelkurdi/conda/miniconda3/envs/bloom/lib/python3.8/site-packages/torch/distributed/distributed_c10d.py", line 1316, in all_reduce
    work = group.allreduce([tensor], opts)
RuntimeError: NCCL error in: /net/llm-shared-nfs/nfs/mayank/pytorch/torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1169, unhandled cuda error, NCCL version 21.0.3
ncclUnhandledCudaError: Call to CUDA function failed.

llm-test-cluster-9:1281347:1283504 [6] include/alloc.h:50 NCCL WARN Cuda failure 'an illegal memory access was encountered'
llm-test-cluster-9:1281347:1283504 [6] NCCL INFO channel.cc:20 -> 1
llm-test-cluster-9:1281347:1283504 [6] NCCL INFO init.cc:373 -> 1
llm-test-cluster-9:1281347:1283504 [6] NCCL INFO init.cc:774 -> 1
llm-test-cluster-9:1281347:1283504 [6] NCCL INFO init.cc:904 -> 1
llm-test-cluster-9:1281347:1283504 [6] NCCL INFO group.cc:72 -> 1 [Async thread]
Traceback (most recent call last):
  File "scripts/inference/bloom-ds-inference.py", line 257, in <module>
    _ = generate()
  File "scripts/inference/bloom-ds-inference.py", line 244, in generate
    outputs = model.generate(**input_tokens, **generate_kwargs)
  File "/net/llm-shared-nfs/nfs/yelkurdi/conda/miniconda3/envs/bloom/lib/python3.8/site-packages/torch/autograd/grad_mode.py", line 27, in decorate_context
    return func(*args, **kwargs)
  File "/net/llm-shared-nfs/nfs/yelkurdi/conda/miniconda3/envs/bloom/lib/python3.8/site-packages/transformers/generation_utils.py", line 1288, in generate
    return self.greedy_search(
  File "/net/llm-shared-nfs/nfs/yelkurdi/conda/miniconda3/envs/bloom/lib/python3.8/site-packages/transformers/generation_utils.py", line 1683, in greedy_search
    outputs = self(
  File "/net/llm-shared-nfs/nfs/yelkurdi/conda/miniconda3/envs/bloom/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1110, in _call_impl
    return forward_call(*input, **kwargs)
  File "/net/llm-shared-nfs/nfs/mayank/DeepSpeed/deepspeed/inference/engine.py", line 508, in forward
    outputs = self.model_orig_fwd(*inputs, **kwargs)
  File "/net/llm-shared-nfs/nfs/yelkurdi/conda/miniconda3/envs/bloom/lib/python3.8/site-packages/transformers/models/bloom/modeling_bloom.py", line 919, in forward
    transformer_outputs = self.transformer(
  File "/net/llm-shared-nfs/nfs/yelkurdi/conda/miniconda3/envs/bloom/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1110, in _call_impl
    return forward_call(*input, **kwargs)
  File "/net/llm-shared-nfs/nfs/yelkurdi/conda/miniconda3/envs/bloom/lib/python3.8/site-packages/transformers/models/bloom/modeling_bloom.py", line 806, in forward
    outputs = block(
  File "/net/llm-shared-nfs/nfs/yelkurdi/conda/miniconda3/envs/bloom/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1110, in _call_impl
    return forward_call(*input, **kwargs)
  File "/net/llm-shared-nfs/nfs/mayank/DeepSpeed/deepspeed/ops/transformer/inference/transformer_inference.py", line 831, in forward
    self.attention(input,
  File "/net/llm-shared-nfs/nfs/yelkurdi/conda/miniconda3/envs/bloom/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1110, in _call_impl
    return forward_call(*input, **kwargs)
  File "/net/llm-shared-nfs/nfs/mayank/DeepSpeed/deepspeed/ops/transformer/inference/transformer_inference.py", line 543, in forward
    output = DeepSpeedSelfAttentionFunction.apply(
  File "/net/llm-shared-nfs/nfs/mayank/DeepSpeed/deepspeed/ops/transformer/inference/transformer_inference.py", line 466, in forward
    dist.all_reduce(output, group=mp_group)
  File "/net/llm-shared-nfs/nfs/mayank/DeepSpeed/deepspeed/comm/comm.py", line 312, in all_reduce
    return cdb.all_reduce(tensor, op, group, async_op)
  File "/net/llm-shared-nfs/nfs/mayank/DeepSpeed/deepspeed/comm/torch.py", line 49, in all_reduce
    return torch.distributed.all_reduce(tensor=tensor,
  File "/net/llm-shared-nfs/nfs/yelkurdi/conda/miniconda3/envs/bloom/lib/python3.8/site-packages/torch/distributed/distributed_c10d.py", line 1316, in all_reduce
    work = group.allreduce([tensor], opts)
RuntimeError: NCCL error in: /net/llm-shared-nfs/nfs/mayank/pytorch/torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1169, unhandled cuda error, NCCL version 21.0.3
ncclUnhandledCudaError: Call to CUDA function failed.

llm-test-cluster-9:1281346:1283505 [5] include/alloc.h:50 NCCL WARN Cuda failure 'an illegal memory access was encountered'
llm-test-cluster-9:1281346:1283505 [5] NCCL INFO channel.cc:20 -> 1
llm-test-cluster-9:1281346:1283505 [5] NCCL INFO init.cc:373 -> 1
llm-test-cluster-9:1281346:1283505 [5] NCCL INFO init.cc:774 -> 1
llm-test-cluster-9:1281346:1283505 [5] NCCL INFO init.cc:904 -> 1
llm-test-cluster-9:1281346:1283505 [5] NCCL INFO group.cc:72 -> 1 [Async thread]
Traceback (most recent call last):
  File "scripts/inference/bloom-ds-inference.py", line 257, in <module>
    _ = generate()
  File "scripts/inference/bloom-ds-inference.py", line 244, in generate
    outputs = model.generate(**input_tokens, **generate_kwargs)
  File "/net/llm-shared-nfs/nfs/yelkurdi/conda/miniconda3/envs/bloom/lib/python3.8/site-packages/torch/autograd/grad_mode.py", line 27, in decorate_context
    return func(*args, **kwargs)
  File "/net/llm-shared-nfs/nfs/yelkurdi/conda/miniconda3/envs/bloom/lib/python3.8/site-packages/transformers/generation_utils.py", line 1288, in generate
    return self.greedy_search(
  File "/net/llm-shared-nfs/nfs/yelkurdi/conda/miniconda3/envs/bloom/lib/python3.8/site-packages/transformers/generation_utils.py", line 1683, in greedy_search
    outputs = self(
  File "/net/llm-shared-nfs/nfs/yelkurdi/conda/miniconda3/envs/bloom/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1110, in _call_impl
    return forward_call(*input, **kwargs)
  File "/net/llm-shared-nfs/nfs/mayank/DeepSpeed/deepspeed/inference/engine.py", line 508, in forward
    outputs = self.model_orig_fwd(*inputs, **kwargs)
  File "/net/llm-shared-nfs/nfs/yelkurdi/conda/miniconda3/envs/bloom/lib/python3.8/site-packages/transformers/models/bloom/modeling_bloom.py", line 919, in forward
    transformer_outputs = self.transformer(
  File "/net/llm-shared-nfs/nfs/yelkurdi/conda/miniconda3/envs/bloom/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1110, in _call_impl
    return forward_call(*input, **kwargs)
  File "/net/llm-shared-nfs/nfs/yelkurdi/conda/miniconda3/envs/bloom/lib/python3.8/site-packages/transformers/models/bloom/modeling_bloom.py", line 806, in forward
    outputs = block(
  File "/net/llm-shared-nfs/nfs/yelkurdi/conda/miniconda3/envs/bloom/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1110, in _call_impl
    return forward_call(*input, **kwargs)
  File "/net/llm-shared-nfs/nfs/mayank/DeepSpeed/deepspeed/ops/transformer/inference/transformer_inference.py", line 831, in forward
    self.attention(input,
  File "/net/llm-shared-nfs/nfs/yelkurdi/conda/miniconda3/envs/bloom/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1110, in _call_impl
    return forward_call(*input, **kwargs)
  File "/net/llm-shared-nfs/nfs/mayank/DeepSpeed/deepspeed/ops/transformer/inference/transformer_inference.py", line 543, in forward
    output = DeepSpeedSelfAttentionFunction.apply(
  File "/net/llm-shared-nfs/nfs/mayank/DeepSpeed/deepspeed/ops/transformer/inference/transformer_inference.py", line 466, in forward
    dist.all_reduce(output, group=mp_group)
  File "/net/llm-shared-nfs/nfs/mayank/DeepSpeed/deepspeed/comm/comm.py", line 312, in all_reduce
    return cdb.all_reduce(tensor, op, group, async_op)
  File "/net/llm-shared-nfs/nfs/mayank/DeepSpeed/deepspeed/comm/torch.py", line 49, in all_reduce
    return torch.distributed.all_reduce(tensor=tensor,
  File "/net/llm-shared-nfs/nfs/yelkurdi/conda/miniconda3/envs/bloom/lib/python3.8/site-packages/torch/distributed/distributed_c10d.py", line 1316, in all_reduce
    work = group.allreduce([tensor], opts)
RuntimeError: NCCL error in: /net/llm-shared-nfs/nfs/mayank/pytorch/torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1169, unhandled cuda error, NCCL version 21.0.3
ncclUnhandledCudaError: Call to CUDA function failed.

llm-test-cluster-9:1281348:1283506 [7] include/alloc.h:50 NCCL WARN Cuda failure 'an illegal memory access was encountered'
llm-test-cluster-9:1281348:1283506 [7] NCCL INFO channel.cc:20 -> 1
llm-test-cluster-9:1281348:1283506 [7] NCCL INFO init.cc:373 -> 1
llm-test-cluster-9:1281348:1283506 [7] NCCL INFO init.cc:774 -> 1
llm-test-cluster-9:1281348:1283506 [7] NCCL INFO init.cc:904 -> 1
llm-test-cluster-9:1281348:1283506 [7] NCCL INFO group.cc:72 -> 1 [Async thread]
Traceback (most recent call last):
  File "scripts/inference/bloom-ds-inference.py", line 257, in <module>
    _ = generate()
  File "scripts/inference/bloom-ds-inference.py", line 244, in generate
    outputs = model.generate(**input_tokens, **generate_kwargs)
  File "/net/llm-shared-nfs/nfs/yelkurdi/conda/miniconda3/envs/bloom/lib/python3.8/site-packages/torch/autograd/grad_mode.py", line 27, in decorate_context
    return func(*args, **kwargs)
  File "/net/llm-shared-nfs/nfs/yelkurdi/conda/miniconda3/envs/bloom/lib/python3.8/site-packages/transformers/generation_utils.py", line 1288, in generate
    return self.greedy_search(
  File "/net/llm-shared-nfs/nfs/yelkurdi/conda/miniconda3/envs/bloom/lib/python3.8/site-packages/transformers/generation_utils.py", line 1683, in greedy_search
    outputs = self(
  File "/net/llm-shared-nfs/nfs/yelkurdi/conda/miniconda3/envs/bloom/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1110, in _call_impl
    return forward_call(*input, **kwargs)
  File "/net/llm-shared-nfs/nfs/mayank/DeepSpeed/deepspeed/inference/engine.py", line 508, in forward
    outputs = self.model_orig_fwd(*inputs, **kwargs)
  File "/net/llm-shared-nfs/nfs/yelkurdi/conda/miniconda3/envs/bloom/lib/python3.8/site-packages/transformers/models/bloom/modeling_bloom.py", line 919, in forward
    transformer_outputs = self.transformer(
  File "/net/llm-shared-nfs/nfs/yelkurdi/conda/miniconda3/envs/bloom/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1110, in _call_impl
    return forward_call(*input, **kwargs)
  File "/net/llm-shared-nfs/nfs/yelkurdi/conda/miniconda3/envs/bloom/lib/python3.8/site-packages/transformers/models/bloom/modeling_bloom.py", line 806, in forward
    outputs = block(
  File "/net/llm-shared-nfs/nfs/yelkurdi/conda/miniconda3/envs/bloom/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1110, in _call_impl
    return forward_call(*input, **kwargs)
  File "/net/llm-shared-nfs/nfs/mayank/DeepSpeed/deepspeed/ops/transformer/inference/transformer_inference.py", line 831, in forward
    self.attention(input,
  File "/net/llm-shared-nfs/nfs/yelkurdi/conda/miniconda3/envs/bloom/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1110, in _call_impl
    return forward_call(*input, **kwargs)
  File "/net/llm-shared-nfs/nfs/mayank/DeepSpeed/deepspeed/ops/transformer/inference/transformer_inference.py", line 543, in forward
    output = DeepSpeedSelfAttentionFunction.apply(
  File "/net/llm-shared-nfs/nfs/mayank/DeepSpeed/deepspeed/ops/transformer/inference/transformer_inference.py", line 466, in forward
    dist.all_reduce(output, group=mp_group)
  File "/net/llm-shared-nfs/nfs/mayank/DeepSpeed/deepspeed/comm/comm.py", line 312, in all_reduce
    return cdb.all_reduce(tensor, op, group, async_op)
  File "/net/llm-shared-nfs/nfs/mayank/DeepSpeed/deepspeed/comm/torch.py", line 49, in all_reduce
    return torch.distributed.all_reduce(tensor=tensor,
  File "/net/llm-shared-nfs/nfs/yelkurdi/conda/miniconda3/envs/bloom/lib/python3.8/site-packages/torch/distributed/distributed_c10d.py", line 1316, in all_reduce
    work = group.allreduce([tensor], opts)
RuntimeError: NCCL error in: /net/llm-shared-nfs/nfs/mayank/pytorch/torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1169, unhandled cuda error, NCCL version 21.0.3
ncclUnhandledCudaError: Call to CUDA function failed.

llm-test-cluster-9:1281345:1283507 [4] include/alloc.h:50 NCCL WARN Cuda failure 'an illegal memory access was encountered'
llm-test-cluster-9:1281345:1283507 [4] NCCL INFO channel.cc:20 -> 1
llm-test-cluster-9:1281345:1283507 [4] NCCL INFO init.cc:373 -> 1
llm-test-cluster-9:1281345:1283507 [4] NCCL INFO init.cc:774 -> 1
llm-test-cluster-9:1281345:1283507 [4] NCCL INFO init.cc:904 -> 1
llm-test-cluster-9:1281345:1283507 [4] NCCL INFO group.cc:72 -> 1 [Async thread]
Traceback (most recent call last):
  File "scripts/inference/bloom-ds-inference.py", line 257, in <module>
    _ = generate()
  File "scripts/inference/bloom-ds-inference.py", line 244, in generate
    outputs = model.generate(**input_tokens, **generate_kwargs)
  File "/net/llm-shared-nfs/nfs/yelkurdi/conda/miniconda3/envs/bloom/lib/python3.8/site-packages/torch/autograd/grad_mode.py", line 27, in decorate_context
    return func(*args, **kwargs)
  File "/net/llm-shared-nfs/nfs/yelkurdi/conda/miniconda3/envs/bloom/lib/python3.8/site-packages/transformers/generation_utils.py", line 1288, in generate
    return self.greedy_search(
  File "/net/llm-shared-nfs/nfs/yelkurdi/conda/miniconda3/envs/bloom/lib/python3.8/site-packages/transformers/generation_utils.py", line 1683, in greedy_search
    outputs = self(
  File "/net/llm-shared-nfs/nfs/yelkurdi/conda/miniconda3/envs/bloom/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1110, in _call_impl
    return forward_call(*input, **kwargs)
  File "/net/llm-shared-nfs/nfs/mayank/DeepSpeed/deepspeed/inference/engine.py", line 508, in forward
    outputs = self.model_orig_fwd(*inputs, **kwargs)
  File "/net/llm-shared-nfs/nfs/yelkurdi/conda/miniconda3/envs/bloom/lib/python3.8/site-packages/transformers/models/bloom/modeling_bloom.py", line 919, in forward
    transformer_outputs = self.transformer(
  File "/net/llm-shared-nfs/nfs/yelkurdi/conda/miniconda3/envs/bloom/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1110, in _call_impl
    return forward_call(*input, **kwargs)
  File "/net/llm-shared-nfs/nfs/yelkurdi/conda/miniconda3/envs/bloom/lib/python3.8/site-packages/transformers/models/bloom/modeling_bloom.py", line 806, in forward
    outputs = block(
  File "/net/llm-shared-nfs/nfs/yelkurdi/conda/miniconda3/envs/bloom/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1110, in _call_impl
    return forward_call(*input, **kwargs)
  File "/net/llm-shared-nfs/nfs/mayank/DeepSpeed/deepspeed/ops/transformer/inference/transformer_inference.py", line 831, in forward
    self.attention(input,
  File "/net/llm-shared-nfs/nfs/yelkurdi/conda/miniconda3/envs/bloom/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1110, in _call_impl
    return forward_call(*input, **kwargs)
  File "/net/llm-shared-nfs/nfs/mayank/DeepSpeed/deepspeed/ops/transformer/inference/transformer_inference.py", line 543, in forward
    output = DeepSpeedSelfAttentionFunction.apply(
  File "/net/llm-shared-nfs/nfs/mayank/DeepSpeed/deepspeed/ops/transformer/inference/transformer_inference.py", line 466, in forward
    dist.all_reduce(output, group=mp_group)
  File "/net/llm-shared-nfs/nfs/mayank/DeepSpeed/deepspeed/comm/comm.py", line 312, in all_reduce
    return cdb.all_reduce(tensor, op, group, async_op)
  File "/net/llm-shared-nfs/nfs/mayank/DeepSpeed/deepspeed/comm/torch.py", line 49, in all_reduce
    return torch.distributed.all_reduce(tensor=tensor,
  File "/net/llm-shared-nfs/nfs/yelkurdi/conda/miniconda3/envs/bloom/lib/python3.8/site-packages/torch/distributed/distributed_c10d.py", line 1316, in all_reduce
    work = group.allreduce([tensor], opts)
RuntimeError: NCCL error in: /net/llm-shared-nfs/nfs/mayank/pytorch/torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1169, unhandled cuda error, NCCL version 21.0.3
ncclUnhandledCudaError: Call to CUDA function failed.

llm-test-cluster-9:1281341:1283500 [0] include/alloc.h:50 NCCL WARN Cuda failure 'an illegal memory access was encountered'
llm-test-cluster-9:1281341:1283500 [0] NCCL INFO channel.cc:20 -> 1
llm-test-cluster-9:1281341:1283500 [0] NCCL INFO init.cc:373 -> 1
llm-test-cluster-9:1281341:1283500 [0] NCCL INFO init.cc:774 -> 1
llm-test-cluster-9:1281341:1283500 [0] NCCL INFO init.cc:904 -> 1
llm-test-cluster-9:1281341:1283500 [0] NCCL INFO group.cc:72 -> 1 [Async thread]
Traceback (most recent call last):
  File "scripts/inference/bloom-ds-inference.py", line 257, in <module>
    _ = generate()
  File "scripts/inference/bloom-ds-inference.py", line 244, in generate
    outputs = model.generate(**input_tokens, **generate_kwargs)
  File "/net/llm-shared-nfs/nfs/yelkurdi/conda/miniconda3/envs/bloom/lib/python3.8/site-packages/torch/autograd/grad_mode.py", line 27, in decorate_context
    return func(*args, **kwargs)
  File "/net/llm-shared-nfs/nfs/yelkurdi/conda/miniconda3/envs/bloom/lib/python3.8/site-packages/transformers/generation_utils.py", line 1288, in generate
    return self.greedy_search(
  File "/net/llm-shared-nfs/nfs/yelkurdi/conda/miniconda3/envs/bloom/lib/python3.8/site-packages/transformers/generation_utils.py", line 1683, in greedy_search
    outputs = self(
  File "/net/llm-shared-nfs/nfs/yelkurdi/conda/miniconda3/envs/bloom/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1110, in _call_impl
    return forward_call(*input, **kwargs)
  File "/net/llm-shared-nfs/nfs/mayank/DeepSpeed/deepspeed/inference/engine.py", line 508, in forward
    outputs = self.model_orig_fwd(*inputs, **kwargs)
  File "/net/llm-shared-nfs/nfs/yelkurdi/conda/miniconda3/envs/bloom/lib/python3.8/site-packages/transformers/models/bloom/modeling_bloom.py", line 919, in forward
    transformer_outputs = self.transformer(
  File "/net/llm-shared-nfs/nfs/yelkurdi/conda/miniconda3/envs/bloom/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1110, in _call_impl
    return forward_call(*input, **kwargs)
  File "/net/llm-shared-nfs/nfs/yelkurdi/conda/miniconda3/envs/bloom/lib/python3.8/site-packages/transformers/models/bloom/modeling_bloom.py", line 806, in forward
    outputs = block(
  File "/net/llm-shared-nfs/nfs/yelkurdi/conda/miniconda3/envs/bloom/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1110, in _call_impl
    return forward_call(*input, **kwargs)
  File "/net/llm-shared-nfs/nfs/mayank/DeepSpeed/deepspeed/ops/transformer/inference/transformer_inference.py", line 831, in forward
    self.attention(input,
  File "/net/llm-shared-nfs/nfs/yelkurdi/conda/miniconda3/envs/bloom/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1110, in _call_impl
    return forward_call(*input, **kwargs)
  File "/net/llm-shared-nfs/nfs/mayank/DeepSpeed/deepspeed/ops/transformer/inference/transformer_inference.py", line 543, in forward
    output = DeepSpeedSelfAttentionFunction.apply(
  File "/net/llm-shared-nfs/nfs/mayank/DeepSpeed/deepspeed/ops/transformer/inference/transformer_inference.py", line 466, in forward
    dist.all_reduce(output, group=mp_group)
  File "/net/llm-shared-nfs/nfs/mayank/DeepSpeed/deepspeed/comm/comm.py", line 312, in all_reduce
    return cdb.all_reduce(tensor, op, group, async_op)
  File "/net/llm-shared-nfs/nfs/mayank/DeepSpeed/deepspeed/comm/torch.py", line 49, in all_reduce
    return torch.distributed.all_reduce(tensor=tensor,
  File "/net/llm-shared-nfs/nfs/yelkurdi/conda/miniconda3/envs/bloom/lib/python3.8/site-packages/torch/distributed/distributed_c10d.py", line 1316, in all_reduce
    work = group.allreduce([tensor], opts)
RuntimeError: NCCL error in: /net/llm-shared-nfs/nfs/mayank/pytorch/torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1169, unhandled cuda error, NCCL version 21.0.3
ncclUnhandledCudaError: Call to CUDA function failed.
asaparov commented 2 years ago

I get the same error for batch size > 1, even with CUDA_LAUNCH_BLOCKING=1:

gr062: RuntimeError: CUDA error: an illegal memory access was encountered
gr062: terminate called after throwing an instance of 'c10::CUDAError'
gr062:   what():  CUDA error: an illegal memory access was encountered
gr062: Exception raised from create_event_internal at /opt/conda/conda-bld/pytorch_1656352657443/work/c10/cuda/CUDACachingAllocator.cpp:1387 (most recent call first):
gr062: frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7f7ad7777477 in /ext3/miniconda3/lib/python3.9/site-packages/torch/lib/libc10.so)
gr062: frame #1: <unknown function> + 0x1d4a3 (0x7f7b04d684a3 in /ext3/miniconda3/lib/python3.9/site-packages/torch/lib/libc10_cuda.so)
gr062: frame #2: c10::cuda::CUDACachingAllocator::raw_delete(void*) + 0x237 (0x7f7b04d6e417 in /ext3/miniconda3/lib/python3.9/site-packages/torch/lib/libc10_cuda.so)
gr062: frame #3: <unknown function> + 0x458c68 (0x7f7b1755cc68 in /ext3/miniconda3/lib/python3.9/site-packages/torch/lib/libtorch_python.so)
gr062: frame #4: c10::TensorImpl::release_resources() + 0x175 (0x7f7ad775ad95 in /ext3/miniconda3/lib/python3.9/site-packages/torch/lib/libc10.so)
gr062: frame #5: <unknown function> + 0x34db35 (0x7f7b17451b35 in /ext3/miniconda3/lib/python3.9/site-packages/torch/lib/libtorch_python.so)
gr062: frame #6: <unknown function> + 0x681fc8 (0x7f7b17785fc8 in /ext3/miniconda3/lib/python3.9/site-packages/torch/lib/libtorch_python.so)
gr062: frame #7: THPVariable_subclass_dealloc(_object*) + 0x2b5 (0x7f7b177862c5 in /ext3/miniconda3/lib/python3.9/site-packages/torch/lib/libtorch_python.so)
gr062: frame #8: <unknown function> + 0x127e28 (0x55bbd032ae28 in /ext3/miniconda3/bin/python3.9)
gr062: frame #9: <unknown function> + 0x134ad8 (0x55bbd0337ad8 in /ext3/miniconda3/bin/python3.9)
gr062: frame #10: <unknown function> + 0x1487ce (0x55bbd034b7ce in /ext3/miniconda3/bin/python3.9)
gr062: frame #11: <unknown function> + 0x1487bb (0x55bbd034b7bb in /ext3/miniconda3/bin/python3.9)
gr062: frame #12: <unknown function> + 0x1487bb (0x55bbd034b7bb in /ext3/miniconda3/bin/python3.9)
gr062: frame #13: <unknown function> + 0x1487bb (0x55bbd034b7bb in /ext3/miniconda3/bin/python3.9)
gr062: frame #14: <unknown function> + 0x1487bb (0x55bbd034b7bb in /ext3/miniconda3/bin/python3.9)
gr062: frame #15: <unknown function> + 0x11c661 (0x55bbd031f661 in /ext3/miniconda3/bin/python3.9)
gr062: frame #16: PyDict_SetItemString + 0x4a (0x55bbd032581a in /ext3/miniconda3/bin/python3.9)
gr062: frame #17: <unknown function> + 0x214aec (0x55bbd0417aec in /ext3/miniconda3/bin/python3.9)
gr062: frame #18: Py_FinalizeEx + 0x186 (0x55bbd0416f56 in /ext3/miniconda3/bin/python3.9)
gr062: frame #19: Py_RunMain + 0x112 (0x55bbd040a2b2 in /ext3/miniconda3/bin/python3.9)
gr062: frame #20: Py_BytesMain + 0x39 (0x55bbd03dcb79 in /ext3/miniconda3/bin/python3.9)
gr062: frame #21: __libc_start_main + 0xf3 (0x7f7b5cb060b3 in /lib/x86_64-linux-gnu/libc.so.6)
gr062: frame #22: <unknown function> + 0x1d9a81 (0x55bbd03dca81 in /ext3/miniconda3/bin/python3.9)

@stas00 @RezaYazdaniAminabadi
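
For readers following along: CUDA_LAUNCH_BLOCKING=1 makes every CUDA kernel launch synchronous, so the traceback points at the kernel that actually faulted rather than at a later call such as the NCCL all_reduce. A minimal sketch of setting it from inside the script (the placement at the top of bloom-ds-inference.py is an assumption; the variable must be set before any CUDA context is created):

# Minimal sketch: force synchronous kernel launches so an "illegal memory
# access" is reported at the real call site. Assumption: these lines sit at
# the very top of the script, before torch touches the GPU.
import os

os.environ["CUDA_LAUNCH_BLOCKING"] = "1"

import torch  # noqa: E402  (imported after setting the env var on purpose)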

pai4451 commented 2 years ago

I get the same error for batch size > 1:

gr062: RuntimeError: CUDA error: an illegal memory access was encountered
gr062: terminate called after throwing an instance of 'c10::CUDAError'
gr062:   what():  CUDA error: an illegal memory access was encountered
gr062: Exception raised from create_event_internal at /opt/conda/conda-bld/pytorch_1656352657443/work/c10/cuda/CUDACachingAllocator.cpp:1387 (most recent call first):
gr062: frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7f7ad7777477 in /ext3/miniconda3/lib/python3.9/site-packages/torch/lib/libc10.so)
gr062: frame #1: <unknown function> + 0x1d4a3 (0x7f7b04d684a3 in /ext3/miniconda3/lib/python3.9/site-packages/torch/lib/libc10_cuda.so)
gr062: frame #2: c10::cuda::CUDACachingAllocator::raw_delete(void*) + 0x237 (0x7f7b04d6e417 in /ext3/miniconda3/lib/python3.9/site-packages/torch/lib/libc10_cuda.so)
gr062: frame #3: <unknown function> + 0x458c68 (0x7f7b1755cc68 in /ext3/miniconda3/lib/python3.9/site-packages/torch/lib/libtorch_python.so)
gr062: frame #4: c10::TensorImpl::release_resources() + 0x175 (0x7f7ad775ad95 in /ext3/miniconda3/lib/python3.9/site-packages/torch/lib/libc10.so)
gr062: frame #5: <unknown function> + 0x34db35 (0x7f7b17451b35 in /ext3/miniconda3/lib/python3.9/site-packages/torch/lib/libtorch_python.so)
gr062: frame #6: <unknown function> + 0x681fc8 (0x7f7b17785fc8 in /ext3/miniconda3/lib/python3.9/site-packages/torch/lib/libtorch_python.so)
gr062: frame #7: THPVariable_subclass_dealloc(_object*) + 0x2b5 (0x7f7b177862c5 in /ext3/miniconda3/lib/python3.9/site-packages/torch/lib/libtorch_python.so)
gr062: frame #8: <unknown function> + 0x127e28 (0x55bbd032ae28 in /ext3/miniconda3/bin/python3.9)
gr062: frame #9: <unknown function> + 0x134ad8 (0x55bbd0337ad8 in /ext3/miniconda3/bin/python3.9)
gr062: frame #10: <unknown function> + 0x1487ce (0x55bbd034b7ce in /ext3/miniconda3/bin/python3.9)
gr062: frame #11: <unknown function> + 0x1487bb (0x55bbd034b7bb in /ext3/miniconda3/bin/python3.9)
gr062: frame #12: <unknown function> + 0x1487bb (0x55bbd034b7bb in /ext3/miniconda3/bin/python3.9)
gr062: frame #13: <unknown function> + 0x1487bb (0x55bbd034b7bb in /ext3/miniconda3/bin/python3.9)
gr062: frame #14: <unknown function> + 0x1487bb (0x55bbd034b7bb in /ext3/miniconda3/bin/python3.9)
gr062: frame #15: <unknown function> + 0x11c661 (0x55bbd031f661 in /ext3/miniconda3/bin/python3.9)
gr062: frame #16: PyDict_SetItemString + 0x4a (0x55bbd032581a in /ext3/miniconda3/bin/python3.9)
gr062: frame #17: <unknown function> + 0x214aec (0x55bbd0417aec in /ext3/miniconda3/bin/python3.9)
gr062: frame #18: Py_FinalizeEx + 0x186 (0x55bbd0416f56 in /ext3/miniconda3/bin/python3.9)
gr062: frame #19: Py_RunMain + 0x112 (0x55bbd040a2b2 in /ext3/miniconda3/bin/python3.9)
gr062: frame #20: Py_BytesMain + 0x39 (0x55bbd03dcb79 in /ext3/miniconda3/bin/python3.9)
gr062: frame #21: __libc_start_main + 0xf3 (0x7f7b5cb060b3 in /lib/x86_64-linux-gnu/libc.so.6)
gr062: frame #22: <unknown function> + 0x1d9a81 (0x55bbd03dca81 in /ext3/miniconda3/bin/python3.9)

@asaparov Okay, at least this is reproducible, thanks.

mayank31398 commented 2 years ago

I am not sure why I am getting the same error ^^ for batch size = 1. @pai4451 Any pointers?

pai4451 commented 2 years ago

I am not sure why I am getting the same error ^^ for batch size = 1. @pai4451 Any pointers?

What are your CUDA and DeepSpeed versions? I personally have CUDA 11.5 and DeepSpeed 0.7.0 installed from the ds-inference/bloom-fix branch, and I can run BLOOM inference with batch size 1 on two nodes.
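
As an aside, when comparing environments across machines it can help to print what each process actually sees; a minimal sketch (the output labels are arbitrary):

# Minimal sketch: print the PyTorch, CUDA, NCCL, and DeepSpeed versions
# visible to the running process, for comparing environments across nodes.
import torch
import deepspeed

print("torch    :", torch.__version__)
print("CUDA     :", torch.version.cuda)
print("NCCL     :", torch.cuda.nccl.version())
print("deepspeed:", deepspeed.__version__)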

mayank31398 commented 2 years ago

I am not sure why I am getting the same error ^^ for batch size = 1. @pai4451 Any pointers?

What are your CUDA and DeepSpeed versions? I personally have CUDA 11.5 and DeepSpeed 0.7.0 installed from the ds-inference/bloom-fix branch, and I can run BLOOM inference with batch size 1 on two nodes.

I am using CUDA 11.6, and DeepSpeed is built from master.

asaparov commented 2 years ago

@mayank31398 Perhaps try the ds-inference/bloom-fix branch of deepspeed?

mayank31398 commented 2 years ago

@mayank31398 Perhaps try the ds-inference/bloom-fix branch of deepspeed?

I'll try this today, thanks.

asaparov commented 2 years ago

Actually, I just tried running with larger batch sizes (16 and 32) and did not run into the "CUDA illegal memory access" error (as I did with batch size 2). Maybe it is intermittent? Or maybe something is wrong with batch size 2 specifically.

pohunghuang-nctu commented 2 years ago

Actually, I just tried running with larger batch sizes (16 and 32) and did not run into the "CUDA illegal memory access" error (as I did with batch size 2). Maybe it is intermittent? Or maybe something is wrong with batch size 2 specifically.

We (with @pai4451) tried batch_size from 8 down to 2, and all of them failed, but we have not yet tried batch_size > 8. Pai will test that today to see what happens on our side.

pai4451 commented 2 years ago

@asaparov I tried the inference script with batch sizes 1, 2, 4, 8, 16, 32, 64, and 128. Only batch sizes 1 and 32 work, which is a bit surprising. Anyway, we'll have to wait for someone to fix the issue in this repo.

RezaYazdaniAminabadi commented 2 years ago

Hi all,

There are some new changes merged into DeepSpeed master. Would you mind trying them? I have tried batch sizes 1 and 128, and both work on my side (I ran on 8x A100 80GB). I will try on A100 40GB as well to make sure all is fine. Also, you can now generate MP-sharded checkpoints to load the model much faster. You can find more information in this PR: https://github.com/microsoft/DeepSpeed/pull/2132

Thanks, Reza

pohunghuang-nctu commented 2 years ago

@RezaYazdaniAminabadi could you give us a hint (or point us to the docs) about how to "generate MP-sharded checkpoints"? So far we only have the 70 .bin files downloaded from Hugging Face. Do you mean there is a tool that re-shards these 70 files into world-size pieces to speed up model loading? Thanks in advance.

RezaYazdaniAminabadi commented 2 years ago

Hi @pohunghuang-nctu

Sure, you need to pass save_mp_checkpoint_path to the init_inference method in order to save the TP-sharded checkpoints to the path you specify. After loading the original checkpoint, you will see DeepSpeed start saving the new checkpoints, and you will eventually have the TP-sharded checkpoints. In addition, a JSON config file (like bloom_ds-inference-config.json) is saved to that path, which you can pass as the checkpoint argument to init_inference on the next run. Note that you can remove save_mp_checkpoint_path after the TP-sharded checkpoints have been saved once, so that DeepSpeed doesn't save a new checkpoint every time.

Best, Reza
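
Below is a minimal sketch of the two-phase flow described above; every path, the mp_size value, and the dtype are placeholders rather than values taken from bloom-ds-inference.py:

# Sketch only: the first run writes TP-sharded checkpoints plus a JSON config,
# later runs load the shards directly. All paths and sizes are placeholders.
import os
import torch
import deepspeed
from transformers import AutoConfig, AutoModelForCausalLM

world_size = int(os.getenv("WORLD_SIZE", "8"))

# Build the model without materializing weights, as the inference script does.
with deepspeed.OnDevice(dtype=torch.float16, device="meta"):
    model = AutoModelForCausalLM.from_config(
        AutoConfig.from_pretrained("bigscience/bloom")
    )

# First run: `checkpoint` describes the original (un-sharded) weights and
# `save_mp_checkpoint_path` is where DeepSpeed writes the TP-sharded copies.
ds_model = deepspeed.init_inference(
    model,
    mp_size=world_size,
    dtype=torch.float16,
    replace_with_kernel_inject=True,
    checkpoint="checkpoints.json",                 # placeholder: lists the HF .bin files
    save_mp_checkpoint_path="/path/to/tp_sharded", # placeholder output directory
)

# Later runs: drop save_mp_checkpoint_path and point `checkpoint` at the JSON
# config written above (e.g. the bloom_ds-inference-config.json mentioned in
# the comment), so the pre-sharded weights load directly.
# ds_model = deepspeed.init_inference(
#     model,
#     mp_size=world_size,
#     dtype=torch.float16,
#     replace_with_kernel_inject=True,
#     checkpoint="/path/to/tp_sharded/bloom_ds-inference-config.json",
# )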

zcrypt0 commented 2 years ago

@RezaYazdaniAminabadi I was testing with the newly merged code last night but still hit the illegal memory accesses intermittently at the larger batch sizes. It wasn't like rolling dice, though: it would work for about half an hour, then stop working for another block of time, and then start working again.

For the first time I was able to use some larger batch sizes though (at least part of the time), so something seems to have improved.

EDIT: these tests were on 8x A100 80GB

RezaYazdaniAminabadi commented 2 years ago

I am glad you could run it with larger batches now! :) I think this might be related to some cache-allocation issues. We are working on optimizing that part too.

pai4451 commented 2 years ago

@RezaYazdaniAminabadi I used the master branch of DeepSpeed to run the inference script, but the illegal memory access still occurs with batch size 1 when the input prompt is long. For larger batch sizes, I can run inference from 8 up to 32, but somehow the illegal memory error appears for batch sizes 2 and 4.