Open asaparov opened 2 years ago
@stas00
Yeah, I get that too when I try to use too large a batch size. But if you're running my script, its default is bs=1, so that shouldn't really be a problem. I haven't tried it on your setup, but the issue is on the DS-Inference side.
@RezaYazdaniAminabadi, as you can see, both I and many others run into this issue. Could we change the kernel code to be more defensive? It's always the same group.allreduce([tensor], opts) call where it happens.
Hi @stas00 ,
Thanks for tagging me here. I will definitely look into this and try to fix it soon.
Best, Reza
@asaparov, please run the following 2 experiments:
1. Run with CUDA_LAUNCH_BLOCKING=1, as in:
CUDA_LAUNCH_BLOCKING=1 deepspeed --hostfile=$hostfile Megatron-DeepSpeed/scripts/inference/bloom-ds-inference.py --name bigscience/bloom
and let's see if it starts working.
2. Run bigscience/bloom-1b3, without CUDA_LAUNCH_BLOCKING=1 this time. That is:
deepspeed --hostfile=$hostfile Megatron-DeepSpeed/scripts/inference/bloom-ds-inference.py --name bigscience/bloom-1b3
Thank you!
@stas00 It seems to be working with CUDA_LAUNCH_BLOCKING=1!
I'll test with bigscience/bloom-1b3 next.
Thank you for reporting back, @asaparov! You can use this workaround for now, until the underlying issue is resolved; it will be just a tad slower. The difficulty is in reproducing it.
@RezaYazdaniAminabadi, so @asaparov's success with CUDA_LAUNCH_BLOCKING=1 points to some unsynchronized code in the kernels, as I proposed yesterday.
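For anyone scripting the two experiments above, here is a minimal sketch of building the launch command with and without the variable. CUDA_LAUNCH_BLOCKING=1 makes CUDA kernel launches synchronous, which is why it can mask a missing synchronization. The helper name build_launch and its parameters are hypothetical, and whether the variable reaches remote nodes depends on the launcher, which is not verified here.

```python
import os
import shlex


def build_launch(hostfile, model_name, blocking=True, base_env=None):
    """Return (argv, env) for the deepspeed launch used in this thread."""
    # Copy the environment so the caller's environment is never mutated.
    env = dict(os.environ if base_env is None else base_env)
    if blocking:
        # Forces synchronous kernel launches; slower, but errors surface
        # at the call that caused them.
        env["CUDA_LAUNCH_BLOCKING"] = "1"
    else:
        env.pop("CUDA_LAUNCH_BLOCKING", None)
    argv = [
        "deepspeed",
        f"--hostfile={hostfile}",
        "Megatron-DeepSpeed/scripts/inference/bloom-ds-inference.py",
        "--name",
        model_name,
    ]
    return argv, env


if __name__ == "__main__":
    argv, env = build_launch("hostfile", "bigscience/bloom")
    print("CUDA_LAUNCH_BLOCKING=" + env["CUDA_LAUNCH_BLOCKING"], shlex.join(argv))
```

The returned pair can be passed to subprocess.run(argv, env=env) to run either experiment.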
@stas00 Actually I just tested both bigscience/bloom and bigscience/bloom-1b3 without CUDA_LAUNCH_BLOCKING=1 and they both work. This is probably because I pulled newer code from the bloom-inference branch of this repo (commit b76e516) and the code from the ds-inference/bloom-fix branch of DeepSpeed (commit f39c78f).
I had to fix a few bugs related to save_mp_checkpoint_path being set to False instead of None, but everything seems to work fine after that.
I suspect that the bug is intermittent, as it pops up in various situations and inconsistently. But if it works for you at the moment, that's great!
Yes, save_mp_checkpoint_path was just added and is still being fixed up.
It basically allows you to set the tp-sharded path, and then it'll save the new checkpoint there. The load time from it will be 1-2min instead of 10-20min. You may want to give it a try.
Once the checkpoint is created, you need to set parallelization="tp".
There are 2 new changes. The first is the addition of save_mp_checkpoint_path to save the tp-sharded weights on init:
kwargs["save_mp_checkpoint_path"] = checkpoint_dir
#checkpoints_json=None
model = deepspeed.init_inference(model,
                                 mp_size=world_size,
                                 dtype=torch.half,
                                 checkpoint=checkpoints_json,
                                 **kwargs,
                                 )
and the addition of parallelization in the checkpoint json format
checkpoint_type = "tp"
checkpoint_dir = "/home/nicolas_huggingface_co/src/Megatron-DeepSpeed/bloom-tp"
checkpoint_files = glob.glob(f"{checkpoint_dir}/*pt")
if len(checkpoint_files) == 0:
    # hf checkpoint
    checkpoint_files = get_checkpoint_files(model_name)
    checkpoint_type = "pp" # normal hf hub checkpoint
if rank == 0:
    print("Checkpoint files:", checkpoint_files)
    print("Checkpoint type:", checkpoint_type)
checkpoints_json = "checkpoints.json"
def write_checkponts_json():
    with io.open(checkpoints_json, 'w', encoding='utf-8') as f:
        data = {
            "type": "BLOOM-176B",
            "checkpoints": checkpoint_files,
            "version": 1.0,
            "parallelization": checkpoint_type,
        }
The 2 possible values are pp (normal hf checkpoint) and tp (tp-sharded checkpoint).
I will make it all configurable once the dust settles.
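As a rough, self-contained sketch of the checkpoint-selection logic described above (not the actual script): it prefers tp-sharded *pt files when they exist and otherwise falls back to the hub files, then writes the JSON with the parallelization field. The function name build_checkpoints_json, the hub_files parameter, the sorted() call, and the final json.dump step are assumptions added for illustration.

```python
import glob
import json
import os


def build_checkpoints_json(checkpoint_dir, hub_files, out_path="checkpoints.json"):
    """Write the checkpoint description file and return it as a dict."""
    # Prefer tp-sharded files if a previous run already saved them.
    files = sorted(glob.glob(os.path.join(checkpoint_dir, "*pt")))
    parallelization = "tp"
    if not files:
        # Fall back to the normal hf hub checkpoint files.
        files = list(hub_files)
        parallelization = "pp"
    data = {
        "type": "BLOOM-176B",
        "checkpoints": files,
        "version": 1.0,
        "parallelization": parallelization,
    }
    with open(out_path, "w", encoding="utf-8") as f:
        json.dump(data, f)
    return data
```

On the first run the directory is empty, so the file records "pp"; once the sharded weights have been saved there, the same call records "tp".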
Hi @asaparov
It's great to see your issue is solved. As @stas00 mentioned the part regarding the new checkpoint loading is coming soon too. @stas00, thanks for full details here :)
Best, Reza
@asaparov Can you share your code for BLOOM inference, or tell me which inference repo you used and whether you made any code modifications? I have the same hardware setup as yours, but I can't get rid of the CUDA errors even after adding CUDA_LAUNCH_BLOCKING=1. I used the inference code on the bloom-inference branch and the DeepSpeed ds-inference/bloom-fix branch. Also, did you set the environment variable WORLD_SIZE?
@pai4451 I didn't change any code from this repo at all. I followed the installation instructions in the readme. I invoke the inference script using:
deepspeed --hostfile=$hostfile Megatron-DeepSpeed/scripts/inference/bloom-ds-inference.py --name bigscience/bloom
I'm running everything in a conda environment in a singularity container. The output of conda list is:
Singularity> conda list
# packages in environment at /ext3/miniconda3:
#
# Name Version Build Channel
_libgcc_mutex 0.1 conda_forge conda-forge
_openmp_mutex 4.5 2_kmp_llvm conda-forge
absl-py 1.2.0 pypi_0 pypi
aiohttp 3.8.1 pypi_0 pypi
aiosignal 1.2.0 pypi_0 pypi
apex 0.1 pypi_0 pypi
appdirs 1.4.4 pypi_0 pypi
async-timeout 4.0.2 pypi_0 pypi
attrs 21.4.0 pypi_0 pypi
black 21.4b0 pypi_0 pypi
blas 2.115 mkl conda-forge
blas-devel 3.9.0 15_linux64_mkl conda-forge
brotlipy 0.7.0 py39hb9d737c_1004 conda-forge
bzip2 1.0.8 h7f98852_4 conda-forge
ca-certificates 2022.6.15 ha878542_0 conda-forge
cachetools 5.2.0 pypi_0 pypi
certifi 2022.6.15 py39hf3d152e_0 conda-forge
cffi 1.15.1 py39he91dace_0 conda-forge
charset-normalizer 2.1.0 pyhd8ed1ab_0 conda-forge
click 8.1.3 pypi_0 pypi
colorama 0.4.5 pyhd8ed1ab_0 conda-forge
conda 4.13.0 py39hf3d152e_1 conda-forge
conda-package-handling 1.8.1 py39hb9d737c_1 conda-forge
cryptography 37.0.4 py39hd97740a_0 conda-forge
cudatoolkit 11.6.0 hecad31d_10 conda-forge
datasets 2.4.0 pypi_0 pypi
deepspeed 0.7.0+f39c78f9 dev_0 <develop>
dill 0.3.5.1 pypi_0 pypi
filelock 3.7.1 pypi_0 pypi
frozenlist 1.3.0 pypi_0 pypi
fsspec 2022.5.0 pypi_0 pypi
google-auth 2.9.1 pypi_0 pypi
google-auth-oauthlib 0.4.6 pypi_0 pypi
grpcio 1.47.0 pypi_0 pypi
hjson 3.0.2 pypi_0 pypi
huggingface-hub 0.8.1 pypi_0 pypi
idna 3.3 pyhd8ed1ab_0 conda-forge
importlib-metadata 4.12.0 pypi_0 pypi
isort 5.10.1 pypi_0 pypi
joblib 1.1.0 pypi_0 pypi
ld_impl_linux-64 2.36.1 hea4e1c9_2 conda-forge
libaio 0.3.113 h5eee18b_0 <unknown>
libblas 3.9.0 15_linux64_mkl conda-forge
libcblas 3.9.0 15_linux64_mkl conda-forge
libffi 3.4.2 h7f98852_5 conda-forge
libgcc-ng 12.1.0 h8d9b700_16 conda-forge
libgfortran-ng 12.1.0 h69a702a_16 conda-forge
libgfortran5 12.1.0 hdcd56e2_16 conda-forge
libgomp 12.1.0 h8d9b700_16 conda-forge
liblapack 3.9.0 15_linux64_mkl conda-forge
liblapacke 3.9.0 15_linux64_mkl conda-forge
libnsl 2.0.0 h7f98852_0 conda-forge
libstdcxx-ng 12.1.0 ha89aaad_16 conda-forge
libuuid 2.32.1 h7f98852_1000 conda-forge
libzlib 1.2.12 h166bdaf_2 conda-forge
llvm-openmp 14.0.4 he0ac6c6_0 conda-forge
markdown 3.4.1 pypi_0 pypi
markupsafe 2.1.1 pypi_0 pypi
mkl 2022.1.0 h84fe81f_915 conda-forge
mkl-devel 2022.1.0 ha770c72_916 conda-forge
mkl-include 2022.1.0 h84fe81f_915 conda-forge
multidict 6.0.2 pypi_0 pypi
multiprocess 0.70.13 pypi_0 pypi
mypy-extensions 0.4.3 pypi_0 pypi
ncurses 6.3 h27087fc_1 conda-forge
ninja 1.10.2.3 pypi_0 pypi
nltk 3.7 pypi_0 pypi
numpy 1.23.1 pypi_0 pypi
oauthlib 3.2.0 pypi_0 pypi
openssl 1.1.1q h166bdaf_0 conda-forge
packaging 21.3 pypi_0 pypi
pandas 1.4.3 pypi_0 pypi
parameterized 0.8.1 pypi_0 pypi
pathspec 0.9.0 pypi_0 pypi
pip 22.2 pyhd8ed1ab_0 conda-forge
protobuf 3.19.4 pypi_0 pypi
psutil 5.9.1 pypi_0 pypi
py-cpuinfo 8.0.0 pypi_0 pypi
pyarrow 8.0.0 pypi_0 pypi
pyasn1 0.4.8 pypi_0 pypi
pyasn1-modules 0.2.8 pypi_0 pypi
pybind11 2.10.0 pypi_0 pypi
pycosat 0.6.3 py39hb9d737c_1010 conda-forge
pycparser 2.21 pyhd8ed1ab_0 conda-forge
pydantic 1.9.1 pypi_0 pypi
pyopenssl 22.0.0 pyhd8ed1ab_0 conda-forge
pyparsing 3.0.9 pypi_0 pypi
pysocks 1.7.1 py39hf3d152e_5 conda-forge
python 3.9.13 h9a8a25e_0_cpython conda-forge
python-dateutil 2.8.2 pypi_0 pypi
python_abi 3.9 2_cp39 conda-forge
pytorch 1.12.0 py3.9_cuda11.6_cudnn8.3.2_0 pytorch
pytorch-mutex 1.0 cuda pytorch
pytz 2022.1 pypi_0 pypi
pyyaml 6.0 pypi_0 pypi
readline 8.1.2 h0f457ee_0 conda-forge
regex 2022.7.25 pypi_0 pypi
requests 2.28.1 pyhd8ed1ab_0 conda-forge
requests-oauthlib 1.3.1 pypi_0 pypi
responses 0.18.0 pypi_0 pypi
rsa 4.9 pypi_0 pypi
ruamel_yaml 0.15.80 py39hb9d737c_1007 conda-forge
setuptools 63.2.0 py39hf3d152e_0 conda-forge
six 1.16.0 pyh6c4a22f_0 conda-forge
sqlite 3.39.2 h4ff8645_0 conda-forge
tbb 2021.5.0 h924138e_1 conda-forge
tensorboard 2.9.1 pypi_0 pypi
tensorboard-data-server 0.6.1 pypi_0 pypi
tensorboard-plugin-wit 1.8.1 pypi_0 pypi
tk 8.6.12 h27826a3_0 conda-forge
tokenizers 0.12.1 pypi_0 pypi
toml 0.10.2 pypi_0 pypi
tqdm 4.64.0 pyhd8ed1ab_0 conda-forge
transformers 4.20.1 pypi_0 pypi
typing_extensions 4.3.0 pyha770c72_0 conda-forge
tzdata 2022a h191b570_0 conda-forge
urllib3 1.26.11 pyhd8ed1ab_0 conda-forge
werkzeug 2.2.0 pypi_0 pypi
wheel 0.37.1 pyhd8ed1ab_0 conda-forge
xxhash 3.0.0 pypi_0 pypi
xz 5.2.5 h516909a_1 conda-forge
yaml 0.2.5 h7f98852_2 conda-forge
yarl 1.7.2 pypi_0 pypi
zipp 3.8.1 pypi_0 pypi
zlib 1.2.12 h166bdaf_2 conda-forge
For this repo and deepspeed, I'm using the commits that I mention above. I had a few errors from deepspeed complaining about save_mp_checkpoint_path, which I fixed with the following changes:
diff --git a/deepspeed/__init__.py b/deepspeed/__init__.py
index 655d7a96..50049a2a 100755
--- a/deepspeed/__init__.py
+++ b/deepspeed/__init__.py
@@ -239,7 +239,7 @@ def init_inference(model,
moe_type='standard',
args=None,
enable_cuda_graph=False,
- save_mp_checkpoint_path=False):
+ save_mp_checkpoint_path=None):
"""Initialize the DeepSpeed InferenceEngine.
Arguments:
diff --git a/deepspeed/inference/engine.py b/deepspeed/inference/engine.py
index b5841dab..f380cd21 100755
--- a/deepspeed/inference/engine.py
+++ b/deepspeed/inference/engine.py
@@ -50,7 +50,7 @@ class InferenceEngine(Module):
moe_type='standard',
config=None,
enable_cuda_graph=False,
- save_mp_checkpoint_path=False):
+ save_mp_checkpoint_path=None):
"""
Args:
model: torch.nn.Module
@@ -322,7 +322,7 @@ class InferenceEngine(Module):
moe_type='standard',
training_mp_size=1,
checkpoint_dir=None,
- save_mp_checkpoint_path=False):
+ save_mp_checkpoint_path=None):
checkpoint, ckpt_type = SDLoaderFactory.get_sd_loader_json(
checkpoint_dir) if checkpoint_dir is not None else (None, None)
replace_transformer_layer(client_module,
I also had to make a few other edits to deepspeed since I wanted each worker to run within the singularity container, and to prevent ssh from complaining about host key authentication (I'm running this on a cluster).
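To illustrate why the default in the diff above matters, here is one plausible failure mode, sketched under the assumption that the library decides whether to save by testing the argument against None. The exact check in DeepSpeed at that commit is not shown in this thread, and wants_checkpoint_save is a hypothetical stand-in.

```python
def wants_checkpoint_save(save_mp_checkpoint_path=None):
    # Hypothetical check: any non-None value counts as "a path was given".
    return save_mp_checkpoint_path is not None


# A default of False passes the `is not None` test, so the code would
# try to use False as a path.
print(wants_checkpoint_save(False))            # True
print(wants_checkpoint_save(None))             # False
print(wants_checkpoint_save("/tmp/bloom-tp"))  # True
```

With None as the default, "not set" and "set to a path" become distinguishable again.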
@asaparov Thanks for the details. I can finally run BLOOM inference with DeepSpeed on multiple nodes now. However, it only works for batch_size=1, and when I increase the batch size, the error message RuntimeError: CUDA error: an illegal memory access was encountered is thrown again. Do you have the same issue, or can you run inference with a batch size greater than 1 on your side? Thank you.
Hmm, it's not working for me even within a single node with batch size = 1, on 8x A100 80GB. Same CUDA illegal memory access error.
Check whether "NCCL WARN Call to ibv_reg_mr failed" appears in your log. In our case, we modified /etc/security/limits.conf to resolve it. You can find details here: https://docs.nvidia.com/deeplearning/nccl/user-guide/docs/troubleshooting.html
But it also only works for batch size == 1.
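A quick way to check the relevant limit from Python, assuming the limits.conf change in question is the locked-memory (memlock) limit, which is my reading of the linked guide rather than something stated in this comment. The helper name memlock_is_unlimited is hypothetical; resource is the standard-library module on Linux.

```python
import resource


def memlock_is_unlimited(limits=None):
    """Return True if both soft and hard RLIMIT_MEMLOCK are unlimited."""
    if limits is None:
        # Query the limits of the current process.
        limits = resource.getrlimit(resource.RLIMIT_MEMLOCK)
    soft, hard = limits
    return soft == resource.RLIM_INFINITY and hard == resource.RLIM_INFINITY


if __name__ == "__main__":
    soft, hard = resource.getrlimit(resource.RLIMIT_MEMLOCK)
    print("memlock soft/hard:", soft, hard, "unlimited:", memlock_is_unlimited())
```

Run it inside the same container or shell that launches the workers, since the limit is per process and inherited from the launcher.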
@pohunghuang-nctu Nothing like that in my logs. This is the full trace:
[2022-07-26 11:41:08,472] [WARNING] [runner.py:159:fetch_hostfile] Unable to find hostfile, will proceed with training with local resources only.
[2022-07-26 11:41:11,508] [INFO] [runner.py:457:main] cmd = /net/llm-shared-nfs/nfs/yelkurdi/conda/miniconda3/envs/bloom/bin/python -u -m deepspeed.launcher.launch --world_info=eyJsb2NhbGhvc3QiOiBbMCwgMSwgMiwgMywgNCwgNSwgNiwgN119 --master_addr=127.0.0.1 --master_port=29500 scripts/inference/bloom-ds-inference.py --name bigscience/bloom --benchmark
[2022-07-26 11:41:12,431] [INFO] [launch.py:96:main] 0 NCCL_IB_DISABLE=1
[2022-07-26 11:41:12,431] [INFO] [launch.py:96:main] 0 NCCL_DEBUG=INFO
[2022-07-26 11:41:12,431] [INFO] [launch.py:103:main] WORLD INFO DICT: {'localhost': [0, 1, 2, 3, 4, 5, 6, 7]}
[2022-07-26 11:41:12,431] [INFO] [launch.py:109:main] nnodes=1, num_local_procs=8, node_rank=0
[2022-07-26 11:41:12,431] [INFO] [launch.py:122:main] global_rank_mapping=defaultdict(<class 'list'>, {'localhost': [0, 1, 2, 3, 4, 5, 6, 7]})
[2022-07-26 11:41:12,431] [INFO] [launch.py:123:main] dist_world_size=8
[2022-07-26 11:41:12,431] [INFO] [launch.py:125:main] Setting CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7
[2022-07-26 11:41:13,715] [INFO] [comm.py:423:init_distributed] Initializing TorchBackend in DeepSpeed with backend nccl
*** Loading the model bigscience/bloom
[2022-07-26 11:41:22,608] [INFO] [utils.py:827:see_memory_usage] pre-from-pretrained
[2022-07-26 11:41:22,608] [INFO] [utils.py:828:see_memory_usage] MA 0.0 GB Max_MA 0.0 GB CA 0.0 GB Max_CA 0 GB
[2022-07-26 11:41:22,608] [INFO] [utils.py:836:see_memory_usage] CPU Virtual Memory: used = 11.2 GB, percent = 0.9%
[2022-07-26 11:41:22,745] [INFO] [utils.py:827:see_memory_usage] post-from-pretrained
[2022-07-26 11:41:22,746] [INFO] [utils.py:828:see_memory_usage] MA 0.0 GB Max_MA 0.0 GB CA 0.0 GB Max_CA 0 GB
[2022-07-26 11:41:22,746] [INFO] [utils.py:836:see_memory_usage] CPU Virtual Memory: used = 11.21 GB, percent = 0.9%
[2022-07-26 11:41:22,795] [INFO] [utils.py:827:see_memory_usage] post-init-ds-zero-init
[2022-07-26 11:41:22,795] [INFO] [utils.py:828:see_memory_usage] MA 0.0 GB Max_MA 0.0 GB CA 0.0 GB Max_CA 0 GB
[2022-07-26 11:41:22,796] [INFO] [utils.py:836:see_memory_usage] CPU Virtual Memory: used = 11.27 GB, percent = 0.9%
llm-test-cluster-9:1281341:1281341 [0] NCCL INFO Bootstrap : Using eth0:10.241.128.4<0>
llm-test-cluster-9:1281341:1281341 [0] NCCL INFO NET/Plugin : No plugin found (libnccl-net.so), using internal implementation
llm-test-cluster-9:1281341:1281341 [0] NCCL INFO NCCL_IB_DISABLE set by environment to 1.
llm-test-cluster-9:1281341:1281341 [0] NCCL INFO NET/Socket : Using [0]eth0:10.241.128.4<0> [1]eth1:10.241.129.13<0>
llm-test-cluster-9:1281341:1281341 [0] NCCL INFO Using network Socket
NCCL version 2.10.3+cuda11.6
llm-test-cluster-9:1281345:1281345 [4] NCCL INFO Bootstrap : Using eth0:10.241.128.4<0>
llm-test-cluster-9:1281343:1281343 [2] NCCL INFO Bootstrap : Using eth0:10.241.128.4<0>
llm-test-cluster-9:1281342:1281342 [1] NCCL INFO Bootstrap : Using eth0:10.241.128.4<0>
llm-test-cluster-9:1281344:1281344 [3] NCCL INFO Bootstrap : Using eth0:10.241.128.4<0>
llm-test-cluster-9:1281347:1281347 [6] NCCL INFO Bootstrap : Using eth0:10.241.128.4<0>
llm-test-cluster-9:1281346:1281346 [5] NCCL INFO Bootstrap : Using eth0:10.241.128.4<0>
llm-test-cluster-9:1281348:1281348 [7] NCCL INFO Bootstrap : Using eth0:10.241.128.4<0>
llm-test-cluster-9:1281347:1281347 [6] NCCL INFO NET/Plugin : No plugin found (libnccl-net.so), using internal implementation
llm-test-cluster-9:1281342:1281342 [1] NCCL INFO NET/Plugin : No plugin found (libnccl-net.so), using internal implementation
llm-test-cluster-9:1281346:1281346 [5] NCCL INFO NET/Plugin : No plugin found (libnccl-net.so), using internal implementation
llm-test-cluster-9:1281348:1281348 [7] NCCL INFO NET/Plugin : No plugin found (libnccl-net.so), using internal implementation
llm-test-cluster-9:1281345:1281345 [4] NCCL INFO NET/Plugin : No plugin found (libnccl-net.so), using internal implementation
llm-test-cluster-9:1281344:1281344 [3] NCCL INFO NET/Plugin : No plugin found (libnccl-net.so), using internal implementation
llm-test-cluster-9:1281343:1281343 [2] NCCL INFO NET/Plugin : No plugin found (libnccl-net.so), using internal implementation
llm-test-cluster-9:1281342:1281342 [1] NCCL INFO NCCL_IB_DISABLE set by environment to 1.
llm-test-cluster-9:1281346:1281346 [5] NCCL INFO NCCL_IB_DISABLE set by environment to 1.
llm-test-cluster-9:1281347:1281347 [6] NCCL INFO NCCL_IB_DISABLE set by environment to 1.
llm-test-cluster-9:1281345:1281345 [4] NCCL INFO NCCL_IB_DISABLE set by environment to 1.
llm-test-cluster-9:1281344:1281344 [3] NCCL INFO NCCL_IB_DISABLE set by environment to 1.
llm-test-cluster-9:1281348:1281348 [7] NCCL INFO NCCL_IB_DISABLE set by environment to 1.
llm-test-cluster-9:1281343:1281343 [2] NCCL INFO NCCL_IB_DISABLE set by environment to 1.
llm-test-cluster-9:1281346:1281346 [5] NCCL INFO NET/Socket : Using [0]eth0:10.241.128.4<0> [1]eth1:10.241.129.13<0>
llm-test-cluster-9:1281346:1281346 [5] NCCL INFO Using network Socket
llm-test-cluster-9:1281344:1281344 [3] NCCL INFO NET/Socket : Using [0]eth0:10.241.128.4<0> [1]eth1:10.241.129.13<0>
llm-test-cluster-9:1281348:1281348 [7] NCCL INFO NET/Socket : Using [0]eth0:10.241.128.4<0> [1]eth1:10.241.129.13<0>
llm-test-cluster-9:1281347:1281347 [6] NCCL INFO NET/Socket : Using [0]eth0:10.241.128.4<0> [1]eth1:10.241.129.13<0>
llm-test-cluster-9:1281342:1281342 [1] NCCL INFO NET/Socket : Using [0]eth0:10.241.128.4<0> [1]eth1:10.241.129.13<0>
llm-test-cluster-9:1281343:1281343 [2] NCCL INFO NET/Socket : Using [0]eth0:10.241.128.4<0> [1]eth1:10.241.129.13<0>
llm-test-cluster-9:1281345:1281345 [4] NCCL INFO NET/Socket : Using [0]eth0:10.241.128.4<0> [1]eth1:10.241.129.13<0>
llm-test-cluster-9:1281347:1281347 [6] NCCL INFO Using network Socket
llm-test-cluster-9:1281342:1281342 [1] NCCL INFO Using network Socket
llm-test-cluster-9:1281344:1281344 [3] NCCL INFO Using network Socket
llm-test-cluster-9:1281348:1281348 [7] NCCL INFO Using network Socket
llm-test-cluster-9:1281343:1281343 [2] NCCL INFO Using network Socket
llm-test-cluster-9:1281345:1281345 [4] NCCL INFO Using network Socket
llm-test-cluster-9:1281344:1281405 [3] NCCL INFO Trees [0] 4/-1/-1->3->2 [1] 4/-1/-1->3->2 [2] 4/-1/-1->3->2 [3] 4/-1/-1->3->2 [4] 4/-1/-1->3->2 [5] 4/-1/-1->3->2 [6] 4/-1/-1->3->2 [7] 4/-1/-1->3->2 [8] 4/-1/-1->3->2 [9] 4/-1/-1->3->2 [10] 4/-1/-1->3->2 [11] 4/-1/-1->3->2 [12] 4/-1/-1->3->2 [13] 4/-1/-1->3->2 [14] 4/-1/-1->3->2 [15] 4/-1/-1->3->2 [16] 4/-1/-1->3->2 [17] 4/-1/-1->3->2 [18] 4/-1/-1->3->2 [19] 4/-1/-1->3->2 [20] 4/-1/-1->3->2 [21] 4/-1/-1->3->2 [22] 4/-1/-1->3->2 [23] 4/-1/-1->3->2
llm-test-cluster-9:1281345:1281410 [4] NCCL INFO Trees [0] 5/-1/-1->4->3 [1] 5/-1/-1->4->3 [2] 5/-1/-1->4->3 [3] 5/-1/-1->4->3 [4] 5/-1/-1->4->3 [5] 5/-1/-1->4->3 [6] 5/-1/-1->4->3 [7] 5/-1/-1->4->3 [8] 5/-1/-1->4->3 [9] 5/-1/-1->4->3 [10] 5/-1/-1->4->3 [11] 5/-1/-1->4->3 [12] 5/-1/-1->4->3 [13] 5/-1/-1->4->3 [14] 5/-1/-1->4->3 [15] 5/-1/-1->4->3 [16] 5/-1/-1->4->3 [17] 5/-1/-1->4->3 [18] 5/-1/-1->4->3 [19] 5/-1/-1->4->3 [20] 5/-1/-1->4->3 [21] 5/-1/-1->4->3 [22] 5/-1/-1->4->3 [23] 5/-1/-1->4->3
llm-test-cluster-9:1281346:1281404 [5] NCCL INFO Trees [0] 6/-1/-1->5->4 [1] 6/-1/-1->5->4 [2] 6/-1/-1->5->4 [3] 6/-1/-1->5->4 [4] 6/-1/-1->5->4 [5] 6/-1/-1->5->4 [6] 6/-1/-1->5->4 [7] 6/-1/-1->5->4 [8] 6/-1/-1->5->4 [9] 6/-1/-1->5->4 [10] 6/-1/-1->5->4 [11] 6/-1/-1->5->4 [12] 6/-1/-1->5->4 [13] 6/-1/-1->5->4 [14] 6/-1/-1->5->4 [15] 6/-1/-1->5->4 [16] 6/-1/-1->5->4 [17] 6/-1/-1->5->4 [18] 6/-1/-1->5->4 [19] 6/-1/-1->5->4 [20] 6/-1/-1->5->4 [21] 6/-1/-1->5->4 [22] 6/-1/-1->5->4 [23] 6/-1/-1->5->4
llm-test-cluster-9:1281347:1281407 [6] NCCL INFO Trees [0] 7/-1/-1->6->5 [1] 7/-1/-1->6->5 [2] 7/-1/-1->6->5 [3] 7/-1/-1->6->5 [4] 7/-1/-1->6->5 [5] 7/-1/-1->6->5 [6] 7/-1/-1->6->5 [7] 7/-1/-1->6->5 [8] 7/-1/-1->6->5 [9] 7/-1/-1->6->5 [10] 7/-1/-1->6->5 [11] 7/-1/-1->6->5 [12] 7/-1/-1->6->5 [13] 7/-1/-1->6->5 [14] 7/-1/-1->6->5 [15] 7/-1/-1->6->5 [16] 7/-1/-1->6->5 [17] 7/-1/-1->6->5 [18] 7/-1/-1->6->5 [19] 7/-1/-1->6->5 [20] 7/-1/-1->6->5 [21] 7/-1/-1->6->5 [22] 7/-1/-1->6->5 [23] 7/-1/-1->6->5
llm-test-cluster-9:1281348:1281409 [7] NCCL INFO Trees [0] -1/-1/-1->7->6 [1] -1/-1/-1->7->6 [2] -1/-1/-1->7->6 [3] -1/-1/-1->7->6 [4] -1/-1/-1->7->6 [5] -1/-1/-1->7->6 [6] -1/-1/-1->7->6 [7] -1/-1/-1->7->6 [8] -1/-1/-1->7->6 [9] -1/-1/-1->7->6 [10] -1/-1/-1->7->6 [11] -1/-1/-1->7->6 [12] -1/-1/-1->7->6 [13] -1/-1/-1->7->6 [14] -1/-1/-1->7->6 [15] -1/-1/-1->7->6 [16] -1/-1/-1->7->6 [17] -1/-1/-1->7->6 [18] -1/-1/-1->7->6 [19] -1/-1/-1->7->6 [20] -1/-1/-1->7->6 [21] -1/-1/-1->7->6 [22] -1/-1/-1->7->6 [23] -1/-1/-1->7->6
llm-test-cluster-9:1281343:1281406 [2] NCCL INFO Trees [0] 3/-1/-1->2->1 [1] 3/-1/-1->2->1 [2] 3/-1/-1->2->1 [3] 3/-1/-1->2->1 [4] 3/-1/-1->2->1 [5] 3/-1/-1->2->1 [6] 3/-1/-1->2->1 [7] 3/-1/-1->2->1 [8] 3/-1/-1->2->1 [9] 3/-1/-1->2->1 [10] 3/-1/-1->2->1 [11] 3/-1/-1->2->1 [12] 3/-1/-1->2->1 [13] 3/-1/-1->2->1 [14] 3/-1/-1->2->1 [15] 3/-1/-1->2->1 [16] 3/-1/-1->2->1 [17] 3/-1/-1->2->1 [18] 3/-1/-1->2->1 [19] 3/-1/-1->2->1 [20] 3/-1/-1->2->1 [21] 3/-1/-1->2->1 [22] 3/-1/-1->2->1 [23] 3/-1/-1->2->1
llm-test-cluster-9:1281342:1281408 [1] NCCL INFO Trees [0] 2/-1/-1->1->0 [1] 2/-1/-1->1->0 [2] 2/-1/-1->1->0 [3] 2/-1/-1->1->0 [4] 2/-1/-1->1->0 [5] 2/-1/-1->1->0 [6] 2/-1/-1->1->0 [7] 2/-1/-1->1->0 [8] 2/-1/-1->1->0 [9] 2/-1/-1->1->0 [10] 2/-1/-1->1->0 [11] 2/-1/-1->1->0 [12] 2/-1/-1->1->0 [13] 2/-1/-1->1->0 [14] 2/-1/-1->1->0 [15] 2/-1/-1->1->0 [16] 2/-1/-1->1->0 [17] 2/-1/-1->1->0 [18] 2/-1/-1->1->0 [19] 2/-1/-1->1->0 [20] 2/-1/-1->1->0 [21] 2/-1/-1->1->0 [22] 2/-1/-1->1->0 [23] 2/-1/-1->1->0
llm-test-cluster-9:1281341:1281403 [0] NCCL INFO Channel 00/24 : 0 1 2 3 4 5 6 7
llm-test-cluster-9:1281341:1281403 [0] NCCL INFO Channel 01/24 : 0 1 2 3 4 5 6 7
llm-test-cluster-9:1281341:1281403 [0] NCCL INFO Channel 02/24 : 0 1 2 3 4 5 6 7
llm-test-cluster-9:1281341:1281403 [0] NCCL INFO Channel 03/24 : 0 1 2 3 4 5 6 7
llm-test-cluster-9:1281341:1281403 [0] NCCL INFO Channel 04/24 : 0 1 2 3 4 5 6 7
llm-test-cluster-9:1281341:1281403 [0] NCCL INFO Channel 05/24 : 0 1 2 3 4 5 6 7
llm-test-cluster-9:1281341:1281403 [0] NCCL INFO Channel 06/24 : 0 1 2 3 4 5 6 7
llm-test-cluster-9:1281341:1281403 [0] NCCL INFO Channel 07/24 : 0 1 2 3 4 5 6 7
llm-test-cluster-9:1281341:1281403 [0] NCCL INFO Channel 08/24 : 0 1 2 3 4 5 6 7
llm-test-cluster-9:1281341:1281403 [0] NCCL INFO Channel 09/24 : 0 1 2 3 4 5 6 7
llm-test-cluster-9:1281341:1281403 [0] NCCL INFO Channel 10/24 : 0 1 2 3 4 5 6 7
llm-test-cluster-9:1281341:1281403 [0] NCCL INFO Channel 11/24 : 0 1 2 3 4 5 6 7
llm-test-cluster-9:1281341:1281403 [0] NCCL INFO Channel 12/24 : 0 1 2 3 4 5 6 7
llm-test-cluster-9:1281341:1281403 [0] NCCL INFO Channel 13/24 : 0 1 2 3 4 5 6 7
llm-test-cluster-9:1281341:1281403 [0] NCCL INFO Channel 14/24 : 0 1 2 3 4 5 6 7
llm-test-cluster-9:1281341:1281403 [0] NCCL INFO Channel 15/24 : 0 1 2 3 4 5 6 7
llm-test-cluster-9:1281341:1281403 [0] NCCL INFO Channel 16/24 : 0 1 2 3 4 5 6 7
llm-test-cluster-9:1281341:1281403 [0] NCCL INFO Channel 17/24 : 0 1 2 3 4 5 6 7
llm-test-cluster-9:1281341:1281403 [0] NCCL INFO Channel 18/24 : 0 1 2 3 4 5 6 7
llm-test-cluster-9:1281341:1281403 [0] NCCL INFO Channel 19/24 : 0 1 2 3 4 5 6 7
llm-test-cluster-9:1281341:1281403 [0] NCCL INFO Channel 20/24 : 0 1 2 3 4 5 6 7
llm-test-cluster-9:1281341:1281403 [0] NCCL INFO Channel 21/24 : 0 1 2 3 4 5 6 7
llm-test-cluster-9:1281341:1281403 [0] NCCL INFO Channel 22/24 : 0 1 2 3 4 5 6 7
llm-test-cluster-9:1281341:1281403 [0] NCCL INFO Channel 23/24 : 0 1 2 3 4 5 6 7
llm-test-cluster-9:1281341:1281403 [0] NCCL INFO Trees [0] 1/-1/-1->0->-1 [1] 1/-1/-1->0->-1 [2] 1/-1/-1->0->-1 [3] 1/-1/-1->0->-1 [4] 1/-1/-1->0->-1 [5] 1/-1/-1->0->-1 [6] 1/-1/-1->0->-1 [7] 1/-1/-1->0->-1 [8] 1/-1/-1->0->-1 [9] 1/-1/-1->0->-1 [10] 1/-1/-1->0->-1 [11] 1/-1/-1->0->-1 [12] 1/-1/-1->0->-1 [13] 1/-1/-1->0->-1 [14] 1/-1/-1->0->-1 [15] 1/-1/-1->0->-1 [16] 1/-1/-1->0->-1 [17] 1/-1/-1->0->-1 [18] 1/-1/-1->0->-1 [19] 1/-1/-1->0->-1 [20] 1/-1/-1->0->-1 [21] 1/-1/-1->0->-1 [22] 1/-1/-1->0->-1 [23] 1/-1/-1->0->-1
llm-test-cluster-9:1281344:1281405 [3] NCCL INFO Channel 00 : 3[40a0] -> 4[40b0] via P2P/IPC/read
llm-test-cluster-9:1281344:1281405 [3] NCCL INFO Channel 01 : 3[40a0] -> 4[40b0] via P2P/IPC/read
llm-test-cluster-9:1281344:1281405 [3] NCCL INFO Channel 02 : 3[40a0] -> 4[40b0] via P2P/IPC/read
llm-test-cluster-9:1281344:1281405 [3] NCCL INFO Channel 03 : 3[40a0] -> 4[40b0] via P2P/IPC/read
llm-test-cluster-9:1281344:1281405 [3] NCCL INFO Channel 04 : 3[40a0] -> 4[40b0] via P2P/IPC/read
llm-test-cluster-9:1281344:1281405 [3] NCCL INFO Channel 05 : 3[40a0] -> 4[40b0] via P2P/IPC/read
llm-test-cluster-9:1281348:1281409 [7] NCCL INFO Channel 00 : 7[40e0] -> 0[4070] via P2P/IPC/read
llm-test-cluster-9:1281344:1281405 [3] NCCL INFO Channel 06 : 3[40a0] -> 4[40b0] via P2P/IPC/read
llm-test-cluster-9:1281348:1281409 [7] NCCL INFO Channel 01 : 7[40e0] -> 0[4070] via P2P/IPC/read
llm-test-cluster-9:1281346:1281404 [5] NCCL INFO Channel 00 : 5[40c0] -> 6[40d0] via P2P/IPC/read
llm-test-cluster-9:1281344:1281405 [3] NCCL INFO Channel 07 : 3[40a0] -> 4[40b0] via P2P/IPC/read
llm-test-cluster-9:1281342:1281408 [1] NCCL INFO Channel 00 : 1[4080] -> 2[4090] via P2P/IPC/read
llm-test-cluster-9:1281348:1281409 [7] NCCL INFO Channel 02 : 7[40e0] -> 0[4070] via P2P/IPC/read
llm-test-cluster-9:1281346:1281404 [5] NCCL INFO Channel 01 : 5[40c0] -> 6[40d0] via P2P/IPC/read
llm-test-cluster-9:1281344:1281405 [3] NCCL INFO Channel 08 : 3[40a0] -> 4[40b0] via P2P/IPC/read
llm-test-cluster-9:1281342:1281408 [1] NCCL INFO Channel 01 : 1[4080] -> 2[4090] via P2P/IPC/read
llm-test-cluster-9:1281348:1281409 [7] NCCL INFO Channel 03 : 7[40e0] -> 0[4070] via P2P/IPC/read
llm-test-cluster-9:1281346:1281404 [5] NCCL INFO Channel 02 : 5[40c0] -> 6[40d0] via P2P/IPC/read
llm-test-cluster-9:1281344:1281405 [3] NCCL INFO Channel 09 : 3[40a0] -> 4[40b0] via P2P/IPC/read
llm-test-cluster-9:1281342:1281408 [1] NCCL INFO Channel 02 : 1[4080] -> 2[4090] via P2P/IPC/read
llm-test-cluster-9:1281348:1281409 [7] NCCL INFO Channel 04 : 7[40e0] -> 0[4070] via P2P/IPC/read
llm-test-cluster-9:1281346:1281404 [5] NCCL INFO Channel 03 : 5[40c0] -> 6[40d0] via P2P/IPC/read
llm-test-cluster-9:1281344:1281405 [3] NCCL INFO Channel 10 : 3[40a0] -> 4[40b0] via P2P/IPC/read
llm-test-cluster-9:1281342:1281408 [1] NCCL INFO Channel 03 : 1[4080] -> 2[4090] via P2P/IPC/read
llm-test-cluster-9:1281348:1281409 [7] NCCL INFO Channel 05 : 7[40e0] -> 0[4070] via P2P/IPC/read
llm-test-cluster-9:1281346:1281404 [5] NCCL INFO Channel 04 : 5[40c0] -> 6[40d0] via P2P/IPC/read
llm-test-cluster-9:1281344:1281405 [3] NCCL INFO Channel 11 : 3[40a0] -> 4[40b0] via P2P/IPC/read
llm-test-cluster-9:1281342:1281408 [1] NCCL INFO Channel 04 : 1[4080] -> 2[4090] via P2P/IPC/read
llm-test-cluster-9:1281348:1281409 [7] NCCL INFO Channel 06 : 7[40e0] -> 0[4070] via P2P/IPC/read
llm-test-cluster-9:1281346:1281404 [5] NCCL INFO Channel 05 : 5[40c0] -> 6[40d0] via P2P/IPC/read
llm-test-cluster-9:1281344:1281405 [3] NCCL INFO Channel 12 : 3[40a0] -> 4[40b0] via P2P/IPC/read
llm-test-cluster-9:1281342:1281408 [1] NCCL INFO Channel 05 : 1[4080] -> 2[4090] via P2P/IPC/read
llm-test-cluster-9:1281347:1281407 [6] NCCL INFO Channel 00 : 6[40d0] -> 7[40e0] via P2P/IPC/read
llm-test-cluster-9:1281348:1281409 [7] NCCL INFO Channel 07 : 7[40e0] -> 0[4070] via P2P/IPC/read
llm-test-cluster-9:1281346:1281404 [5] NCCL INFO Channel 06 : 5[40c0] -> 6[40d0] via P2P/IPC/read
llm-test-cluster-9:1281344:1281405 [3] NCCL INFO Channel 13 : 3[40a0] -> 4[40b0] via P2P/IPC/read
llm-test-cluster-9:1281342:1281408 [1] NCCL INFO Channel 06 : 1[4080] -> 2[4090] via P2P/IPC/read
llm-test-cluster-9:1281347:1281407 [6] NCCL INFO Channel 01 : 6[40d0] -> 7[40e0] via P2P/IPC/read
llm-test-cluster-9:1281348:1281409 [7] NCCL INFO Channel 08 : 7[40e0] -> 0[4070] via P2P/IPC/read
llm-test-cluster-9:1281346:1281404 [5] NCCL INFO Channel 07 : 5[40c0] -> 6[40d0] via P2P/IPC/read
llm-test-cluster-9:1281344:1281405 [3] NCCL INFO Channel 14 : 3[40a0] -> 4[40b0] via P2P/IPC/read
llm-test-cluster-9:1281342:1281408 [1] NCCL INFO Channel 07 : 1[4080] -> 2[4090] via P2P/IPC/read
llm-test-cluster-9:1281347:1281407 [6] NCCL INFO Channel 02 : 6[40d0] -> 7[40e0] via P2P/IPC/read
llm-test-cluster-9:1281348:1281409 [7] NCCL INFO Channel 09 : 7[40e0] -> 0[4070] via P2P/IPC/read
llm-test-cluster-9:1281346:1281404 [5] NCCL INFO Channel 08 : 5[40c0] -> 6[40d0] via P2P/IPC/read
llm-test-cluster-9:1281344:1281405 [3] NCCL INFO Channel 15 : 3[40a0] -> 4[40b0] via P2P/IPC/read
llm-test-cluster-9:1281341:1281403 [0] NCCL INFO Channel 00 : 0[4070] -> 1[4080] via P2P/IPC/read
llm-test-cluster-9:1281342:1281408 [1] NCCL INFO Channel 08 : 1[4080] -> 2[4090] via P2P/IPC/read
llm-test-cluster-9:1281347:1281407 [6] NCCL INFO Channel 03 : 6[40d0] -> 7[40e0] via P2P/IPC/read
llm-test-cluster-9:1281348:1281409 [7] NCCL INFO Channel 10 : 7[40e0] -> 0[4070] via P2P/IPC/read
llm-test-cluster-9:1281346:1281404 [5] NCCL INFO Channel 09 : 5[40c0] -> 6[40d0] via P2P/IPC/read
llm-test-cluster-9:1281344:1281405 [3] NCCL INFO Channel 16 : 3[40a0] -> 4[40b0] via P2P/IPC/read
llm-test-cluster-9:1281341:1281403 [0] NCCL INFO Channel 01 : 0[4070] -> 1[4080] via P2P/IPC/read
llm-test-cluster-9:1281342:1281408 [1] NCCL INFO Channel 09 : 1[4080] -> 2[4090] via P2P/IPC/read
llm-test-cluster-9:1281347:1281407 [6] NCCL INFO Channel 04 : 6[40d0] -> 7[40e0] via P2P/IPC/read
llm-test-cluster-9:1281348:1281409 [7] NCCL INFO Channel 11 : 7[40e0] -> 0[4070] via P2P/IPC/read
llm-test-cluster-9:1281346:1281404 [5] NCCL INFO Channel 10 : 5[40c0] -> 6[40d0] via P2P/IPC/read
llm-test-cluster-9:1281344:1281405 [3] NCCL INFO Channel 17 : 3[40a0] -> 4[40b0] via P2P/IPC/read
llm-test-cluster-9:1281342:1281408 [1] NCCL INFO Channel 10 : 1[4080] -> 2[4090] via P2P/IPC/read
llm-test-cluster-9:1281341:1281403 [0] NCCL INFO Channel 02 : 0[4070] -> 1[4080] via P2P/IPC/read
llm-test-cluster-9:1281347:1281407 [6] NCCL INFO Channel 05 : 6[40d0] -> 7[40e0] via P2P/IPC/read
llm-test-cluster-9:1281343:1281406 [2] NCCL INFO Channel 00 : 2[4090] -> 3[40a0] via P2P/IPC/read
llm-test-cluster-9:1281348:1281409 [7] NCCL INFO Channel 12 : 7[40e0] -> 0[4070] via P2P/IPC/read
llm-test-cluster-9:1281346:1281404 [5] NCCL INFO Channel 11 : 5[40c0] -> 6[40d0] via P2P/IPC/read
llm-test-cluster-9:1281344:1281405 [3] NCCL INFO Channel 18 : 3[40a0] -> 4[40b0] via P2P/IPC/read
llm-test-cluster-9:1281342:1281408 [1] NCCL INFO Channel 11 : 1[4080] -> 2[4090] via P2P/IPC/read
llm-test-cluster-9:1281341:1281403 [0] NCCL INFO Channel 03 : 0[4070] -> 1[4080] via P2P/IPC/read
llm-test-cluster-9:1281347:1281407 [6] NCCL INFO Channel 06 : 6[40d0] -> 7[40e0] via P2P/IPC/read
llm-test-cluster-9:1281343:1281406 [2] NCCL INFO Channel 01 : 2[4090] -> 3[40a0] via P2P/IPC/read
llm-test-cluster-9:1281348:1281409 [7] NCCL INFO Channel 13 : 7[40e0] -> 0[4070] via P2P/IPC/read
llm-test-cluster-9:1281346:1281404 [5] NCCL INFO Channel 12 : 5[40c0] -> 6[40d0] via P2P/IPC/read
llm-test-cluster-9:1281344:1281405 [3] NCCL INFO Channel 19 : 3[40a0] -> 4[40b0] via P2P/IPC/read
llm-test-cluster-9:1281345:1281410 [4] NCCL INFO Channel 00 : 4[40b0] -> 5[40c0] via P2P/IPC/read
llm-test-cluster-9:1281342:1281408 [1] NCCL INFO Channel 12 : 1[4080] -> 2[4090] via P2P/IPC/read
llm-test-cluster-9:1281341:1281403 [0] NCCL INFO Channel 04 : 0[4070] -> 1[4080] via P2P/IPC/read
llm-test-cluster-9:1281347:1281407 [6] NCCL INFO Channel 07 : 6[40d0] -> 7[40e0] via P2P/IPC/read
llm-test-cluster-9:1281343:1281406 [2] NCCL INFO Channel 02 : 2[4090] -> 3[40a0] via P2P/IPC/read
llm-test-cluster-9:1281348:1281409 [7] NCCL INFO Channel 14 : 7[40e0] -> 0[4070] via P2P/IPC/read
llm-test-cluster-9:1281346:1281404 [5] NCCL INFO Channel 13 : 5[40c0] -> 6[40d0] via P2P/IPC/read
llm-test-cluster-9:1281344:1281405 [3] NCCL INFO Channel 20 : 3[40a0] -> 4[40b0] via P2P/IPC/read
llm-test-cluster-9:1281345:1281410 [4] NCCL INFO Channel 01 : 4[40b0] -> 5[40c0] via P2P/IPC/read
llm-test-cluster-9:1281342:1281408 [1] NCCL INFO Channel 13 : 1[4080] -> 2[4090] via P2P/IPC/read
llm-test-cluster-9:1281341:1281403 [0] NCCL INFO Channel 05 : 0[4070] -> 1[4080] via P2P/IPC/read
llm-test-cluster-9:1281347:1281407 [6] NCCL INFO Channel 08 : 6[40d0] -> 7[40e0] via P2P/IPC/read
llm-test-cluster-9:1281343:1281406 [2] NCCL INFO Channel 03 : 2[4090] -> 3[40a0] via P2P/IPC/read
llm-test-cluster-9:1281348:1281409 [7] NCCL INFO Channel 15 : 7[40e0] -> 0[4070] via P2P/IPC/read
llm-test-cluster-9:1281346:1281404 [5] NCCL INFO Channel 14 : 5[40c0] -> 6[40d0] via P2P/IPC/read
llm-test-cluster-9:1281344:1281405 [3] NCCL INFO Channel 21 : 3[40a0] -> 4[40b0] via P2P/IPC/read
llm-test-cluster-9:1281345:1281410 [4] NCCL INFO Channel 02 : 4[40b0] -> 5[40c0] via P2P/IPC/read
llm-test-cluster-9:1281342:1281408 [1] NCCL INFO Channel 14 : 1[4080] -> 2[4090] via P2P/IPC/read
llm-test-cluster-9:1281341:1281403 [0] NCCL INFO Channel 06 : 0[4070] -> 1[4080] via P2P/IPC/read
llm-test-cluster-9:1281347:1281407 [6] NCCL INFO Channel 09 : 6[40d0] -> 7[40e0] via P2P/IPC/read
llm-test-cluster-9:1281343:1281406 [2] NCCL INFO Channel 04 : 2[4090] -> 3[40a0] via P2P/IPC/read
llm-test-cluster-9:1281348:1281409 [7] NCCL INFO Channel 16 : 7[40e0] -> 0[4070] via P2P/IPC/read
llm-test-cluster-9:1281346:1281404 [5] NCCL INFO Channel 15 : 5[40c0] -> 6[40d0] via P2P/IPC/read
llm-test-cluster-9:1281344:1281405 [3] NCCL INFO Channel 22 : 3[40a0] -> 4[40b0] via P2P/IPC/read
llm-test-cluster-9:1281345:1281410 [4] NCCL INFO Channel 03 : 4[40b0] -> 5[40c0] via P2P/IPC/read
llm-test-cluster-9:1281342:1281408 [1] NCCL INFO Channel 15 : 1[4080] -> 2[4090] via P2P/IPC/read
llm-test-cluster-9:1281341:1281403 [0] NCCL INFO Channel 07 : 0[4070] -> 1[4080] via P2P/IPC/read
llm-test-cluster-9:1281347:1281407 [6] NCCL INFO Channel 10 : 6[40d0] -> 7[40e0] via P2P/IPC/read
llm-test-cluster-9:1281343:1281406 [2] NCCL INFO Channel 05 : 2[4090] -> 3[40a0] via P2P/IPC/read
llm-test-cluster-9:1281348:1281409 [7] NCCL INFO Channel 17 : 7[40e0] -> 0[4070] via P2P/IPC/read
llm-test-cluster-9:1281346:1281404 [5] NCCL INFO Channel 16 : 5[40c0] -> 6[40d0] via P2P/IPC/read
llm-test-cluster-9:1281344:1281405 [3] NCCL INFO Channel 23 : 3[40a0] -> 4[40b0] via P2P/IPC/read
llm-test-cluster-9:1281345:1281410 [4] NCCL INFO Channel 04 : 4[40b0] -> 5[40c0] via P2P/IPC/read
llm-test-cluster-9:1281342:1281408 [1] NCCL INFO Channel 16 : 1[4080] -> 2[4090] via P2P/IPC/read
llm-test-cluster-9:1281341:1281403 [0] NCCL INFO Channel 08 : 0[4070] -> 1[4080] via P2P/IPC/read
llm-test-cluster-9:1281347:1281407 [6] NCCL INFO Channel 11 : 6[40d0] -> 7[40e0] via P2P/IPC/read
llm-test-cluster-9:1281343:1281406 [2] NCCL INFO Channel 06 : 2[4090] -> 3[40a0] via P2P/IPC/read
llm-test-cluster-9:1281348:1281409 [7] NCCL INFO Channel 18 : 7[40e0] -> 0[4070] via P2P/IPC/read
llm-test-cluster-9:1281346:1281404 [5] NCCL INFO Channel 17 : 5[40c0] -> 6[40d0] via P2P/IPC/read
llm-test-cluster-9:1281345:1281410 [4] NCCL INFO Channel 05 : 4[40b0] -> 5[40c0] via P2P/IPC/read
llm-test-cluster-9:1281342:1281408 [1] NCCL INFO Channel 17 : 1[4080] -> 2[4090] via P2P/IPC/read
llm-test-cluster-9:1281341:1281403 [0] NCCL INFO Channel 09 : 0[4070] -> 1[4080] via P2P/IPC/read
llm-test-cluster-9:1281347:1281407 [6] NCCL INFO Channel 12 : 6[40d0] -> 7[40e0] via P2P/IPC/read
llm-test-cluster-9:1281343:1281406 [2] NCCL INFO Channel 07 : 2[4090] -> 3[40a0] via P2P/IPC/read
llm-test-cluster-9:1281348:1281409 [7] NCCL INFO Channel 19 : 7[40e0] -> 0[4070] via P2P/IPC/read
llm-test-cluster-9:1281346:1281404 [5] NCCL INFO Channel 18 : 5[40c0] -> 6[40d0] via P2P/IPC/read
llm-test-cluster-9:1281345:1281410 [4] NCCL INFO Channel 06 : 4[40b0] -> 5[40c0] via P2P/IPC/read
llm-test-cluster-9:1281342:1281408 [1] NCCL INFO Channel 18 : 1[4080] -> 2[4090] via P2P/IPC/read
llm-test-cluster-9:1281341:1281403 [0] NCCL INFO Channel 10 : 0[4070] -> 1[4080] via P2P/IPC/read
llm-test-cluster-9:1281347:1281407 [6] NCCL INFO Channel 13 : 6[40d0] -> 7[40e0] via P2P/IPC/read
llm-test-cluster-9:1281343:1281406 [2] NCCL INFO Channel 08 : 2[4090] -> 3[40a0] via P2P/IPC/read
llm-test-cluster-9:1281348:1281409 [7] NCCL INFO Channel 20 : 7[40e0] -> 0[4070] via P2P/IPC/read
llm-test-cluster-9:1281346:1281404 [5] NCCL INFO Channel 19 : 5[40c0] -> 6[40d0] via P2P/IPC/read
llm-test-cluster-9:1281345:1281410 [4] NCCL INFO Channel 07 : 4[40b0] -> 5[40c0] via P2P/IPC/read
llm-test-cluster-9:1281342:1281408 [1] NCCL INFO Channel 19 : 1[4080] -> 2[4090] via P2P/IPC/read
llm-test-cluster-9:1281347:1281407 [6] NCCL INFO Channel 14 : 6[40d0] -> 7[40e0] via P2P/IPC/read
llm-test-cluster-9:1281341:1281403 [0] NCCL INFO Channel 11 : 0[4070] -> 1[4080] via P2P/IPC/read
llm-test-cluster-9:1281343:1281406 [2] NCCL INFO Channel 09 : 2[4090] -> 3[40a0] via P2P/IPC/read
llm-test-cluster-9:1281348:1281409 [7] NCCL INFO Channel 21 : 7[40e0] -> 0[4070] via P2P/IPC/read
llm-test-cluster-9:1281346:1281404 [5] NCCL INFO Channel 20 : 5[40c0] -> 6[40d0] via P2P/IPC/read
llm-test-cluster-9:1281345:1281410 [4] NCCL INFO Channel 08 : 4[40b0] -> 5[40c0] via P2P/IPC/read
llm-test-cluster-9:1281342:1281408 [1] NCCL INFO Channel 20 : 1[4080] -> 2[4090] via P2P/IPC/read
llm-test-cluster-9:1281347:1281407 [6] NCCL INFO Channel 15 : 6[40d0] -> 7[40e0] via P2P/IPC/read
llm-test-cluster-9:1281341:1281403 [0] NCCL INFO Channel 12 : 0[4070] -> 1[4080] via P2P/IPC/read
llm-test-cluster-9:1281343:1281406 [2] NCCL INFO Channel 10 : 2[4090] -> 3[40a0] via P2P/IPC/read
llm-test-cluster-9:1281348:1281409 [7] NCCL INFO Channel 22 : 7[40e0] -> 0[4070] via P2P/IPC/read
llm-test-cluster-9:1281346:1281404 [5] NCCL INFO Channel 21 : 5[40c0] -> 6[40d0] via P2P/IPC/read
llm-test-cluster-9:1281345:1281410 [4] NCCL INFO Channel 09 : 4[40b0] -> 5[40c0] via P2P/IPC/read
llm-test-cluster-9:1281342:1281408 [1] NCCL INFO Channel 21 : 1[4080] -> 2[4090] via P2P/IPC/read
llm-test-cluster-9:1281347:1281407 [6] NCCL INFO Channel 16 : 6[40d0] -> 7[40e0] via P2P/IPC/read
llm-test-cluster-9:1281341:1281403 [0] NCCL INFO Channel 13 : 0[4070] -> 1[4080] via P2P/IPC/read
llm-test-cluster-9:1281343:1281406 [2] NCCL INFO Channel 11 : 2[4090] -> 3[40a0] via P2P/IPC/read
llm-test-cluster-9:1281348:1281409 [7] NCCL INFO Channel 23 : 7[40e0] -> 0[4070] via P2P/IPC/read
llm-test-cluster-9:1281346:1281404 [5] NCCL INFO Channel 22 : 5[40c0] -> 6[40d0] via P2P/IPC/read
llm-test-cluster-9:1281345:1281410 [4] NCCL INFO Channel 10 : 4[40b0] -> 5[40c0] via P2P/IPC/read
llm-test-cluster-9:1281342:1281408 [1] NCCL INFO Channel 22 : 1[4080] -> 2[4090] via P2P/IPC/read
llm-test-cluster-9:1281347:1281407 [6] NCCL INFO Channel 17 : 6[40d0] -> 7[40e0] via P2P/IPC/read
llm-test-cluster-9:1281341:1281403 [0] NCCL INFO Channel 14 : 0[4070] -> 1[4080] via P2P/IPC/read
llm-test-cluster-9:1281343:1281406 [2] NCCL INFO Channel 12 : 2[4090] -> 3[40a0] via P2P/IPC/read
llm-test-cluster-9:1281346:1281404 [5] NCCL INFO Channel 23 : 5[40c0] -> 6[40d0] via P2P/IPC/read
llm-test-cluster-9:1281345:1281410 [4] NCCL INFO Channel 11 : 4[40b0] -> 5[40c0] via P2P/IPC/read
llm-test-cluster-9:1281342:1281408 [1] NCCL INFO Channel 23 : 1[4080] -> 2[4090] via P2P/IPC/read
llm-test-cluster-9:1281347:1281407 [6] NCCL INFO Channel 18 : 6[40d0] -> 7[40e0] via P2P/IPC/read
llm-test-cluster-9:1281341:1281403 [0] NCCL INFO Channel 15 : 0[4070] -> 1[4080] via P2P/IPC/read
llm-test-cluster-9:1281343:1281406 [2] NCCL INFO Channel 13 : 2[4090] -> 3[40a0] via P2P/IPC/read
llm-test-cluster-9:1281345:1281410 [4] NCCL INFO Channel 12 : 4[40b0] -> 5[40c0] via P2P/IPC/read
llm-test-cluster-9:1281347:1281407 [6] NCCL INFO Channel 19 : 6[40d0] -> 7[40e0] via P2P/IPC/read
llm-test-cluster-9:1281343:1281406 [2] NCCL INFO Channel 14 : 2[4090] -> 3[40a0] via P2P/IPC/read
llm-test-cluster-9:1281341:1281403 [0] NCCL INFO Channel 16 : 0[4070] -> 1[4080] via P2P/IPC/read
llm-test-cluster-9:1281345:1281410 [4] NCCL INFO Channel 13 : 4[40b0] -> 5[40c0] via P2P/IPC/read
llm-test-cluster-9:1281343:1281406 [2] NCCL INFO Channel 15 : 2[4090] -> 3[40a0] via P2P/IPC/read
llm-test-cluster-9:1281347:1281407 [6] NCCL INFO Channel 20 : 6[40d0] -> 7[40e0] via P2P/IPC/read
llm-test-cluster-9:1281341:1281403 [0] NCCL INFO Channel 17 : 0[4070] -> 1[4080] via P2P/IPC/read
llm-test-cluster-9:1281345:1281410 [4] NCCL INFO Channel 14 : 4[40b0] -> 5[40c0] via P2P/IPC/read
llm-test-cluster-9:1281343:1281406 [2] NCCL INFO Channel 16 : 2[4090] -> 3[40a0] via P2P/IPC/read
llm-test-cluster-9:1281347:1281407 [6] NCCL INFO Channel 21 : 6[40d0] -> 7[40e0] via P2P/IPC/read
llm-test-cluster-9:1281341:1281403 [0] NCCL INFO Channel 18 : 0[4070] -> 1[4080] via P2P/IPC/read
llm-test-cluster-9:1281345:1281410 [4] NCCL INFO Channel 15 : 4[40b0] -> 5[40c0] via P2P/IPC/read
llm-test-cluster-9:1281343:1281406 [2] NCCL INFO Channel 17 : 2[4090] -> 3[40a0] via P2P/IPC/read
llm-test-cluster-9:1281347:1281407 [6] NCCL INFO Channel 22 : 6[40d0] -> 7[40e0] via P2P/IPC/read
llm-test-cluster-9:1281341:1281403 [0] NCCL INFO Channel 19 : 0[4070] -> 1[4080] via P2P/IPC/read
llm-test-cluster-9:1281345:1281410 [4] NCCL INFO Channel 16 : 4[40b0] -> 5[40c0] via P2P/IPC/read
llm-test-cluster-9:1281343:1281406 [2] NCCL INFO Channel 18 : 2[4090] -> 3[40a0] via P2P/IPC/read
llm-test-cluster-9:1281347:1281407 [6] NCCL INFO Channel 23 : 6[40d0] -> 7[40e0] via P2P/IPC/read
llm-test-cluster-9:1281341:1281403 [0] NCCL INFO Channel 20 : 0[4070] -> 1[4080] via P2P/IPC/read
llm-test-cluster-9:1281345:1281410 [4] NCCL INFO Channel 17 : 4[40b0] -> 5[40c0] via P2P/IPC/read
llm-test-cluster-9:1281343:1281406 [2] NCCL INFO Channel 19 : 2[4090] -> 3[40a0] via P2P/IPC/read
llm-test-cluster-9:1281341:1281403 [0] NCCL INFO Channel 21 : 0[4070] -> 1[4080] via P2P/IPC/read
llm-test-cluster-9:1281345:1281410 [4] NCCL INFO Channel 18 : 4[40b0] -> 5[40c0] via P2P/IPC/read
llm-test-cluster-9:1281343:1281406 [2] NCCL INFO Channel 20 : 2[4090] -> 3[40a0] via P2P/IPC/read
llm-test-cluster-9:1281341:1281403 [0] NCCL INFO Channel 22 : 0[4070] -> 1[4080] via P2P/IPC/read
llm-test-cluster-9:1281345:1281410 [4] NCCL INFO Channel 19 : 4[40b0] -> 5[40c0] via P2P/IPC/read
llm-test-cluster-9:1281343:1281406 [2] NCCL INFO Channel 21 : 2[4090] -> 3[40a0] via P2P/IPC/read
llm-test-cluster-9:1281341:1281403 [0] NCCL INFO Channel 23 : 0[4070] -> 1[4080] via P2P/IPC/read
llm-test-cluster-9:1281343:1281406 [2] NCCL INFO Channel 22 : 2[4090] -> 3[40a0] via P2P/IPC/read
llm-test-cluster-9:1281345:1281410 [4] NCCL INFO Channel 20 : 4[40b0] -> 5[40c0] via P2P/IPC/read
llm-test-cluster-9:1281343:1281406 [2] NCCL INFO Channel 23 : 2[4090] -> 3[40a0] via P2P/IPC/read
llm-test-cluster-9:1281345:1281410 [4] NCCL INFO Channel 21 : 4[40b0] -> 5[40c0] via P2P/IPC/read
llm-test-cluster-9:1281345:1281410 [4] NCCL INFO Channel 22 : 4[40b0] -> 5[40c0] via P2P/IPC/read
llm-test-cluster-9:1281345:1281410 [4] NCCL INFO Channel 23 : 4[40b0] -> 5[40c0] via P2P/IPC/read
llm-test-cluster-9:1281347:1281407 [6] NCCL INFO Connected all rings
llm-test-cluster-9:1281341:1281403 [0] NCCL INFO Connected all rings
llm-test-cluster-9:1281342:1281408 [1] NCCL INFO Connected all rings
llm-test-cluster-9:1281343:1281406 [2] NCCL INFO Connected all rings
llm-test-cluster-9:1281348:1281409 [7] NCCL INFO Connected all rings
llm-test-cluster-9:1281348:1281409 [7] NCCL INFO Channel 00 : 7[40e0] -> 6[40d0] via P2P/IPC/read
llm-test-cluster-9:1281345:1281410 [4] NCCL INFO Connected all rings
llm-test-cluster-9:1281346:1281404 [5] NCCL INFO Connected all rings
llm-test-cluster-9:1281348:1281409 [7] NCCL INFO Channel 01 : 7[40e0] -> 6[40d0] via P2P/IPC/read
llm-test-cluster-9:1281344:1281405 [3] NCCL INFO Connected all rings
llm-test-cluster-9:1281348:1281409 [7] NCCL INFO Channel 02 : 7[40e0] -> 6[40d0] via P2P/IPC/read
llm-test-cluster-9:1281348:1281409 [7] NCCL INFO Channel 03 : 7[40e0] -> 6[40d0] via P2P/IPC/read
llm-test-cluster-9:1281348:1281409 [7] NCCL INFO Channel 04 : 7[40e0] -> 6[40d0] via P2P/IPC/read
llm-test-cluster-9:1281348:1281409 [7] NCCL INFO Channel 05 : 7[40e0] -> 6[40d0] via P2P/IPC/read
llm-test-cluster-9:1281348:1281409 [7] NCCL INFO Channel 06 : 7[40e0] -> 6[40d0] via P2P/IPC/read
llm-test-cluster-9:1281348:1281409 [7] NCCL INFO Channel 07 : 7[40e0] -> 6[40d0] via P2P/IPC/read
llm-test-cluster-9:1281348:1281409 [7] NCCL INFO Channel 08 : 7[40e0] -> 6[40d0] via P2P/IPC/read
llm-test-cluster-9:1281348:1281409 [7] NCCL INFO Channel 09 : 7[40e0] -> 6[40d0] via P2P/IPC/read
llm-test-cluster-9:1281348:1281409 [7] NCCL INFO Channel 10 : 7[40e0] -> 6[40d0] via P2P/IPC/read
llm-test-cluster-9:1281348:1281409 [7] NCCL INFO Channel 11 : 7[40e0] -> 6[40d0] via P2P/IPC/read
llm-test-cluster-9:1281348:1281409 [7] NCCL INFO Channel 12 : 7[40e0] -> 6[40d0] via P2P/IPC/read
llm-test-cluster-9:1281348:1281409 [7] NCCL INFO Channel 13 : 7[40e0] -> 6[40d0] via P2P/IPC/read
llm-test-cluster-9:1281348:1281409 [7] NCCL INFO Channel 14 : 7[40e0] -> 6[40d0] via P2P/IPC/read
llm-test-cluster-9:1281348:1281409 [7] NCCL INFO Channel 15 : 7[40e0] -> 6[40d0] via P2P/IPC/read
llm-test-cluster-9:1281348:1281409 [7] NCCL INFO Channel 16 : 7[40e0] -> 6[40d0] via P2P/IPC/read
llm-test-cluster-9:1281348:1281409 [7] NCCL INFO Channel 17 : 7[40e0] -> 6[40d0] via P2P/IPC/read
llm-test-cluster-9:1281348:1281409 [7] NCCL INFO Channel 18 : 7[40e0] -> 6[40d0] via P2P/IPC/read
llm-test-cluster-9:1281347:1281407 [6] NCCL INFO Channel 00 : 6[40d0] -> 5[40c0] via P2P/IPC/read
llm-test-cluster-9:1281348:1281409 [7] NCCL INFO Channel 19 : 7[40e0] -> 6[40d0] via P2P/IPC/read
llm-test-cluster-9:1281347:1281407 [6] NCCL INFO Channel 01 : 6[40d0] -> 5[40c0] via P2P/IPC/read
llm-test-cluster-9:1281348:1281409 [7] NCCL INFO Channel 20 : 7[40e0] -> 6[40d0] via P2P/IPC/read
llm-test-cluster-9:1281347:1281407 [6] NCCL INFO Channel 02 : 6[40d0] -> 5[40c0] via P2P/IPC/read
llm-test-cluster-9:1281348:1281409 [7] NCCL INFO Channel 21 : 7[40e0] -> 6[40d0] via P2P/IPC/read
llm-test-cluster-9:1281347:1281407 [6] NCCL INFO Channel 03 : 6[40d0] -> 5[40c0] via P2P/IPC/read
llm-test-cluster-9:1281348:1281409 [7] NCCL INFO Channel 22 : 7[40e0] -> 6[40d0] via P2P/IPC/read
llm-test-cluster-9:1281347:1281407 [6] NCCL INFO Channel 04 : 6[40d0] -> 5[40c0] via P2P/IPC/read
llm-test-cluster-9:1281348:1281409 [7] NCCL INFO Channel 23 : 7[40e0] -> 6[40d0] via P2P/IPC/read
llm-test-cluster-9:1281342:1281408 [1] NCCL INFO Channel 00 : 1[4080] -> 0[4070] via P2P/IPC/read
llm-test-cluster-9:1281343:1281406 [2] NCCL INFO Channel 00 : 2[4090] -> 1[4080] via P2P/IPC/read
llm-test-cluster-9:1281347:1281407 [6] NCCL INFO Channel 05 : 6[40d0] -> 5[40c0] via P2P/IPC/read
llm-test-cluster-9:1281342:1281408 [1] NCCL INFO Channel 01 : 1[4080] -> 0[4070] via P2P/IPC/read
llm-test-cluster-9:1281343:1281406 [2] NCCL INFO Channel 01 : 2[4090] -> 1[4080] via P2P/IPC/read
llm-test-cluster-9:1281347:1281407 [6] NCCL INFO Channel 06 : 6[40d0] -> 5[40c0] via P2P/IPC/read
llm-test-cluster-9:1281342:1281408 [1] NCCL INFO Channel 02 : 1[4080] -> 0[4070] via P2P/IPC/read
llm-test-cluster-9:1281345:1281410 [4] NCCL INFO Channel 00 : 4[40b0] -> 3[40a0] via P2P/IPC/read
llm-test-cluster-9:1281343:1281406 [2] NCCL INFO Channel 02 : 2[4090] -> 1[4080] via P2P/IPC/read
llm-test-cluster-9:1281347:1281407 [6] NCCL INFO Channel 07 : 6[40d0] -> 5[40c0] via P2P/IPC/read
llm-test-cluster-9:1281344:1281405 [3] NCCL INFO Channel 00 : 3[40a0] -> 2[4090] via P2P/IPC/read
llm-test-cluster-9:1281342:1281408 [1] NCCL INFO Channel 03 : 1[4080] -> 0[4070] via P2P/IPC/read
llm-test-cluster-9:1281343:1281406 [2] NCCL INFO Channel 03 : 2[4090] -> 1[4080] via P2P/IPC/read
llm-test-cluster-9:1281345:1281410 [4] NCCL INFO Channel 01 : 4[40b0] -> 3[40a0] via P2P/IPC/read
llm-test-cluster-9:1281347:1281407 [6] NCCL INFO Channel 08 : 6[40d0] -> 5[40c0] via P2P/IPC/read
llm-test-cluster-9:1281342:1281408 [1] NCCL INFO Channel 04 : 1[4080] -> 0[4070] via P2P/IPC/read
llm-test-cluster-9:1281344:1281405 [3] NCCL INFO Channel 01 : 3[40a0] -> 2[4090] via P2P/IPC/read
llm-test-cluster-9:1281343:1281406 [2] NCCL INFO Channel 04 : 2[4090] -> 1[4080] via P2P/IPC/read
llm-test-cluster-9:1281345:1281410 [4] NCCL INFO Channel 02 : 4[40b0] -> 3[40a0] via P2P/IPC/read
llm-test-cluster-9:1281347:1281407 [6] NCCL INFO Channel 09 : 6[40d0] -> 5[40c0] via P2P/IPC/read
llm-test-cluster-9:1281342:1281408 [1] NCCL INFO Channel 05 : 1[4080] -> 0[4070] via P2P/IPC/read
llm-test-cluster-9:1281344:1281405 [3] NCCL INFO Channel 02 : 3[40a0] -> 2[4090] via P2P/IPC/read
llm-test-cluster-9:1281346:1281404 [5] NCCL INFO Channel 00 : 5[40c0] -> 4[40b0] via P2P/IPC/read
llm-test-cluster-9:1281345:1281410 [4] NCCL INFO Channel 03 : 4[40b0] -> 3[40a0] via P2P/IPC/read
llm-test-cluster-9:1281343:1281406 [2] NCCL INFO Channel 05 : 2[4090] -> 1[4080] via P2P/IPC/read
llm-test-cluster-9:1281347:1281407 [6] NCCL INFO Channel 10 : 6[40d0] -> 5[40c0] via P2P/IPC/read
llm-test-cluster-9:1281342:1281408 [1] NCCL INFO Channel 06 : 1[4080] -> 0[4070] via P2P/IPC/read
llm-test-cluster-9:1281344:1281405 [3] NCCL INFO Channel 03 : 3[40a0] -> 2[4090] via P2P/IPC/read
llm-test-cluster-9:1281346:1281404 [5] NCCL INFO Channel 01 : 5[40c0] -> 4[40b0] via P2P/IPC/read
llm-test-cluster-9:1281343:1281406 [2] NCCL INFO Channel 06 : 2[4090] -> 1[4080] via P2P/IPC/read
llm-test-cluster-9:1281345:1281410 [4] NCCL INFO Channel 04 : 4[40b0] -> 3[40a0] via P2P/IPC/read
llm-test-cluster-9:1281347:1281407 [6] NCCL INFO Channel 11 : 6[40d0] -> 5[40c0] via P2P/IPC/read
llm-test-cluster-9:1281342:1281408 [1] NCCL INFO Channel 07 : 1[4080] -> 0[4070] via P2P/IPC/read
llm-test-cluster-9:1281344:1281405 [3] NCCL INFO Channel 04 : 3[40a0] -> 2[4090] via P2P/IPC/read
llm-test-cluster-9:1281346:1281404 [5] NCCL INFO Channel 02 : 5[40c0] -> 4[40b0] via P2P/IPC/read
llm-test-cluster-9:1281343:1281406 [2] NCCL INFO Channel 07 : 2[4090] -> 1[4080] via P2P/IPC/read
llm-test-cluster-9:1281345:1281410 [4] NCCL INFO Channel 05 : 4[40b0] -> 3[40a0] via P2P/IPC/read
llm-test-cluster-9:1281347:1281407 [6] NCCL INFO Channel 12 : 6[40d0] -> 5[40c0] via P2P/IPC/read
llm-test-cluster-9:1281342:1281408 [1] NCCL INFO Channel 08 : 1[4080] -> 0[4070] via P2P/IPC/read
llm-test-cluster-9:1281344:1281405 [3] NCCL INFO Channel 05 : 3[40a0] -> 2[4090] via P2P/IPC/read
llm-test-cluster-9:1281346:1281404 [5] NCCL INFO Channel 03 : 5[40c0] -> 4[40b0] via P2P/IPC/read
llm-test-cluster-9:1281345:1281410 [4] NCCL INFO Channel 06 : 4[40b0] -> 3[40a0] via P2P/IPC/read
llm-test-cluster-9:1281343:1281406 [2] NCCL INFO Channel 08 : 2[4090] -> 1[4080] via P2P/IPC/read
llm-test-cluster-9:1281347:1281407 [6] NCCL INFO Channel 13 : 6[40d0] -> 5[40c0] via P2P/IPC/read
llm-test-cluster-9:1281342:1281408 [1] NCCL INFO Channel 09 : 1[4080] -> 0[4070] via P2P/IPC/read
llm-test-cluster-9:1281344:1281405 [3] NCCL INFO Channel 06 : 3[40a0] -> 2[4090] via P2P/IPC/read
llm-test-cluster-9:1281346:1281404 [5] NCCL INFO Channel 04 : 5[40c0] -> 4[40b0] via P2P/IPC/read
llm-test-cluster-9:1281345:1281410 [4] NCCL INFO Channel 07 : 4[40b0] -> 3[40a0] via P2P/IPC/read
llm-test-cluster-9:1281343:1281406 [2] NCCL INFO Channel 09 : 2[4090] -> 1[4080] via P2P/IPC/read
llm-test-cluster-9:1281347:1281407 [6] NCCL INFO Channel 14 : 6[40d0] -> 5[40c0] via P2P/IPC/read
llm-test-cluster-9:1281342:1281408 [1] NCCL INFO Channel 10 : 1[4080] -> 0[4070] via P2P/IPC/read
llm-test-cluster-9:1281344:1281405 [3] NCCL INFO Channel 07 : 3[40a0] -> 2[4090] via P2P/IPC/read
llm-test-cluster-9:1281346:1281404 [5] NCCL INFO Channel 05 : 5[40c0] -> 4[40b0] via P2P/IPC/read
llm-test-cluster-9:1281345:1281410 [4] NCCL INFO Channel 08 : 4[40b0] -> 3[40a0] via P2P/IPC/read
llm-test-cluster-9:1281347:1281407 [6] NCCL INFO Channel 15 : 6[40d0] -> 5[40c0] via P2P/IPC/read
llm-test-cluster-9:1281343:1281406 [2] NCCL INFO Channel 10 : 2[4090] -> 1[4080] via P2P/IPC/read
llm-test-cluster-9:1281342:1281408 [1] NCCL INFO Channel 11 : 1[4080] -> 0[4070] via P2P/IPC/read
llm-test-cluster-9:1281344:1281405 [3] NCCL INFO Channel 08 : 3[40a0] -> 2[4090] via P2P/IPC/read
llm-test-cluster-9:1281346:1281404 [5] NCCL INFO Channel 06 : 5[40c0] -> 4[40b0] via P2P/IPC/read
llm-test-cluster-9:1281347:1281407 [6] NCCL INFO Channel 16 : 6[40d0] -> 5[40c0] via P2P/IPC/read
llm-test-cluster-9:1281345:1281410 [4] NCCL INFO Channel 09 : 4[40b0] -> 3[40a0] via P2P/IPC/read
llm-test-cluster-9:1281342:1281408 [1] NCCL INFO Channel 12 : 1[4080] -> 0[4070] via P2P/IPC/read
llm-test-cluster-9:1281343:1281406 [2] NCCL INFO Channel 11 : 2[4090] -> 1[4080] via P2P/IPC/read
llm-test-cluster-9:1281344:1281405 [3] NCCL INFO Channel 09 : 3[40a0] -> 2[4090] via P2P/IPC/read
llm-test-cluster-9:1281346:1281404 [5] NCCL INFO Channel 07 : 5[40c0] -> 4[40b0] via P2P/IPC/read
llm-test-cluster-9:1281347:1281407 [6] NCCL INFO Channel 17 : 6[40d0] -> 5[40c0] via P2P/IPC/read
llm-test-cluster-9:1281342:1281408 [1] NCCL INFO Channel 13 : 1[4080] -> 0[4070] via P2P/IPC/read
llm-test-cluster-9:1281345:1281410 [4] NCCL INFO Channel 10 : 4[40b0] -> 3[40a0] via P2P/IPC/read
llm-test-cluster-9:1281343:1281406 [2] NCCL INFO Channel 12 : 2[4090] -> 1[4080] via P2P/IPC/read
llm-test-cluster-9:1281344:1281405 [3] NCCL INFO Channel 10 : 3[40a0] -> 2[4090] via P2P/IPC/read
llm-test-cluster-9:1281346:1281404 [5] NCCL INFO Channel 08 : 5[40c0] -> 4[40b0] via P2P/IPC/read
llm-test-cluster-9:1281342:1281408 [1] NCCL INFO Channel 14 : 1[4080] -> 0[4070] via P2P/IPC/read
llm-test-cluster-9:1281347:1281407 [6] NCCL INFO Channel 18 : 6[40d0] -> 5[40c0] via P2P/IPC/read
llm-test-cluster-9:1281343:1281406 [2] NCCL INFO Channel 13 : 2[4090] -> 1[4080] via P2P/IPC/read
llm-test-cluster-9:1281345:1281410 [4] NCCL INFO Channel 11 : 4[40b0] -> 3[40a0] via P2P/IPC/read
llm-test-cluster-9:1281344:1281405 [3] NCCL INFO Channel 11 : 3[40a0] -> 2[4090] via P2P/IPC/read
llm-test-cluster-9:1281346:1281404 [5] NCCL INFO Channel 09 : 5[40c0] -> 4[40b0] via P2P/IPC/read
llm-test-cluster-9:1281347:1281407 [6] NCCL INFO Channel 19 : 6[40d0] -> 5[40c0] via P2P/IPC/read
llm-test-cluster-9:1281342:1281408 [1] NCCL INFO Channel 15 : 1[4080] -> 0[4070] via P2P/IPC/read
llm-test-cluster-9:1281343:1281406 [2] NCCL INFO Channel 14 : 2[4090] -> 1[4080] via P2P/IPC/read
llm-test-cluster-9:1281345:1281410 [4] NCCL INFO Channel 12 : 4[40b0] -> 3[40a0] via P2P/IPC/read
llm-test-cluster-9:1281344:1281405 [3] NCCL INFO Channel 12 : 3[40a0] -> 2[4090] via P2P/IPC/read
llm-test-cluster-9:1281346:1281404 [5] NCCL INFO Channel 10 : 5[40c0] -> 4[40b0] via P2P/IPC/read
llm-test-cluster-9:1281342:1281408 [1] NCCL INFO Channel 16 : 1[4080] -> 0[4070] via P2P/IPC/read
llm-test-cluster-9:1281347:1281407 [6] NCCL INFO Channel 20 : 6[40d0] -> 5[40c0] via P2P/IPC/read
llm-test-cluster-9:1281343:1281406 [2] NCCL INFO Channel 15 : 2[4090] -> 1[4080] via P2P/IPC/read
llm-test-cluster-9:1281345:1281410 [4] NCCL INFO Channel 13 : 4[40b0] -> 3[40a0] via P2P/IPC/read
llm-test-cluster-9:1281344:1281405 [3] NCCL INFO Channel 13 : 3[40a0] -> 2[4090] via P2P/IPC/read
llm-test-cluster-9:1281346:1281404 [5] NCCL INFO Channel 11 : 5[40c0] -> 4[40b0] via P2P/IPC/read
llm-test-cluster-9:1281342:1281408 [1] NCCL INFO Channel 17 : 1[4080] -> 0[4070] via P2P/IPC/read
llm-test-cluster-9:1281347:1281407 [6] NCCL INFO Channel 21 : 6[40d0] -> 5[40c0] via P2P/IPC/read
llm-test-cluster-9:1281343:1281406 [2] NCCL INFO Channel 16 : 2[4090] -> 1[4080] via P2P/IPC/read
llm-test-cluster-9:1281345:1281410 [4] NCCL INFO Channel 14 : 4[40b0] -> 3[40a0] via P2P/IPC/read
llm-test-cluster-9:1281344:1281405 [3] NCCL INFO Channel 14 : 3[40a0] -> 2[4090] via P2P/IPC/read
llm-test-cluster-9:1281346:1281404 [5] NCCL INFO Channel 12 : 5[40c0] -> 4[40b0] via P2P/IPC/read
llm-test-cluster-9:1281342:1281408 [1] NCCL INFO Channel 18 : 1[4080] -> 0[4070] via P2P/IPC/read
llm-test-cluster-9:1281347:1281407 [6] NCCL INFO Channel 22 : 6[40d0] -> 5[40c0] via P2P/IPC/read
llm-test-cluster-9:1281343:1281406 [2] NCCL INFO Channel 17 : 2[4090] -> 1[4080] via P2P/IPC/read
llm-test-cluster-9:1281345:1281410 [4] NCCL INFO Channel 15 : 4[40b0] -> 3[40a0] via P2P/IPC/read
llm-test-cluster-9:1281344:1281405 [3] NCCL INFO Channel 15 : 3[40a0] -> 2[4090] via P2P/IPC/read
llm-test-cluster-9:1281346:1281404 [5] NCCL INFO Channel 13 : 5[40c0] -> 4[40b0] via P2P/IPC/read
llm-test-cluster-9:1281342:1281408 [1] NCCL INFO Channel 19 : 1[4080] -> 0[4070] via P2P/IPC/read
llm-test-cluster-9:1281347:1281407 [6] NCCL INFO Channel 23 : 6[40d0] -> 5[40c0] via P2P/IPC/read
llm-test-cluster-9:1281343:1281406 [2] NCCL INFO Channel 18 : 2[4090] -> 1[4080] via P2P/IPC/read
llm-test-cluster-9:1281344:1281405 [3] NCCL INFO Channel 16 : 3[40a0] -> 2[4090] via P2P/IPC/read
llm-test-cluster-9:1281345:1281410 [4] NCCL INFO Channel 16 : 4[40b0] -> 3[40a0] via P2P/IPC/read
llm-test-cluster-9:1281346:1281404 [5] NCCL INFO Channel 14 : 5[40c0] -> 4[40b0] via P2P/IPC/read
llm-test-cluster-9:1281342:1281408 [1] NCCL INFO Channel 20 : 1[4080] -> 0[4070] via P2P/IPC/read
llm-test-cluster-9:1281343:1281406 [2] NCCL INFO Channel 19 : 2[4090] -> 1[4080] via P2P/IPC/read
llm-test-cluster-9:1281344:1281405 [3] NCCL INFO Channel 17 : 3[40a0] -> 2[4090] via P2P/IPC/read
llm-test-cluster-9:1281345:1281410 [4] NCCL INFO Channel 17 : 4[40b0] -> 3[40a0] via P2P/IPC/read
llm-test-cluster-9:1281346:1281404 [5] NCCL INFO Channel 15 : 5[40c0] -> 4[40b0] via P2P/IPC/read
llm-test-cluster-9:1281342:1281408 [1] NCCL INFO Channel 21 : 1[4080] -> 0[4070] via P2P/IPC/read
llm-test-cluster-9:1281343:1281406 [2] NCCL INFO Channel 20 : 2[4090] -> 1[4080] via P2P/IPC/read
llm-test-cluster-9:1281344:1281405 [3] NCCL INFO Channel 18 : 3[40a0] -> 2[4090] via P2P/IPC/read
llm-test-cluster-9:1281345:1281410 [4] NCCL INFO Channel 18 : 4[40b0] -> 3[40a0] via P2P/IPC/read
llm-test-cluster-9:1281346:1281404 [5] NCCL INFO Channel 16 : 5[40c0] -> 4[40b0] via P2P/IPC/read
llm-test-cluster-9:1281342:1281408 [1] NCCL INFO Channel 22 : 1[4080] -> 0[4070] via P2P/IPC/read
llm-test-cluster-9:1281343:1281406 [2] NCCL INFO Channel 21 : 2[4090] -> 1[4080] via P2P/IPC/read
llm-test-cluster-9:1281344:1281405 [3] NCCL INFO Channel 19 : 3[40a0] -> 2[4090] via P2P/IPC/read
llm-test-cluster-9:1281345:1281410 [4] NCCL INFO Channel 19 : 4[40b0] -> 3[40a0] via P2P/IPC/read
llm-test-cluster-9:1281346:1281404 [5] NCCL INFO Channel 17 : 5[40c0] -> 4[40b0] via P2P/IPC/read
llm-test-cluster-9:1281342:1281408 [1] NCCL INFO Channel 23 : 1[4080] -> 0[4070] via P2P/IPC/read
llm-test-cluster-9:1281343:1281406 [2] NCCL INFO Channel 22 : 2[4090] -> 1[4080] via P2P/IPC/read
llm-test-cluster-9:1281344:1281405 [3] NCCL INFO Channel 20 : 3[40a0] -> 2[4090] via P2P/IPC/read
llm-test-cluster-9:1281346:1281404 [5] NCCL INFO Channel 18 : 5[40c0] -> 4[40b0] via P2P/IPC/read
llm-test-cluster-9:1281345:1281410 [4] NCCL INFO Channel 20 : 4[40b0] -> 3[40a0] via P2P/IPC/read
llm-test-cluster-9:1281343:1281406 [2] NCCL INFO Channel 23 : 2[4090] -> 1[4080] via P2P/IPC/read
llm-test-cluster-9:1281344:1281405 [3] NCCL INFO Channel 21 : 3[40a0] -> 2[4090] via P2P/IPC/read
llm-test-cluster-9:1281346:1281404 [5] NCCL INFO Channel 19 : 5[40c0] -> 4[40b0] via P2P/IPC/read
llm-test-cluster-9:1281345:1281410 [4] NCCL INFO Channel 21 : 4[40b0] -> 3[40a0] via P2P/IPC/read
llm-test-cluster-9:1281344:1281405 [3] NCCL INFO Channel 22 : 3[40a0] -> 2[4090] via P2P/IPC/read
llm-test-cluster-9:1281346:1281404 [5] NCCL INFO Channel 20 : 5[40c0] -> 4[40b0] via P2P/IPC/read
llm-test-cluster-9:1281345:1281410 [4] NCCL INFO Channel 22 : 4[40b0] -> 3[40a0] via P2P/IPC/read
llm-test-cluster-9:1281344:1281405 [3] NCCL INFO Channel 23 : 3[40a0] -> 2[4090] via P2P/IPC/read
llm-test-cluster-9:1281346:1281404 [5] NCCL INFO Channel 21 : 5[40c0] -> 4[40b0] via P2P/IPC/read
llm-test-cluster-9:1281345:1281410 [4] NCCL INFO Channel 23 : 4[40b0] -> 3[40a0] via P2P/IPC/read
llm-test-cluster-9:1281346:1281404 [5] NCCL INFO Channel 22 : 5[40c0] -> 4[40b0] via P2P/IPC/read
llm-test-cluster-9:1281346:1281404 [5] NCCL INFO Channel 23 : 5[40c0] -> 4[40b0] via P2P/IPC/read
llm-test-cluster-9:1281348:1281409 [7] NCCL INFO Connected all trees
llm-test-cluster-9:1281348:1281409 [7] NCCL INFO threadThresholds 8/8/64 | 64/8/64 | 8/8/512
llm-test-cluster-9:1281348:1281409 [7] NCCL INFO 24 coll channels, 32 p2p channels, 32 p2p channels per peer
llm-test-cluster-9:1281341:1281403 [0] NCCL INFO Connected all trees
llm-test-cluster-9:1281341:1281403 [0] NCCL INFO threadThresholds 8/8/64 | 64/8/64 | 8/8/512
llm-test-cluster-9:1281341:1281403 [0] NCCL INFO 24 coll channels, 32 p2p channels, 32 p2p channels per peer
llm-test-cluster-9:1281342:1281408 [1] NCCL INFO Connected all trees
llm-test-cluster-9:1281342:1281408 [1] NCCL INFO threadThresholds 8/8/64 | 64/8/64 | 8/8/512
llm-test-cluster-9:1281343:1281406 [2] NCCL INFO Connected all trees
llm-test-cluster-9:1281343:1281406 [2] NCCL INFO threadThresholds 8/8/64 | 64/8/64 | 8/8/512
llm-test-cluster-9:1281344:1281405 [3] NCCL INFO Connected all trees
llm-test-cluster-9:1281344:1281405 [3] NCCL INFO threadThresholds 8/8/64 | 64/8/64 | 8/8/512
llm-test-cluster-9:1281342:1281408 [1] NCCL INFO 24 coll channels, 32 p2p channels, 32 p2p channels per peer
llm-test-cluster-9:1281346:1281404 [5] NCCL INFO Connected all trees
llm-test-cluster-9:1281346:1281404 [5] NCCL INFO threadThresholds 8/8/64 | 64/8/64 | 8/8/512
llm-test-cluster-9:1281347:1281407 [6] NCCL INFO Connected all trees
llm-test-cluster-9:1281347:1281407 [6] NCCL INFO threadThresholds 8/8/64 | 64/8/64 | 8/8/512
llm-test-cluster-9:1281345:1281410 [4] NCCL INFO Connected all trees
llm-test-cluster-9:1281345:1281410 [4] NCCL INFO threadThresholds 8/8/64 | 64/8/64 | 8/8/512
llm-test-cluster-9:1281343:1281406 [2] NCCL INFO 24 coll channels, 32 p2p channels, 32 p2p channels per peer
llm-test-cluster-9:1281344:1281405 [3] NCCL INFO 24 coll channels, 32 p2p channels, 32 p2p channels per peer
llm-test-cluster-9:1281345:1281410 [4] NCCL INFO 24 coll channels, 32 p2p channels, 32 p2p channels per peer
llm-test-cluster-9:1281347:1281407 [6] NCCL INFO 24 coll channels, 32 p2p channels, 32 p2p channels per peer
llm-test-cluster-9:1281346:1281404 [5] NCCL INFO 24 coll channels, 32 p2p channels, 32 p2p channels per peer
llm-test-cluster-9:1281342:1281408 [1] NCCL INFO comm 0x7f6890002fb0 rank 1 nranks 8 cudaDev 1 busId 4080 - Init COMPLETE
llm-test-cluster-9:1281345:1281410 [4] NCCL INFO comm 0x7fbcc4002fb0 rank 4 nranks 8 cudaDev 4 busId 40b0 - Init COMPLETE
llm-test-cluster-9:1281343:1281406 [2] NCCL INFO comm 0x7f0b9c002fb0 rank 2 nranks 8 cudaDev 2 busId 4090 - Init COMPLETE
llm-test-cluster-9:1281347:1281407 [6] NCCL INFO comm 0x7f09a0002fb0 rank 6 nranks 8 cudaDev 6 busId 40d0 - Init COMPLETE
llm-test-cluster-9:1281344:1281405 [3] NCCL INFO comm 0x7f61d0002fb0 rank 3 nranks 8 cudaDev 3 busId 40a0 - Init COMPLETE
llm-test-cluster-9:1281341:1281403 [0] NCCL INFO comm 0x7fbd04002fb0 rank 0 nranks 8 cudaDev 0 busId 4070 - Init COMPLETE
llm-test-cluster-9:1281341:1281341 [0] NCCL INFO Launch mode Parallel
llm-test-cluster-9:1281346:1281404 [5] NCCL INFO comm 0x7f03dc002fb0 rank 5 nranks 8 cudaDev 5 busId 40c0 - Init COMPLETE
llm-test-cluster-9:1281348:1281409 [7] NCCL INFO comm 0x7f1000002fb0 rank 7 nranks 8 cudaDev 7 busId 40e0 - Init COMPLETE
[2022-07-26 11:41:29,495] [INFO] [utils.py:827:see_memory_usage] pre-ds-inference-init
[2022-07-26 11:41:29,495] [INFO] [utils.py:828:see_memory_usage] MA 0.0 GB Max_MA 0.0 GB CA 0.0 GB Max_CA 0 GB
[2022-07-26 11:41:29,496] [INFO] [utils.py:836:see_memory_usage] CPU Virtual Memory: used = 19.92 GB, percent = 1.6%
[2022-07-26 11:41:29,496] [INFO] [logging.py:69:log_dist] [Rank 0] DeepSpeed info: version=0.7.0+b6305d0e, git-hash=b6305d0e, git-branch=master
[2022-07-26 11:41:29,496] [INFO] [logging.py:69:log_dist] [Rank 0] quantize_bits = 8 mlp_extra_grouping = False, quantize_groups = 1
Using /net/llm-shared-nfs/nfs/mayank/.cache/torch_extensions/py38_cu116 as PyTorch extensions root...
Using /net/llm-shared-nfs/nfs/mayank/.cache/torch_extensions/py38_cu116 as PyTorch extensions root...
Using /net/llm-shared-nfs/nfs/mayank/.cache/torch_extensions/py38_cu116 as PyTorch extensions root...
Using /net/llm-shared-nfs/nfs/mayank/.cache/torch_extensions/py38_cu116 as PyTorch extensions root...
Using /net/llm-shared-nfs/nfs/mayank/.cache/torch_extensions/py38_cu116 as PyTorch extensions root...
Using /net/llm-shared-nfs/nfs/mayank/.cache/torch_extensions/py38_cu116 as PyTorch extensions root...
Using /net/llm-shared-nfs/nfs/mayank/.cache/torch_extensions/py38_cu116 as PyTorch extensions root...
Using /net/llm-shared-nfs/nfs/mayank/.cache/torch_extensions/py38_cu116 as PyTorch extensions root...
Detected CUDA files, patching ldflags
Emitting ninja build file /net/llm-shared-nfs/nfs/mayank/.cache/torch_extensions/py38_cu116/transformer_inference/build.ninja...
Building extension module transformer_inference...
Allowing ninja to set a default number of workers... (overridable by setting the environment variable MAX_JOBS=N)
ninja: no work to do.
Loading extension module transformer_inference...
Time to load transformer_inference op: 0.25245213508605957 seconds
[2022-07-26 11:41:30,151] [INFO] [logging.py:69:log_dist] [Rank 0] DeepSpeed-Inference config: {'layer_id': 0, 'hidden_size': 14336, 'intermediate_size': 57344, 'heads': 112, 'num_hidden_layers': -1, 'fp16': True, 'pre_layer_norm': True, 'local_rank': -1, 'stochastic_mode': False, 'epsilon': 1e-05, 'mp_size': 8, 'q_int8': False, 'scale_attention': True, 'triangular_masking': True, 'local_attention': False, 'window_size': 1, 'rotary_dim': -1, 'rotate_half': False, 'rotate_every_two': True, 'return_tuple': True, 'mlp_after_attn': True, 'specialized_mode': False, 'training_mp_size': 1, 'bigscience_bloom': True}
Loading extension module transformer_inference...
Time to load transformer_inference op: 0.2497098445892334 seconds
Loading extension module transformer_inference...
Loading extension module transformer_inference...
Time to load transformer_inference op: 0.2436366081237793 seconds
Loading extension module transformer_inference...
Time to load transformer_inference op: 0.24797964096069336 seconds
Time to load transformer_inference op: 0.24489784240722656 seconds
Loading extension module transformer_inference...
Time to load transformer_inference op: 0.2467021942138672 seconds
Loading extension module transformer_inference...
Loading extension module transformer_inference...
Time to load transformer_inference op: 0.24748826026916504 seconds
Time to load transformer_inference op: 0.24941658973693848 seconds
Loading 72 checkpoint shards: 0%| | 0/72 [11:08<?, ?it/s]
[2022-07-26 11:52:39,789] [INFO] [engine.py:145:__init__] Place model to device: 6
Loading 72 checkpoint shards: 0%| | 0/72 [11:09<?, ?it/s]
[2022-07-26 11:52:39,989] [INFO] [engine.py:145:__init__] Place model to device: 1
Loading 72 checkpoint shards: 0%| | 0/72 [11:10<?, ?it/s]
[2022-07-26 11:52:41,127] [INFO] [engine.py:145:__init__] Place model to device: 3
Loading 72 checkpoint shards: 0%| | 0/72 [11:14<?, ?it/s]
[2022-07-26 11:52:45,432] [INFO] [engine.py:145:__init__] Place model to device: 5
Loading 72 checkpoint shards: 0%| | 0/72 [11:22<?, ?it/s]
[2022-07-26 11:52:53,353] [INFO] [engine.py:145:__init__] Place model to device: 7
Loading 72 checkpoint shards: 0%| | 0/72 [11:24<?, ?it/s]
[2022-07-26 11:52:55,107] [INFO] [engine.py:145:__init__] Place model to device: 2
Loading 72 checkpoint shards: 100%|██████████| 72/72 [11:24<00:00, 9.51s/it]
[2022-07-26 11:52:55,582] [INFO] [engine.py:145:__init__] Place model to device: 0
[2022-07-26 11:52:55,707] [INFO] [utils.py:827:see_memory_usage] post-ds-inference-init
[2022-07-26 11:52:55,708] [INFO] [utils.py:828:see_memory_usage] MA 47.04 GB Max_MA 47.24 GB CA 47.04 GB Max_CA 47 GB
[2022-07-26 11:52:55,709] [INFO] [utils.py:836:see_memory_usage] CPU Virtual Memory: used = 25.77 GB, percent = 2.0%
*** Starting to generate 100 tokens with bs=1
Generate args {'max_new_tokens': 100, 'do_sample': False}
Loading 72 checkpoint shards: 0%| | 0/72 [11:25<?, ?it/s]
[2022-07-26 11:52:56,613] [INFO] [engine.py:145:__init__] Place model to device: 4
llm-test-cluster-9:1281342:1283501 [1] NCCL INFO Trees [0] 2/-1/-1->1->0 [1] 2/-1/-1->1->0 [2] 2/-1/-1->1->0 [3] 2/-1/-1->1->0 [4] 2/-1/-1->1->0 [5] 2/-1/-1->1->0 [6] 2/-1/-1->1->0 [7] 2/-1/-1->1->0 [8] 2/-1/-1->1->0 [9] 2/-1/-1->1->0 [10] 2/-1/-1->1->0 [11] 2/-1/-1->1->0 [12] 2/-1/-1->1->0 [13] 2/-1/-1->1->0 [14] 2/-1/-1->1->0 [15] 2/-1/-1->1->0 [16] 2/-1/-1->1->0 [17] 2/-1/-1->1->0 [18] 2/-1/-1->1->0 [19] 2/-1/-1->1->0 [20] 2/-1/-1->1->0 [21] 2/-1/-1->1->0 [22] 2/-1/-1->1->0 [23] 2/-1/-1->1->0
llm-test-cluster-9:1281341:1283500 [0] NCCL INFO Channel 00/24 : 0 1 2 3 4 5 6 7
llm-test-cluster-9:1281341:1283500 [0] NCCL INFO Channel 01/24 : 0 1 2 3 4 5 6 7
llm-test-cluster-9:1281341:1283500 [0] NCCL INFO Channel 02/24 : 0 1 2 3 4 5 6 7
llm-test-cluster-9:1281341:1283500 [0] NCCL INFO Channel 03/24 : 0 1 2 3 4 5 6 7
llm-test-cluster-9:1281341:1283500 [0] NCCL INFO Channel 04/24 : 0 1 2 3 4 5 6 7
llm-test-cluster-9:1281341:1283500 [0] NCCL INFO Channel 05/24 : 0 1 2 3 4 5 6 7
llm-test-cluster-9:1281341:1283500 [0] NCCL INFO Channel 06/24 : 0 1 2 3 4 5 6 7
llm-test-cluster-9:1281341:1283500 [0] NCCL INFO Channel 07/24 : 0 1 2 3 4 5 6 7
llm-test-cluster-9:1281341:1283500 [0] NCCL INFO Channel 08/24 : 0 1 2 3 4 5 6 7
llm-test-cluster-9:1281341:1283500 [0] NCCL INFO Channel 09/24 : 0 1 2 3 4 5 6 7
llm-test-cluster-9:1281343:1283503 [2] NCCL INFO Trees [0] 3/-1/-1->2->1 [1] 3/-1/-1->2->1 [2] 3/-1/-1->2->1 [3] 3/-1/-1->2->1 [4] 3/-1/-1->2->1 [5] 3/-1/-1->2->1 [6] 3/-1/-1->2->1 [7] 3/-1/-1->2->1 [8] 3/-1/-1->2->1 [9] 3/-1/-1->2->1 [10] 3/-1/-1->2->1 [11] 3/-1/-1->2->1 [12] 3/-1/-1->2->1 [13] 3/-1/-1->2->1 [14] 3/-1/-1->2->1 [15] 3/-1/-1->2->1 [16] 3/-1/-1->2->1 [17] 3/-1/-1->2->1 [18] 3/-1/-1->2->1 [19] 3/-1/-1->2->1 [20] 3/-1/-1->2->1 [21] 3/-1/-1->2->1 [22] 3/-1/-1->2->1 [23] 3/-1/-1->2->1
llm-test-cluster-9:1281341:1283500 [0] NCCL INFO Channel 10/24 : 0 1 2 3 4 5 6 7
llm-test-cluster-9:1281341:1283500 [0] NCCL INFO Channel 11/24 : 0 1 2 3 4 5 6 7
llm-test-cluster-9:1281341:1283500 [0] NCCL INFO Channel 12/24 : 0 1 2 3 4 5 6 7
llm-test-cluster-9:1281341:1283500 [0] NCCL INFO Channel 13/24 : 0 1 2 3 4 5 6 7
llm-test-cluster-9:1281341:1283500 [0] NCCL INFO Channel 14/24 : 0 1 2 3 4 5 6 7
llm-test-cluster-9:1281341:1283500 [0] NCCL INFO Channel 15/24 : 0 1 2 3 4 5 6 7
llm-test-cluster-9:1281344:1283502 [3] NCCL INFO Trees [0] 4/-1/-1->3->2 [1] 4/-1/-1->3->2 [2] 4/-1/-1->3->2 [3] 4/-1/-1->3->2 [4] 4/-1/-1->3->2 [5] 4/-1/-1->3->2 [6] 4/-1/-1->3->2 [7] 4/-1/-1->3->2 [8] 4/-1/-1->3->2 [9] 4/-1/-1->3->2 [10] 4/-1/-1->3->2 [11] 4/-1/-1->3->2 [12] 4/-1/-1->3->2 [13] 4/-1/-1->3->2 [14] 4/-1/-1->3->2 [15] 4/-1/-1->3->2 [16] 4/-1/-1->3->2 [17] 4/-1/-1->3->2 [18] 4/-1/-1->3->2 [19] 4/-1/-1->3->2 [20] 4/-1/-1->3->2 [21] 4/-1/-1->3->2 [22] 4/-1/-1->3->2 [23] 4/-1/-1->3->2
llm-test-cluster-9:1281341:1283500 [0] NCCL INFO Channel 16/24 : 0 1 2 3 4 5 6 7
llm-test-cluster-9:1281341:1283500 [0] NCCL INFO Channel 17/24 : 0 1 2 3 4 5 6 7
llm-test-cluster-9:1281341:1283500 [0] NCCL INFO Channel 18/24 : 0 1 2 3 4 5 6 7
llm-test-cluster-9:1281341:1283500 [0] NCCL INFO Channel 19/24 : 0 1 2 3 4 5 6 7
llm-test-cluster-9:1281341:1283500 [0] NCCL INFO Channel 20/24 : 0 1 2 3 4 5 6 7
llm-test-cluster-9:1281341:1283500 [0] NCCL INFO Channel 21/24 : 0 1 2 3 4 5 6 7
llm-test-cluster-9:1281341:1283500 [0] NCCL INFO Channel 22/24 : 0 1 2 3 4 5 6 7
llm-test-cluster-9:1281341:1283500 [0] NCCL INFO Channel 23/24 : 0 1 2 3 4 5 6 7
llm-test-cluster-9:1281341:1283500 [0] NCCL INFO Trees [0] 1/-1/-1->0->-1 [1] 1/-1/-1->0->-1 [2] 1/-1/-1->0->-1 [3] 1/-1/-1->0->-1 [4] 1/-1/-1->0->-1 [5] 1/-1/-1->0->-1 [6] 1/-1/-1->0->-1 [7] 1/-1/-1->0->-1 [8] 1/-1/-1->0->-1 [9] 1/-1/-1->0->-1 [10] 1/-1/-1->0->-1 [11] 1/-1/-1->0->-1 [12] 1/-1/-1->0->-1 [13] 1/-1/-1->0->-1 [14] 1/-1/-1->0->-1 [15] 1/-1/-1->0->-1 [16] 1/-1/-1->0->-1 [17] 1/-1/-1->0->-1 [18] 1/-1/-1->0->-1 [19] 1/-1/-1->0->-1 [20] 1/-1/-1->0->-1 [21] 1/-1/-1->0->-1 [22] 1/-1/-1->0->-1 [23] 1/-1/-1->0->-1
llm-test-cluster-9:1281347:1283504 [6] NCCL INFO Trees [0] 7/-1/-1->6->5 [1] 7/-1/-1->6->5 [2] 7/-1/-1->6->5 [3] 7/-1/-1->6->5 [4] 7/-1/-1->6->5 [5] 7/-1/-1->6->5 [6] 7/-1/-1->6->5 [7] 7/-1/-1->6->5 [8] 7/-1/-1->6->5 [9] 7/-1/-1->6->5 [10] 7/-1/-1->6->5 [11] 7/-1/-1->6->5 [12] 7/-1/-1->6->5 [13] 7/-1/-1->6->5 [14] 7/-1/-1->6->5 [15] 7/-1/-1->6->5 [16] 7/-1/-1->6->5 [17] 7/-1/-1->6->5 [18] 7/-1/-1->6->5 [19] 7/-1/-1->6->5 [20] 7/-1/-1->6->5 [21] 7/-1/-1->6->5 [22] 7/-1/-1->6->5 [23] 7/-1/-1->6->5
llm-test-cluster-9:1281345:1283507 [4] NCCL INFO Trees [0] 5/-1/-1->4->3 [1] 5/-1/-1->4->3 [2] 5/-1/-1->4->3 [3] 5/-1/-1->4->3 [4] 5/-1/-1->4->3 [5] 5/-1/-1->4->3 [6] 5/-1/-1->4->3 [7] 5/-1/-1->4->3 [8] 5/-1/-1->4->3 [9] 5/-1/-1->4->3 [10] 5/-1/-1->4->3 [11] 5/-1/-1->4->3 [12] 5/-1/-1->4->3 [13] 5/-1/-1->4->3 [14] 5/-1/-1->4->3 [15] 5/-1/-1->4->3 [16] 5/-1/-1->4->3 [17] 5/-1/-1->4->3 [18] 5/-1/-1->4->3 [19] 5/-1/-1->4->3 [20] 5/-1/-1->4->3 [21] 5/-1/-1->4->3 [22] 5/-1/-1->4->3 [23] 5/-1/-1->4->3
llm-test-cluster-9:1281346:1283505 [5] NCCL INFO Trees [0] 6/-1/-1->5->4 [1] 6/-1/-1->5->4 [2] 6/-1/-1->5->4 [3] 6/-1/-1->5->4 [4] 6/-1/-1->5->4 [5] 6/-1/-1->5->4 [6] 6/-1/-1->5->4 [7] 6/-1/-1->5->4 [8] 6/-1/-1->5->4 [9] 6/-1/-1->5->4 [10] 6/-1/-1->5->4 [11] 6/-1/-1->5->4 [12] 6/-1/-1->5->4 [13] 6/-1/-1->5->4 [14] 6/-1/-1->5->4 [15] 6/-1/-1->5->4 [16] 6/-1/-1->5->4 [17] 6/-1/-1->5->4 [18] 6/-1/-1->5->4 [19] 6/-1/-1->5->4 [20] 6/-1/-1->5->4 [21] 6/-1/-1->5->4 [22] 6/-1/-1->5->4 [23] 6/-1/-1->5->4
llm-test-cluster-9:1281348:1283506 [7] NCCL INFO Trees [0] -1/-1/-1->7->6 [1] -1/-1/-1->7->6 [2] -1/-1/-1->7->6 [3] -1/-1/-1->7->6 [4] -1/-1/-1->7->6 [5] -1/-1/-1->7->6 [6] -1/-1/-1->7->6 [7] -1/-1/-1->7->6 [8] -1/-1/-1->7->6 [9] -1/-1/-1->7->6 [10] -1/-1/-1->7->6 [11] -1/-1/-1->7->6 [12] -1/-1/-1->7->6 [13] -1/-1/-1->7->6 [14] -1/-1/-1->7->6 [15] -1/-1/-1->7->6 [16] -1/-1/-1->7->6 [17] -1/-1/-1->7->6 [18] -1/-1/-1->7->6 [19] -1/-1/-1->7->6 [20] -1/-1/-1->7->6 [21] -1/-1/-1->7->6 [22] -1/-1/-1->7->6 [23] -1/-1/-1->7->6
llm-test-cluster-9:1281342:1283501 [1] include/alloc.h:50 NCCL WARN Cuda failure 'an illegal memory access was encountered'
llm-test-cluster-9:1281342:1283501 [1] NCCL INFO channel.cc:20 -> 1
llm-test-cluster-9:1281342:1283501 [1] NCCL INFO init.cc:373 -> 1
llm-test-cluster-9:1281342:1283501 [1] NCCL INFO init.cc:774 -> 1
llm-test-cluster-9:1281342:1283501 [1] NCCL INFO init.cc:904 -> 1
llm-test-cluster-9:1281342:1283501 [1] NCCL INFO group.cc:72 -> 1 [Async thread]
Traceback (most recent call last):
File "scripts/inference/bloom-ds-inference.py", line 257, in <module>
_ = generate()
File "scripts/inference/bloom-ds-inference.py", line 244, in generate
outputs = model.generate(**input_tokens, **generate_kwargs)
File "/net/llm-shared-nfs/nfs/yelkurdi/conda/miniconda3/envs/bloom/lib/python3.8/site-packages/torch/autograd/grad_mode.py", line 27, in decorate_context
return func(*args, **kwargs)
File "/net/llm-shared-nfs/nfs/yelkurdi/conda/miniconda3/envs/bloom/lib/python3.8/site-packages/transformers/generation_utils.py", line 1288, in generate
return self.greedy_search(
File "/net/llm-shared-nfs/nfs/yelkurdi/conda/miniconda3/envs/bloom/lib/python3.8/site-packages/transformers/generation_utils.py", line 1683, in greedy_search
outputs = self(
File "/net/llm-shared-nfs/nfs/yelkurdi/conda/miniconda3/envs/bloom/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1110, in _call_impl
return forward_call(*input, **kwargs)
File "/net/llm-shared-nfs/nfs/mayank/DeepSpeed/deepspeed/inference/engine.py", line 508, in forward
outputs = self.model_orig_fwd(*inputs, **kwargs)
File "/net/llm-shared-nfs/nfs/yelkurdi/conda/miniconda3/envs/bloom/lib/python3.8/site-packages/transformers/models/bloom/modeling_bloom.py", line 919, in forward
transformer_outputs = self.transformer(
File "/net/llm-shared-nfs/nfs/yelkurdi/conda/miniconda3/envs/bloom/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1110, in _call_impl
return forward_call(*input, **kwargs)
File "/net/llm-shared-nfs/nfs/yelkurdi/conda/miniconda3/envs/bloom/lib/python3.8/site-packages/transformers/models/bloom/modeling_bloom.py", line 806, in forward
outputs = block(
File "/net/llm-shared-nfs/nfs/yelkurdi/conda/miniconda3/envs/bloom/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1110, in _call_impl
return forward_call(*input, **kwargs)
File "/net/llm-shared-nfs/nfs/mayank/DeepSpeed/deepspeed/ops/transformer/inference/transformer_inference.py", line 831, in forward
self.attention(input,
File "/net/llm-shared-nfs/nfs/yelkurdi/conda/miniconda3/envs/bloom/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1110, in _call_impl
return forward_call(*input, **kwargs)
File "/net/llm-shared-nfs/nfs/mayank/DeepSpeed/deepspeed/ops/transformer/inference/transformer_inference.py", line 543, in forward
output = DeepSpeedSelfAttentionFunction.apply(
File "/net/llm-shared-nfs/nfs/mayank/DeepSpeed/deepspeed/ops/transformer/inference/transformer_inference.py", line 466, in forward
dist.all_reduce(output, group=mp_group)
File "/net/llm-shared-nfs/nfs/mayank/DeepSpeed/deepspeed/comm/comm.py", line 312, in all_reduce
return cdb.all_reduce(tensor, op, group, async_op)
File "/net/llm-shared-nfs/nfs/mayank/DeepSpeed/deepspeed/comm/torch.py", line 49, in all_reduce
return torch.distributed.all_reduce(tensor=tensor,
File "/net/llm-shared-nfs/nfs/yelkurdi/conda/miniconda3/envs/bloom/lib/python3.8/site-packages/torch/distributed/distributed_c10d.py", line 1316, in all_reduce
work = group.allreduce([tensor], opts)
RuntimeError: NCCL error in: /net/llm-shared-nfs/nfs/mayank/pytorch/torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1169, unhandled cuda error, NCCL version 21.0.3
ncclUnhandledCudaError: Call to CUDA function failed.
llm-test-cluster-9:1281344:1283502 [3] include/alloc.h:50 NCCL WARN Cuda failure 'an illegal memory access was encountered'
llm-test-cluster-9:1281344:1283502 [3] NCCL INFO channel.cc:20 -> 1
llm-test-cluster-9:1281344:1283502 [3] NCCL INFO init.cc:373 -> 1
llm-test-cluster-9:1281344:1283502 [3] NCCL INFO init.cc:774 -> 1
llm-test-cluster-9:1281344:1283502 [3] NCCL INFO init.cc:904 -> 1
llm-test-cluster-9:1281344:1283502 [3] NCCL INFO group.cc:72 -> 1 [Async thread]
Traceback (most recent call last):
File "scripts/inference/bloom-ds-inference.py", line 257, in <module>
_ = generate()
File "scripts/inference/bloom-ds-inference.py", line 244, in generate
outputs = model.generate(**input_tokens, **generate_kwargs)
File "/net/llm-shared-nfs/nfs/yelkurdi/conda/miniconda3/envs/bloom/lib/python3.8/site-packages/torch/autograd/grad_mode.py", line 27, in decorate_context
return func(*args, **kwargs)
File "/net/llm-shared-nfs/nfs/yelkurdi/conda/miniconda3/envs/bloom/lib/python3.8/site-packages/transformers/generation_utils.py", line 1288, in generate
return self.greedy_search(
File "/net/llm-shared-nfs/nfs/yelkurdi/conda/miniconda3/envs/bloom/lib/python3.8/site-packages/transformers/generation_utils.py", line 1683, in greedy_search
outputs = self(
File "/net/llm-shared-nfs/nfs/yelkurdi/conda/miniconda3/envs/bloom/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1110, in _call_impl
return forward_call(*input, **kwargs)
File "/net/llm-shared-nfs/nfs/mayank/DeepSpeed/deepspeed/inference/engine.py", line 508, in forward
outputs = self.model_orig_fwd(*inputs, **kwargs)
File "/net/llm-shared-nfs/nfs/yelkurdi/conda/miniconda3/envs/bloom/lib/python3.8/site-packages/transformers/models/bloom/modeling_bloom.py", line 919, in forward
transformer_outputs = self.transformer(
File "/net/llm-shared-nfs/nfs/yelkurdi/conda/miniconda3/envs/bloom/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1110, in _call_impl
return forward_call(*input, **kwargs)
File "/net/llm-shared-nfs/nfs/yelkurdi/conda/miniconda3/envs/bloom/lib/python3.8/site-packages/transformers/models/bloom/modeling_bloom.py", line 806, in forward
outputs = block(
File "/net/llm-shared-nfs/nfs/yelkurdi/conda/miniconda3/envs/bloom/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1110, in _call_impl
return forward_call(*input, **kwargs)
File "/net/llm-shared-nfs/nfs/mayank/DeepSpeed/deepspeed/ops/transformer/inference/transformer_inference.py", line 831, in forward
self.attention(input,
File "/net/llm-shared-nfs/nfs/yelkurdi/conda/miniconda3/envs/bloom/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1110, in _call_impl
return forward_call(*input, **kwargs)
File "/net/llm-shared-nfs/nfs/mayank/DeepSpeed/deepspeed/ops/transformer/inference/transformer_inference.py", line 543, in forward
output = DeepSpeedSelfAttentionFunction.apply(
File "/net/llm-shared-nfs/nfs/mayank/DeepSpeed/deepspeed/ops/transformer/inference/transformer_inference.py", line 466, in forward
dist.all_reduce(output, group=mp_group)
File "/net/llm-shared-nfs/nfs/mayank/DeepSpeed/deepspeed/comm/comm.py", line 312, in all_reduce
return cdb.all_reduce(tensor, op, group, async_op)
File "/net/llm-shared-nfs/nfs/mayank/DeepSpeed/deepspeed/comm/torch.py", line 49, in all_reduce
return torch.distributed.all_reduce(tensor=tensor,
File "/net/llm-shared-nfs/nfs/yelkurdi/conda/miniconda3/envs/bloom/lib/python3.8/site-packages/torch/distributed/distributed_c10d.py", line 1316, in all_reduce
work = group.allreduce([tensor], opts)
RuntimeError: NCCL error in: /net/llm-shared-nfs/nfs/mayank/pytorch/torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1169, unhandled cuda error, NCCL version 21.0.3
ncclUnhandledCudaError: Call to CUDA function failed.
llm-test-cluster-9:1281343:1283503 [2] include/alloc.h:50 NCCL WARN Cuda failure 'an illegal memory access was encountered'
llm-test-cluster-9:1281343:1283503 [2] NCCL INFO channel.cc:20 -> 1
llm-test-cluster-9:1281343:1283503 [2] NCCL INFO init.cc:373 -> 1
llm-test-cluster-9:1281343:1283503 [2] NCCL INFO init.cc:774 -> 1
llm-test-cluster-9:1281343:1283503 [2] NCCL INFO init.cc:904 -> 1
llm-test-cluster-9:1281343:1283503 [2] NCCL INFO group.cc:72 -> 1 [Async thread]
Traceback (most recent call last):
File "scripts/inference/bloom-ds-inference.py", line 257, in <module>
_ = generate()
File "scripts/inference/bloom-ds-inference.py", line 244, in generate
outputs = model.generate(**input_tokens, **generate_kwargs)
File "/net/llm-shared-nfs/nfs/yelkurdi/conda/miniconda3/envs/bloom/lib/python3.8/site-packages/torch/autograd/grad_mode.py", line 27, in decorate_context
return func(*args, **kwargs)
File "/net/llm-shared-nfs/nfs/yelkurdi/conda/miniconda3/envs/bloom/lib/python3.8/site-packages/transformers/generation_utils.py", line 1288, in generate
return self.greedy_search(
File "/net/llm-shared-nfs/nfs/yelkurdi/conda/miniconda3/envs/bloom/lib/python3.8/site-packages/transformers/generation_utils.py", line 1683, in greedy_search
outputs = self(
File "/net/llm-shared-nfs/nfs/yelkurdi/conda/miniconda3/envs/bloom/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1110, in _call_impl
return forward_call(*input, **kwargs)
File "/net/llm-shared-nfs/nfs/mayank/DeepSpeed/deepspeed/inference/engine.py", line 508, in forward
outputs = self.model_orig_fwd(*inputs, **kwargs)
File "/net/llm-shared-nfs/nfs/yelkurdi/conda/miniconda3/envs/bloom/lib/python3.8/site-packages/transformers/models/bloom/modeling_bloom.py", line 919, in forward
transformer_outputs = self.transformer(
File "/net/llm-shared-nfs/nfs/yelkurdi/conda/miniconda3/envs/bloom/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1110, in _call_impl
return forward_call(*input, **kwargs)
File "/net/llm-shared-nfs/nfs/yelkurdi/conda/miniconda3/envs/bloom/lib/python3.8/site-packages/transformers/models/bloom/modeling_bloom.py", line 806, in forward
outputs = block(
File "/net/llm-shared-nfs/nfs/yelkurdi/conda/miniconda3/envs/bloom/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1110, in _call_impl
return forward_call(*input, **kwargs)
File "/net/llm-shared-nfs/nfs/mayank/DeepSpeed/deepspeed/ops/transformer/inference/transformer_inference.py", line 831, in forward
self.attention(input,
File "/net/llm-shared-nfs/nfs/yelkurdi/conda/miniconda3/envs/bloom/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1110, in _call_impl
return forward_call(*input, **kwargs)
File "/net/llm-shared-nfs/nfs/mayank/DeepSpeed/deepspeed/ops/transformer/inference/transformer_inference.py", line 543, in forward
output = DeepSpeedSelfAttentionFunction.apply(
File "/net/llm-shared-nfs/nfs/mayank/DeepSpeed/deepspeed/ops/transformer/inference/transformer_inference.py", line 466, in forward
dist.all_reduce(output, group=mp_group)
File "/net/llm-shared-nfs/nfs/mayank/DeepSpeed/deepspeed/comm/comm.py", line 312, in all_reduce
return cdb.all_reduce(tensor, op, group, async_op)
File "/net/llm-shared-nfs/nfs/mayank/DeepSpeed/deepspeed/comm/torch.py", line 49, in all_reduce
return torch.distributed.all_reduce(tensor=tensor,
File "/net/llm-shared-nfs/nfs/yelkurdi/conda/miniconda3/envs/bloom/lib/python3.8/site-packages/torch/distributed/distributed_c10d.py", line 1316, in all_reduce
work = group.allreduce([tensor], opts)
RuntimeError: NCCL error in: /net/llm-shared-nfs/nfs/mayank/pytorch/torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1169, unhandled cuda error, NCCL version 21.0.3
ncclUnhandledCudaError: Call to CUDA function failed.
llm-test-cluster-9:1281347:1283504 [6] include/alloc.h:50 NCCL WARN Cuda failure 'an illegal memory access was encountered'
llm-test-cluster-9:1281347:1283504 [6] NCCL INFO channel.cc:20 -> 1
llm-test-cluster-9:1281347:1283504 [6] NCCL INFO init.cc:373 -> 1
llm-test-cluster-9:1281347:1283504 [6] NCCL INFO init.cc:774 -> 1
llm-test-cluster-9:1281347:1283504 [6] NCCL INFO init.cc:904 -> 1
llm-test-cluster-9:1281347:1283504 [6] NCCL INFO group.cc:72 -> 1 [Async thread]
Traceback (most recent call last):
File "scripts/inference/bloom-ds-inference.py", line 257, in <module>
_ = generate()
File "scripts/inference/bloom-ds-inference.py", line 244, in generate
outputs = model.generate(**input_tokens, **generate_kwargs)
File "/net/llm-shared-nfs/nfs/yelkurdi/conda/miniconda3/envs/bloom/lib/python3.8/site-packages/torch/autograd/grad_mode.py", line 27, in decorate_context
return func(*args, **kwargs)
File "/net/llm-shared-nfs/nfs/yelkurdi/conda/miniconda3/envs/bloom/lib/python3.8/site-packages/transformers/generation_utils.py", line 1288, in generate
return self.greedy_search(
File "/net/llm-shared-nfs/nfs/yelkurdi/conda/miniconda3/envs/bloom/lib/python3.8/site-packages/transformers/generation_utils.py", line 1683, in greedy_search
outputs = self(
File "/net/llm-shared-nfs/nfs/yelkurdi/conda/miniconda3/envs/bloom/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1110, in _call_impl
return forward_call(*input, **kwargs)
File "/net/llm-shared-nfs/nfs/mayank/DeepSpeed/deepspeed/inference/engine.py", line 508, in forward
outputs = self.model_orig_fwd(*inputs, **kwargs)
File "/net/llm-shared-nfs/nfs/yelkurdi/conda/miniconda3/envs/bloom/lib/python3.8/site-packages/transformers/models/bloom/modeling_bloom.py", line 919, in forward
transformer_outputs = self.transformer(
File "/net/llm-shared-nfs/nfs/yelkurdi/conda/miniconda3/envs/bloom/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1110, in _call_impl
return forward_call(*input, **kwargs)
File "/net/llm-shared-nfs/nfs/yelkurdi/conda/miniconda3/envs/bloom/lib/python3.8/site-packages/transformers/models/bloom/modeling_bloom.py", line 806, in forward
outputs = block(
File "/net/llm-shared-nfs/nfs/yelkurdi/conda/miniconda3/envs/bloom/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1110, in _call_impl
return forward_call(*input, **kwargs)
File "/net/llm-shared-nfs/nfs/mayank/DeepSpeed/deepspeed/ops/transformer/inference/transformer_inference.py", line 831, in forward
self.attention(input,
File "/net/llm-shared-nfs/nfs/yelkurdi/conda/miniconda3/envs/bloom/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1110, in _call_impl
return forward_call(*input, **kwargs)
File "/net/llm-shared-nfs/nfs/mayank/DeepSpeed/deepspeed/ops/transformer/inference/transformer_inference.py", line 543, in forward
output = DeepSpeedSelfAttentionFunction.apply(
File "/net/llm-shared-nfs/nfs/mayank/DeepSpeed/deepspeed/ops/transformer/inference/transformer_inference.py", line 466, in forward
dist.all_reduce(output, group=mp_group)
File "/net/llm-shared-nfs/nfs/mayank/DeepSpeed/deepspeed/comm/comm.py", line 312, in all_reduce
return cdb.all_reduce(tensor, op, group, async_op)
File "/net/llm-shared-nfs/nfs/mayank/DeepSpeed/deepspeed/comm/torch.py", line 49, in all_reduce
return torch.distributed.all_reduce(tensor=tensor,
File "/net/llm-shared-nfs/nfs/yelkurdi/conda/miniconda3/envs/bloom/lib/python3.8/site-packages/torch/distributed/distributed_c10d.py", line 1316, in all_reduce
work = group.allreduce([tensor], opts)
RuntimeError: NCCL error in: /net/llm-shared-nfs/nfs/mayank/pytorch/torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1169, unhandled cuda error, NCCL version 21.0.3
ncclUnhandledCudaError: Call to CUDA function failed.
llm-test-cluster-9:1281346:1283505 [5] include/alloc.h:50 NCCL WARN Cuda failure 'an illegal memory access was encountered'
llm-test-cluster-9:1281346:1283505 [5] NCCL INFO channel.cc:20 -> 1
llm-test-cluster-9:1281346:1283505 [5] NCCL INFO init.cc:373 -> 1
llm-test-cluster-9:1281346:1283505 [5] NCCL INFO init.cc:774 -> 1
llm-test-cluster-9:1281346:1283505 [5] NCCL INFO init.cc:904 -> 1
llm-test-cluster-9:1281346:1283505 [5] NCCL INFO group.cc:72 -> 1 [Async thread]
Traceback (most recent call last):
File "scripts/inference/bloom-ds-inference.py", line 257, in <module>
_ = generate()
File "scripts/inference/bloom-ds-inference.py", line 244, in generate
outputs = model.generate(**input_tokens, **generate_kwargs)
File "/net/llm-shared-nfs/nfs/yelkurdi/conda/miniconda3/envs/bloom/lib/python3.8/site-packages/torch/autograd/grad_mode.py", line 27, in decorate_context
return func(*args, **kwargs)
File "/net/llm-shared-nfs/nfs/yelkurdi/conda/miniconda3/envs/bloom/lib/python3.8/site-packages/transformers/generation_utils.py", line 1288, in generate
return self.greedy_search(
File "/net/llm-shared-nfs/nfs/yelkurdi/conda/miniconda3/envs/bloom/lib/python3.8/site-packages/transformers/generation_utils.py", line 1683, in greedy_search
outputs = self(
File "/net/llm-shared-nfs/nfs/yelkurdi/conda/miniconda3/envs/bloom/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1110, in _call_impl
return forward_call(*input, **kwargs)
File "/net/llm-shared-nfs/nfs/mayank/DeepSpeed/deepspeed/inference/engine.py", line 508, in forward
outputs = self.model_orig_fwd(*inputs, **kwargs)
File "/net/llm-shared-nfs/nfs/yelkurdi/conda/miniconda3/envs/bloom/lib/python3.8/site-packages/transformers/models/bloom/modeling_bloom.py", line 919, in forward
transformer_outputs = self.transformer(
File "/net/llm-shared-nfs/nfs/yelkurdi/conda/miniconda3/envs/bloom/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1110, in _call_impl
return forward_call(*input, **kwargs)
File "/net/llm-shared-nfs/nfs/yelkurdi/conda/miniconda3/envs/bloom/lib/python3.8/site-packages/transformers/models/bloom/modeling_bloom.py", line 806, in forward
outputs = block(
File "/net/llm-shared-nfs/nfs/yelkurdi/conda/miniconda3/envs/bloom/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1110, in _call_impl
return forward_call(*input, **kwargs)
File "/net/llm-shared-nfs/nfs/mayank/DeepSpeed/deepspeed/ops/transformer/inference/transformer_inference.py", line 831, in forward
self.attention(input,
File "/net/llm-shared-nfs/nfs/yelkurdi/conda/miniconda3/envs/bloom/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1110, in _call_impl
return forward_call(*input, **kwargs)
File "/net/llm-shared-nfs/nfs/mayank/DeepSpeed/deepspeed/ops/transformer/inference/transformer_inference.py", line 543, in forward
output = DeepSpeedSelfAttentionFunction.apply(
File "/net/llm-shared-nfs/nfs/mayank/DeepSpeed/deepspeed/ops/transformer/inference/transformer_inference.py", line 466, in forward
dist.all_reduce(output, group=mp_group)
File "/net/llm-shared-nfs/nfs/mayank/DeepSpeed/deepspeed/comm/comm.py", line 312, in all_reduce
return cdb.all_reduce(tensor, op, group, async_op)
File "/net/llm-shared-nfs/nfs/mayank/DeepSpeed/deepspeed/comm/torch.py", line 49, in all_reduce
return torch.distributed.all_reduce(tensor=tensor,
File "/net/llm-shared-nfs/nfs/yelkurdi/conda/miniconda3/envs/bloom/lib/python3.8/site-packages/torch/distributed/distributed_c10d.py", line 1316, in all_reduce
work = group.allreduce([tensor], opts)
RuntimeError: NCCL error in: /net/llm-shared-nfs/nfs/mayank/pytorch/torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1169, unhandled cuda error, NCCL version 21.0.3
ncclUnhandledCudaError: Call to CUDA function failed.
llm-test-cluster-9:1281348:1283506 [7] include/alloc.h:50 NCCL WARN Cuda failure 'an illegal memory access was encountered'
llm-test-cluster-9:1281348:1283506 [7] NCCL INFO channel.cc:20 -> 1
llm-test-cluster-9:1281348:1283506 [7] NCCL INFO init.cc:373 -> 1
llm-test-cluster-9:1281348:1283506 [7] NCCL INFO init.cc:774 -> 1
llm-test-cluster-9:1281348:1283506 [7] NCCL INFO init.cc:904 -> 1
llm-test-cluster-9:1281348:1283506 [7] NCCL INFO group.cc:72 -> 1 [Async thread]
Traceback (most recent call last):
File "scripts/inference/bloom-ds-inference.py", line 257, in <module>
_ = generate()
File "scripts/inference/bloom-ds-inference.py", line 244, in generate
outputs = model.generate(**input_tokens, **generate_kwargs)
File "/net/llm-shared-nfs/nfs/yelkurdi/conda/miniconda3/envs/bloom/lib/python3.8/site-packages/torch/autograd/grad_mode.py", line 27, in decorate_context
return func(*args, **kwargs)
File "/net/llm-shared-nfs/nfs/yelkurdi/conda/miniconda3/envs/bloom/lib/python3.8/site-packages/transformers/generation_utils.py", line 1288, in generate
return self.greedy_search(
File "/net/llm-shared-nfs/nfs/yelkurdi/conda/miniconda3/envs/bloom/lib/python3.8/site-packages/transformers/generation_utils.py", line 1683, in greedy_search
outputs = self(
File "/net/llm-shared-nfs/nfs/yelkurdi/conda/miniconda3/envs/bloom/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1110, in _call_impl
return forward_call(*input, **kwargs)
File "/net/llm-shared-nfs/nfs/mayank/DeepSpeed/deepspeed/inference/engine.py", line 508, in forward
outputs = self.model_orig_fwd(*inputs, **kwargs)
File "/net/llm-shared-nfs/nfs/yelkurdi/conda/miniconda3/envs/bloom/lib/python3.8/site-packages/transformers/models/bloom/modeling_bloom.py", line 919, in forward
transformer_outputs = self.transformer(
File "/net/llm-shared-nfs/nfs/yelkurdi/conda/miniconda3/envs/bloom/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1110, in _call_impl
return forward_call(*input, **kwargs)
File "/net/llm-shared-nfs/nfs/yelkurdi/conda/miniconda3/envs/bloom/lib/python3.8/site-packages/transformers/models/bloom/modeling_bloom.py", line 806, in forward
outputs = block(
File "/net/llm-shared-nfs/nfs/yelkurdi/conda/miniconda3/envs/bloom/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1110, in _call_impl
return forward_call(*input, **kwargs)
File "/net/llm-shared-nfs/nfs/mayank/DeepSpeed/deepspeed/ops/transformer/inference/transformer_inference.py", line 831, in forward
self.attention(input,
File "/net/llm-shared-nfs/nfs/yelkurdi/conda/miniconda3/envs/bloom/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1110, in _call_impl
return forward_call(*input, **kwargs)
File "/net/llm-shared-nfs/nfs/mayank/DeepSpeed/deepspeed/ops/transformer/inference/transformer_inference.py", line 543, in forward
output = DeepSpeedSelfAttentionFunction.apply(
File "/net/llm-shared-nfs/nfs/mayank/DeepSpeed/deepspeed/ops/transformer/inference/transformer_inference.py", line 466, in forward
dist.all_reduce(output, group=mp_group)
File "/net/llm-shared-nfs/nfs/mayank/DeepSpeed/deepspeed/comm/comm.py", line 312, in all_reduce
return cdb.all_reduce(tensor, op, group, async_op)
File "/net/llm-shared-nfs/nfs/mayank/DeepSpeed/deepspeed/comm/torch.py", line 49, in all_reduce
return torch.distributed.all_reduce(tensor=tensor,
File "/net/llm-shared-nfs/nfs/yelkurdi/conda/miniconda3/envs/bloom/lib/python3.8/site-packages/torch/distributed/distributed_c10d.py", line 1316, in all_reduce
work = group.allreduce([tensor], opts)
RuntimeError: NCCL error in: /net/llm-shared-nfs/nfs/mayank/pytorch/torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1169, unhandled cuda error, NCCL version 21.0.3
ncclUnhandledCudaError: Call to CUDA function failed.
llm-test-cluster-9:1281345:1283507 [4] include/alloc.h:50 NCCL WARN Cuda failure 'an illegal memory access was encountered'
llm-test-cluster-9:1281345:1283507 [4] NCCL INFO channel.cc:20 -> 1
llm-test-cluster-9:1281345:1283507 [4] NCCL INFO init.cc:373 -> 1
llm-test-cluster-9:1281345:1283507 [4] NCCL INFO init.cc:774 -> 1
llm-test-cluster-9:1281345:1283507 [4] NCCL INFO init.cc:904 -> 1
llm-test-cluster-9:1281345:1283507 [4] NCCL INFO group.cc:72 -> 1 [Async thread]
Traceback (most recent call last):
File "scripts/inference/bloom-ds-inference.py", line 257, in <module>
_ = generate()
File "scripts/inference/bloom-ds-inference.py", line 244, in generate
outputs = model.generate(**input_tokens, **generate_kwargs)
File "/net/llm-shared-nfs/nfs/yelkurdi/conda/miniconda3/envs/bloom/lib/python3.8/site-packages/torch/autograd/grad_mode.py", line 27, in decorate_context
return func(*args, **kwargs)
File "/net/llm-shared-nfs/nfs/yelkurdi/conda/miniconda3/envs/bloom/lib/python3.8/site-packages/transformers/generation_utils.py", line 1288, in generate
return self.greedy_search(
File "/net/llm-shared-nfs/nfs/yelkurdi/conda/miniconda3/envs/bloom/lib/python3.8/site-packages/transformers/generation_utils.py", line 1683, in greedy_search
outputs = self(
File "/net/llm-shared-nfs/nfs/yelkurdi/conda/miniconda3/envs/bloom/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1110, in _call_impl
return forward_call(*input, **kwargs)
File "/net/llm-shared-nfs/nfs/mayank/DeepSpeed/deepspeed/inference/engine.py", line 508, in forward
outputs = self.model_orig_fwd(*inputs, **kwargs)
File "/net/llm-shared-nfs/nfs/yelkurdi/conda/miniconda3/envs/bloom/lib/python3.8/site-packages/transformers/models/bloom/modeling_bloom.py", line 919, in forward
transformer_outputs = self.transformer(
File "/net/llm-shared-nfs/nfs/yelkurdi/conda/miniconda3/envs/bloom/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1110, in _call_impl
return forward_call(*input, **kwargs)
File "/net/llm-shared-nfs/nfs/yelkurdi/conda/miniconda3/envs/bloom/lib/python3.8/site-packages/transformers/models/bloom/modeling_bloom.py", line 806, in forward
outputs = block(
File "/net/llm-shared-nfs/nfs/yelkurdi/conda/miniconda3/envs/bloom/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1110, in _call_impl
return forward_call(*input, **kwargs)
File "/net/llm-shared-nfs/nfs/mayank/DeepSpeed/deepspeed/ops/transformer/inference/transformer_inference.py", line 831, in forward
self.attention(input,
File "/net/llm-shared-nfs/nfs/yelkurdi/conda/miniconda3/envs/bloom/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1110, in _call_impl
return forward_call(*input, **kwargs)
File "/net/llm-shared-nfs/nfs/mayank/DeepSpeed/deepspeed/ops/transformer/inference/transformer_inference.py", line 543, in forward
output = DeepSpeedSelfAttentionFunction.apply(
File "/net/llm-shared-nfs/nfs/mayank/DeepSpeed/deepspeed/ops/transformer/inference/transformer_inference.py", line 466, in forward
dist.all_reduce(output, group=mp_group)
File "/net/llm-shared-nfs/nfs/mayank/DeepSpeed/deepspeed/comm/comm.py", line 312, in all_reduce
return cdb.all_reduce(tensor, op, group, async_op)
File "/net/llm-shared-nfs/nfs/mayank/DeepSpeed/deepspeed/comm/torch.py", line 49, in all_reduce
return torch.distributed.all_reduce(tensor=tensor,
File "/net/llm-shared-nfs/nfs/yelkurdi/conda/miniconda3/envs/bloom/lib/python3.8/site-packages/torch/distributed/distributed_c10d.py", line 1316, in all_reduce
work = group.allreduce([tensor], opts)
RuntimeError: NCCL error in: /net/llm-shared-nfs/nfs/mayank/pytorch/torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1169, unhandled cuda error, NCCL version 21.0.3
ncclUnhandledCudaError: Call to CUDA function failed.
llm-test-cluster-9:1281341:1283500 [0] include/alloc.h:50 NCCL WARN Cuda failure 'an illegal memory access was encountered'
llm-test-cluster-9:1281341:1283500 [0] NCCL INFO channel.cc:20 -> 1
llm-test-cluster-9:1281341:1283500 [0] NCCL INFO init.cc:373 -> 1
llm-test-cluster-9:1281341:1283500 [0] NCCL INFO init.cc:774 -> 1
llm-test-cluster-9:1281341:1283500 [0] NCCL INFO init.cc:904 -> 1
llm-test-cluster-9:1281341:1283500 [0] NCCL INFO group.cc:72 -> 1 [Async thread]
Traceback (most recent call last):
File "scripts/inference/bloom-ds-inference.py", line 257, in <module>
_ = generate()
File "scripts/inference/bloom-ds-inference.py", line 244, in generate
outputs = model.generate(**input_tokens, **generate_kwargs)
File "/net/llm-shared-nfs/nfs/yelkurdi/conda/miniconda3/envs/bloom/lib/python3.8/site-packages/torch/autograd/grad_mode.py", line 27, in decorate_context
return func(*args, **kwargs)
File "/net/llm-shared-nfs/nfs/yelkurdi/conda/miniconda3/envs/bloom/lib/python3.8/site-packages/transformers/generation_utils.py", line 1288, in generate
return self.greedy_search(
File "/net/llm-shared-nfs/nfs/yelkurdi/conda/miniconda3/envs/bloom/lib/python3.8/site-packages/transformers/generation_utils.py", line 1683, in greedy_search
outputs = self(
File "/net/llm-shared-nfs/nfs/yelkurdi/conda/miniconda3/envs/bloom/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1110, in _call_impl
return forward_call(*input, **kwargs)
File "/net/llm-shared-nfs/nfs/mayank/DeepSpeed/deepspeed/inference/engine.py", line 508, in forward
outputs = self.model_orig_fwd(*inputs, **kwargs)
File "/net/llm-shared-nfs/nfs/yelkurdi/conda/miniconda3/envs/bloom/lib/python3.8/site-packages/transformers/models/bloom/modeling_bloom.py", line 919, in forward
transformer_outputs = self.transformer(
File "/net/llm-shared-nfs/nfs/yelkurdi/conda/miniconda3/envs/bloom/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1110, in _call_impl
return forward_call(*input, **kwargs)
File "/net/llm-shared-nfs/nfs/yelkurdi/conda/miniconda3/envs/bloom/lib/python3.8/site-packages/transformers/models/bloom/modeling_bloom.py", line 806, in forward
outputs = block(
File "/net/llm-shared-nfs/nfs/yelkurdi/conda/miniconda3/envs/bloom/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1110, in _call_impl
return forward_call(*input, **kwargs)
File "/net/llm-shared-nfs/nfs/mayank/DeepSpeed/deepspeed/ops/transformer/inference/transformer_inference.py", line 831, in forward
self.attention(input,
File "/net/llm-shared-nfs/nfs/yelkurdi/conda/miniconda3/envs/bloom/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1110, in _call_impl
return forward_call(*input, **kwargs)
File "/net/llm-shared-nfs/nfs/mayank/DeepSpeed/deepspeed/ops/transformer/inference/transformer_inference.py", line 543, in forward
output = DeepSpeedSelfAttentionFunction.apply(
File "/net/llm-shared-nfs/nfs/mayank/DeepSpeed/deepspeed/ops/transformer/inference/transformer_inference.py", line 466, in forward
dist.all_reduce(output, group=mp_group)
File "/net/llm-shared-nfs/nfs/mayank/DeepSpeed/deepspeed/comm/comm.py", line 312, in all_reduce
return cdb.all_reduce(tensor, op, group, async_op)
File "/net/llm-shared-nfs/nfs/mayank/DeepSpeed/deepspeed/comm/torch.py", line 49, in all_reduce
return torch.distributed.all_reduce(tensor=tensor,
File "/net/llm-shared-nfs/nfs/yelkurdi/conda/miniconda3/envs/bloom/lib/python3.8/site-packages/torch/distributed/distributed_c10d.py", line 1316, in all_reduce
work = group.allreduce([tensor], opts)
RuntimeError: NCCL error in: /net/llm-shared-nfs/nfs/mayank/pytorch/torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1169, unhandled cuda error, NCCL version 21.0.3
ncclUnhandledCudaError: Call to CUDA function failed.
I get the same error for batch size > 1, even with CUDA_LAUNCH_BLOCKING=1:
gr062: RuntimeError: CUDA error: an illegal memory access was encountered
gr062: terminate called after throwing an instance of 'c10::CUDAError'
gr062: what(): CUDA error: an illegal memory access was encountered
gr062: Exception raised from create_event_internal at /opt/conda/conda-bld/pytorch_1656352657443/work/c10/cuda/CUDACachingAllocator.cpp:1387 (most recent call first):
gr062: frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7f7ad7777477 in /ext3/miniconda3/lib/python3.9/site-packages/torch/lib/libc10.so)
gr062: frame #1: <unknown function> + 0x1d4a3 (0x7f7b04d684a3 in /ext3/miniconda3/lib/python3.9/site-packages/torch/lib/libc10_cuda.so)
gr062: frame #2: c10::cuda::CUDACachingAllocator::raw_delete(void*) + 0x237 (0x7f7b04d6e417 in /ext3/miniconda3/lib/python3.9/site-packages/torch/lib/libc10_cuda.so)
gr062: frame #3: <unknown function> + 0x458c68 (0x7f7b1755cc68 in /ext3/miniconda3/lib/python3.9/site-packages/torch/lib/libtorch_python.so)
gr062: frame #4: c10::TensorImpl::release_resources() + 0x175 (0x7f7ad775ad95 in /ext3/miniconda3/lib/python3.9/site-packages/torch/lib/libc10.so)
gr062: frame #5: <unknown function> + 0x34db35 (0x7f7b17451b35 in /ext3/miniconda3/lib/python3.9/site-packages/torch/lib/libtorch_python.so)
gr062: frame #6: <unknown function> + 0x681fc8 (0x7f7b17785fc8 in /ext3/miniconda3/lib/python3.9/site-packages/torch/lib/libtorch_python.so)
gr062: frame #7: THPVariable_subclass_dealloc(_object*) + 0x2b5 (0x7f7b177862c5 in /ext3/miniconda3/lib/python3.9/site-packages/torch/lib/libtorch_python.so)
gr062: frame #8: <unknown function> + 0x127e28 (0x55bbd032ae28 in /ext3/miniconda3/bin/python3.9)
gr062: frame #9: <unknown function> + 0x134ad8 (0x55bbd0337ad8 in /ext3/miniconda3/bin/python3.9)
gr062: frame #10: <unknown function> + 0x1487ce (0x55bbd034b7ce in /ext3/miniconda3/bin/python3.9)
gr062: frame #11: <unknown function> + 0x1487bb (0x55bbd034b7bb in /ext3/miniconda3/bin/python3.9)
gr062: frame #12: <unknown function> + 0x1487bb (0x55bbd034b7bb in /ext3/miniconda3/bin/python3.9)
gr062: frame #13: <unknown function> + 0x1487bb (0x55bbd034b7bb in /ext3/miniconda3/bin/python3.9)
gr062: frame #14: <unknown function> + 0x1487bb (0x55bbd034b7bb in /ext3/miniconda3/bin/python3.9)
gr062: frame #15: <unknown function> + 0x11c661 (0x55bbd031f661 in /ext3/miniconda3/bin/python3.9)
gr062: frame #16: PyDict_SetItemString + 0x4a (0x55bbd032581a in /ext3/miniconda3/bin/python3.9)
gr062: frame #17: <unknown function> + 0x214aec (0x55bbd0417aec in /ext3/miniconda3/bin/python3.9)
gr062: frame #18: Py_FinalizeEx + 0x186 (0x55bbd0416f56 in /ext3/miniconda3/bin/python3.9)
gr062: frame #19: Py_RunMain + 0x112 (0x55bbd040a2b2 in /ext3/miniconda3/bin/python3.9)
gr062: frame #20: Py_BytesMain + 0x39 (0x55bbd03dcb79 in /ext3/miniconda3/bin/python3.9)
gr062: frame #21: __libc_start_main + 0xf3 (0x7f7b5cb060b3 in /lib/x86_64-linux-gnu/libc.so.6)
gr062: frame #22: <unknown function> + 0x1d9a81 (0x55bbd03dca81 in /ext3/miniconda3/bin/python3.9)
@asaparov Okay, at least this is reproducible, thanks.
I am not sure why I am getting the same error ^^ for batch size = 1. @pai4451 Any pointers?
What are your CUDA and DeepSpeed versions? I have CUDA 11.5 and DeepSpeed 0.7.0 installed from the ds-inference/bloom-fix branch, and I can run BLOOM inference with batch size 1 on two nodes.
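To make setups easier to compare, something like the following could be pasted alongside a report. It is only a sketch: the helper names are made up for illustration, and it reports the CUDA version PyTorch was built against (via torch.version.cuda), which may differ from the system toolkit.

```python
import importlib.metadata
import json


def installed_version(package):
    """Return the installed version of a package, or None if it is absent."""
    try:
        return importlib.metadata.version(package)
    except importlib.metadata.PackageNotFoundError:
        return None


def collect_versions():
    """Gather the versions people keep asking about in this thread."""
    info = {
        "deepspeed": installed_version("deepspeed"),
        "transformers": installed_version("transformers"),
        "torch": installed_version("torch"),
        "cuda (torch build)": None,
    }
    try:
        import torch  # optional: only needed for the CUDA build string

        # None on CPU-only builds of PyTorch.
        info["cuda (torch build)"] = torch.version.cuda
    except ImportError:
        pass
    return info


if __name__ == "__main__":
    print(json.dumps(collect_versions(), indent=2))
```

Note that a DeepSpeed built from a branch may still report the same version number as a release, so the commit hash is worth mentioning separately.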
I am using CUDA 11.6, and DeepSpeed is built from master.
@mayank31398 Perhaps try the ds-inference/bloom-fix branch of DeepSpeed?
I'll try this today, thanks.
Actually, I just tried running with larger batch sizes (16 and 32) and it doesn't run into the "CUDA illegal memory access" error (as I did with batch size=2). Maybe it is intermittent? Or maybe something's wrong with batch size 2 specifically.
We (with @pai4451) tried batch sizes from 8 down to 2, and all of them failed, but we have yet to try batch_size > 8. Pai will test it today to see what happens on our side.
@asaparov I tried the inference script with batch sizes 1, 2, 4, 8, 16, 32, 64 and 128. Only batch sizes 1 and 32 work, which is a bit surprising. Anyway, we'll have to wait for someone to fix the issue in this repo.
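Since the failures look intermittent, a single run per batch size may be misleading. A rough sketch of a sweep that repeats each batch size a few times is below. The function names are hypothetical, and the --batch_size flag and script path in run_deepspeed are assumptions about the inference script, so adjust them to whatever your copy accepts.

```python
import subprocess


def sweep_batch_sizes(batch_sizes, run_one, repeats=3):
    """Run each batch size several times and record pass/fail per attempt.

    run_one(bs) should return a process exit code (0 means success).
    """
    results = {}
    for bs in batch_sizes:
        attempts = []
        for _ in range(repeats):
            try:
                attempts.append(run_one(bs) == 0)
            except Exception:
                # Treat a launcher failure the same as a failed run.
                attempts.append(False)
        results[bs] = attempts
    return results


def run_deepspeed(bs, hostfile="hostfile",
                  script="scripts/inference/bloom-ds-inference.py"):
    """Launch one inference run; flag names here are assumed, not verified."""
    cmd = [
        "deepspeed", f"--hostfile={hostfile}", script,
        "--name", "bigscience/bloom",
        "--batch_size", str(bs),  # assumed flag name
    ]
    return subprocess.run(cmd).returncode


# Example (not executed here):
# print(sweep_batch_sizes([1, 2, 4, 8, 16, 32, 64, 128], run_deepspeed))
```

A batch size that passes on some attempts and fails on others would point towards an intermittent problem rather than something specific to that size.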
Hi all,
There are some new changes merged into DeepSpeed master. Would you mind trying them? I have tried with batch sizes 1 and 128, and both work on my side (I ran it on 8x A100 80GB). I will try on A100 40GB as well to make sure all is fine. Also, you can now generate MP-sharded checkpoints to load the model much faster. You can find more information in this PR: https://github.com/microsoft/DeepSpeed/pull/2132
Thanks, Reza
@RezaYazdaniAminabadi could you give some hint (where to get the doc) about "generate MP-sharded checkpoints"? So far we have only the 70 .bin files downloaded from huggingface. Do you mean there's a tool re-formatting these 70 files into world-size pieces to speed up model loading? Thanks in advance.
Hi @pohunghuang-nctu
Sure, you need to pass save_mp_checkpoint_path to the init_inference method in order to save the tp-sharded checkpoints in the path you specified. You will see that after loading the checkpoint, DeepSpeed starts saving the new checkpoints, and you will eventually have the tp-sharded checkpoints. In addition, there will be a JSON config file saved in that path (like bloom_ds-inference-config.json) that you can pass as the checkpoint argument to init_inference in the next run. Note that you can remove save_mp_checkpoint_path after you save the tp-sharded checkpoints for the first time, so that DeepSpeed doesn't always save a new checkpoint for you.
Best, Reza
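For anyone wiring this up, here is a minimal sketch of the two-phase flow described above. The helper name is made up, and the config file name is taken from the "like bloom_ds-inference-config.json" example, so check what DeepSpeed actually writes in your directory. The helper only picks the checkpoint-related keyword arguments; it does not call DeepSpeed itself.

```python
import os


def sharded_checkpoint_kwargs(save_dir,
                              config_name="bloom_ds-inference-config.json"):
    """Pick init_inference kwargs depending on whether shards already exist.

    First run: no config file yet, so ask DeepSpeed to save tp-sharded
    checkpoints into save_dir. Later runs: point `checkpoint` at the saved
    JSON config and leave save_mp_checkpoint_path out entirely, rather than
    passing False for it.
    """
    config_path = os.path.join(save_dir, config_name)
    if os.path.isfile(config_path):
        return {"checkpoint": config_path}
    return {"save_mp_checkpoint_path": save_dir}


# Intended use (not executed here; requires deepspeed, torch and a model).
# original_checkpoints_json is a placeholder for the checkpoint argument
# you already pass on a normal run:
# kwargs = sharded_checkpoint_kwargs("/path/to/tp-shards")
# kwargs.setdefault("checkpoint", original_checkpoints_json)
# model = deepspeed.init_inference(model, mp_size=world_size,
#                                  dtype=torch.float16,
#                                  replace_with_kernel_inject=True, **kwargs)
```

Omitting the key on later runs also sidesteps the issue mentioned earlier in the thread, where save_mp_checkpoint_path set to False instead of None caused errors.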
@RezaYazdaniAminabadi I was testing with the newly merged code last night but still hit the illegal memory accesses intermittently on the larger batch sizes. It wasn't like rolling dice, though: it would work for about half an hour, then stop working for another block of time, and then start working again.
For the first time I was able to use some larger batch sizes though (at least part of the time), so something seems to have improved.
EDIT: these tests were on 8x A100 80GB
I am glad you could run it with large batch now! :) I think this might be related to some cache allocation issues. We are working on optimizing that part too.
@RezaYazdaniAminabadi I used the master branch of DeepSpeed to run the inference script, but this illegal memory access still occurs when the input prompt is long for batch size 1. For larger batch sizes, I can run inference with batch sizes from 8 up to 32, but somehow the illegal memory error appears for batch sizes 2 and 4.
I am trying to get multi-node inference working with 4 nodes, each with 4xRTX8000 GPUs (48GB per GPU).
deepspeed --hostfile=$hostfile Megatron-DeepSpeed/scripts/inference/bloom-ds-inference.py --name bigscience/bloom
The script finishes loading all the checkpoints and begins inference, but then quickly runs into the following error:
I've tried with CUDA 10.2 and 11.6 and there's no difference.