NVIDIA / nccl

Optimized primitives for collective multi-GPU communication

NCCL ncclUnhandledCudaError: Call to CUDA function failed #1410

Open xiejibing opened 3 weeks ago

xiejibing commented 3 weeks ago

issue

The NCCL check failed with this error:

[rank7]: torch.distributed.DistBackendError: NCCL error in: ../torch/csrc/distributed/c10d/NCCLUtils.hpp:275, unhandled cuda error (run with NCCL_DEBUG=INFO for details), NCCL version 2.20.5
[rank7]: ncclUnhandledCudaError: Call to CUDA function failed.
[rank7]: Last error:
[rank7]: Cuda failure 1 'invalid argument'

environment

Collecting environment information...
There was a problem when trying to write in your cache folder (/.cache/huggingface/hub). You should set the environment variable TRANSFORMERS_CACHE to a writable directory.
PyTorch version: 2.4.0+cu121
Is debug build: False
CUDA used to build PyTorch: 12.1
ROCM used to build PyTorch: N/A

OS: Ubuntu 20.04.6 LTS (x86_64)
GCC version: (Ubuntu 9.4.0-1ubuntu1~20.04.2) 9.4.0
Clang version: Could not collect
CMake version: version 3.30.2
Libc version: glibc-2.31

Python version: 3.10.12 (main, Jul  5 2023, 19:22:19) [GCC 11.2.0] (64-bit runtime)
Python platform: Linux-5.15.0-26-generic-x86_64-with-glibc2.31
Is CUDA available: True
CUDA runtime version: 12.2.140
CUDA_MODULE_LOADING set to: LAZY
GPU models and configuration: Could not collect
Nvidia driver version: Could not collect
cuDNN version: Probably one of the following:
/usr/lib/x86_64-linux-gnu/libcudnn.so.8.9.6
/usr/lib/x86_64-linux-gnu/libcudnn_adv_infer.so.8.9.6
/usr/lib/x86_64-linux-gnu/libcudnn_adv_train.so.8.9.6
/usr/lib/x86_64-linux-gnu/libcudnn_cnn_infer.so.8.9.6
/usr/lib/x86_64-linux-gnu/libcudnn_cnn_train.so.8.9.6
/usr/lib/x86_64-linux-gnu/libcudnn_ops_infer.so.8.9.6
/usr/lib/x86_64-linux-gnu/libcudnn_ops_train.so.8.9.6
HIP runtime version: N/A
MIOpen runtime version: N/A
Is XNNPACK available: True

CPU:
Architecture:                    x86_64
CPU op-mode(s):                  32-bit, 64-bit
Byte Order:                      Little Endian
Address sizes:                   52 bits physical, 57 bits virtual
CPU(s):                          224
On-line CPU(s) list:             0-223
Thread(s) per core:              2
Core(s) per socket:              56
Socket(s):                       2
NUMA node(s):                    2
Vendor ID:                       GenuineIntel
CPU family:                      6
Model:                           143
Model name:                      Intel(R) Xeon(R) Platinum 8480+
Stepping:                        8
Frequency boost:                 enabled
CPU MHz:                         2001.000
CPU max MHz:                     2001.0000
CPU min MHz:                     800.0000
BogoMIPS:                        4000.00
Virtualization:                  VT-x
L1d cache:                       5.3 MiB
L1i cache:                       3.5 MiB
L2 cache:                        224 MiB
L3 cache:                        210 MiB
NUMA node0 CPU(s):               0-55,112-167
NUMA node1 CPU(s):               56-111,168-223
Vulnerability Itlb multihit:     Not affected
Vulnerability L1tf:              Not affected
Vulnerability Mds:               Not affected
Vulnerability Meltdown:          Not affected
Vulnerability Spec store bypass: Mitigation; Speculative Store Bypass disabled via prctl and seccomp
Vulnerability Spectre v1:        Mitigation; usercopy/swapgs barriers and __user pointer sanitization
Vulnerability Spectre v2:        Mitigation; Enhanced IBRS, IBPB conditional, RSB filling
Vulnerability Srbds:             Not affected
Vulnerability Tsx async abort:   Not affected
Flags:                           fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc art arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc cpuid aperfmperf tsc_known_freq pni pclmulqdq dtes64 monitor ds_cpl vmx smx est tm2 ssse3 sdbg fma cx16 xtpr pdcm pcid dca sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand lahf_lm abm 3dnowprefetch cpuid_fault epb cat_l3 cat_l2 cdp_l3 invpcid_single intel_ppin cdp_l2 ssbd mba ibrs ibpb stibp ibrs_enhanced tpr_shadow vnmi flexpriority ept vpid ept_ad fsgsbase tsc_adjust bmi1 avx2 smep bmi2 erms invpcid cqm rdt_a avx512f avx512dq rdseed adx smap avx512ifma clflushopt clwb intel_pt avx512cd sha_ni avx512bw avx512vl xsaveopt xsavec xgetbv1 xsaves cqm_llc cqm_occup_llc cqm_mbm_total cqm_mbm_local split_lock_detect avx_vnni avx512_bf16 wbnoinvd dtherm ida arat pln pts hwp hwp_act_window hwp_epp hwp_pkg_req avx512vbmi umip pku ospke waitpkg avx512_vbmi2 gfni vaes vpclmulqdq avx512_vnni avx512_bitalg tme avx512_vpopcntdq la57 rdpid bus_lock_detect cldemote movdiri movdir64b enqcmd fsrm md_clear serialize tsxldtrk pconfig arch_lbr amx_bf16 avx512_fp16 amx_tile amx_int8 flush_l1d arch_capabilities

Versions of relevant libraries:
[pip3] numpy==1.26.4
[pip3] nvidia-nccl-cu12==2.20.5
[pip3] pyzmq==26.0.3
[pip3] torch==2.4.0
[pip3] torchvision==0.19.0
[pip3] transformers==4.44.0
[pip3] triton==3.0.0
[conda] numpy                     1.26.4                   pypi_0    pypi
[conda] nvidia-nccl-cu12          2.20.5                   pypi_0    pypi
[conda] pyzmq                     26.0.3                   pypi_0    pypi
[conda] torch                     2.4.0                    pypi_0    pypi
[conda] torchvision               0.19.0                   pypi_0    pypi
[conda] transformers              4.44.0                   pypi_0    pypi
[conda] triton                    3.0.0                    pypi_0    pypi
ROCM Version: Could not collect
Neuron SDK Version: N/A
vLLM Version: N/A
vLLM Build Flags:
CUDA Archs: Not Set; ROCm: Disabled; Neuron: Disabled
GPU Topology:
GPU0    GPU1    GPU2    GPU3    GPU4    GPU5    GPU6    GPU7    NIC0    NIC1    CPU Affinity    NUMA Affinity   GPU NUMA ID
GPU0     X  NV18    NV18    NV18    NV18    NV18    NV18    NV18    SYS SYS 0-27,112-139    0       N/A
GPU1    NV18     X  NV18    NV18    NV18    NV18    NV18    NV18    NODE    NODE    28-55,140-167   1       N/A
GPU2    NV18    NV18     X  NV18    NV18    NV18    NV18    NV18    NODE    NODE    28-55,140-167   1       N/A
GPU3    NV18    NV18    NV18     X  NV18    NV18    NV18    NV18    SYS SYS 0-27,112-139    0       N/A
GPU4    NV18    NV18    NV18    NV18     X  NV18    NV18    NV18    SYS SYS 56-83,168-195   2       N/A
GPU5    NV18    NV18    NV18    NV18    NV18     X  NV18    NV18    SYS SYS 84-111,196-223  3       N/A
GPU6    NV18    NV18    NV18    NV18    NV18    NV18     X  NV18    SYS SYS 84-111,196-223  3       N/A
GPU7    NV18    NV18    NV18    NV18    NV18    NV18    NV18     X  SYS SYS 56-83,168-195   2       N/A
NIC0    SYS NODE    NODE    SYS SYS SYS SYS SYS  X  PIX             
NIC1    SYS NODE    NODE    SYS SYS SYS SYS SYS PIX  X              

Legend:

  X    = Self
  SYS  = Connection traversing PCIe as well as the SMP interconnect between NUMA nodes (e.g., QPI/UPI)
  NODE = Connection traversing PCIe as well as the interconnect between PCIe Host Bridges within a NUMA node
  PHB  = Connection traversing PCIe as well as a PCIe Host Bridge (typically the CPU)
  PXB  = Connection traversing multiple PCIe bridges (without traversing the PCIe Host Bridge)
  PIX  = Connection traversing at most a single PCIe bridge
  NV#  = Connection traversing a bonded set of # NVLinks

NIC Legend:

  NIC0: mlx5_0
  NIC1: mlx5_1

description

When I run `NCCL_DEBUG=TRACE torchrun --nproc-per-node=2 test.py`, the check passes. When I run `NCCL_DEBUG=TRACE torchrun --nproc-per-node=8 test.py`, the check fails. The only difference is the `--nproc-per-node` value.
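For context on why the process count matters: the check script maps each rank to a GPU via `rank % device_count`, so with 2 processes only GPUs 0 and 1 (and the single NVLink path between them) participate, while 8 processes exercise every GPU and every P2P/CUMEM path. A purely illustrative model of that mapping:

```python
def gpu_for_rank(rank: int, device_count: int = 8) -> int:
    # Mirrors local_rank = dist.get_rank() % torch.cuda.device_count() in the script
    return rank % device_count

# With --nproc-per-node=2, only GPUs 0-1 are touched
assert [gpu_for_rank(r) for r in range(2)] == [0, 1]
# With --nproc-per-node=8, all eight GPUs (and all P2P paths) are exercised
assert [gpu_for_rank(r) for r in range(8)] == [0, 1, 2, 3, 4, 5, 6, 7]
```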

Here is the check script from the vLLM debugging guide: https://docs.vllm.ai/en/latest/getting_started/debugging.html#:~:text=is%20working%20correctly.-,%23%20Test%20PyTorch%20NCCL,-import%20torch%0Aimport

```python
# Test PyTorch NCCL
import torch
import torch.distributed as dist
dist.init_process_group(backend="nccl")
local_rank = dist.get_rank() % torch.cuda.device_count()
torch.cuda.set_device(local_rank)
data = torch.FloatTensor([1,] * 128).to("cuda")
dist.all_reduce(data, op=dist.ReduceOp.SUM)
torch.cuda.synchronize()
value = data.mean().item()
world_size = dist.get_world_size()
assert value == world_size, f"Expected {world_size}, got {value}"

print("PyTorch NCCL is successful!")

# Test PyTorch GLOO
gloo_group = dist.new_group(ranks=list(range(world_size)), backend="gloo")
cpu_data = torch.FloatTensor([1,] * 128)
dist.all_reduce(cpu_data, op=dist.ReduceOp.SUM, group=gloo_group)
value = cpu_data.mean().item()
assert value == world_size, f"Expected {world_size}, got {value}"

print("PyTorch GLOO is successful!")

# Test vLLM NCCL, with cuda graph
from vllm.distributed.device_communicators.pynccl import PyNcclCommunicator

pynccl = PyNcclCommunicator(group=gloo_group, device=local_rank)
pynccl.disabled = False

s = torch.cuda.Stream()
with torch.cuda.stream(s):
    data.fill_(1)
    pynccl.all_reduce(data, stream=s)
    value = data.mean().item()
    assert value == world_size, f"Expected {world_size}, got {value}"

print("vLLM NCCL is successful!")

g = torch.cuda.CUDAGraph()
with torch.cuda.graph(cuda_graph=g, stream=s):
    pynccl.all_reduce(data, stream=torch.cuda.current_stream())

data.fill_(1)
g.replay()
torch.cuda.current_stream().synchronize()
value = data.mean().item()
assert value == world_size, f"Expected {world_size}, got {value}"

print("vLLM NCCL with cuda graph is successful!")

dist.destroy_process_group(gloo_group)
dist.destroy_process_group()
```
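A note on the assertions in the script above: `all_reduce` with `ReduceOp.SUM` leaves every rank holding the element-wise sum across all ranks, so a tensor of ones averages to exactly the world size after the reduction. A tiny pure-Python model (no torch, purely illustrative):

```python
# Illustrative model of all_reduce(op=SUM): every rank ends up with the
# element-wise sum of all ranks' tensors.
world_size = 8
rank_tensors = [[1.0] * 128 for _ in range(world_size)]  # each rank holds ones

reduced = [sum(elems) for elems in zip(*rank_tensors)]   # what each rank holds afterwards
mean = sum(reduced) / len(reduced)

assert mean == world_size  # hence the script's `value == world_size` check
```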
xiejibing commented 3 weeks ago

Logs

Here is the log after running `NCCL_DEBUG=TRACE torchrun --nproc-per-node=8 test.py`:

vLLM: 0.5.4
CUDA: 12.2

-----
W0819 19:28:50.060000 140406652626752 torch/distributed/run.py:779] 
W0819 19:28:50.060000 140406652626752 torch/distributed/run.py:779] *****************************************
W0819 19:28:50.060000 140406652626752 torch/distributed/run.py:779] Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed. 
W0819 19:28:50.060000 140406652626752 torch/distributed/run.py:779] *****************************************
h100-kickoff-llama-70b-chat-20230328-bcf75c4f6-gn6nd:2336:2336 [0] NCCL INFO Bootstrap : Using eth0:10.177.137.15<0>
h100-kickoff-llama-70b-chat-20230328-bcf75c4f6-gn6nd:2336:2336 [0] NCCL INFO NET/Plugin : dlerror=libnccl-net.so: cannot open shared object file: No such file or directory No plugin found (libnccl-net.so), using internal implementation
h100-kickoff-llama-70b-chat-20230328-bcf75c4f6-gn6nd:2336:2336 [0] NCCL INFO cudaDriverVersion 12020
NCCL version 2.20.5+cuda12.4
h100-kickoff-llama-70b-chat-20230328-bcf75c4f6-gn6nd:2337:2337 [1] NCCL INFO cudaDriverVersion 12020
h100-kickoff-llama-70b-chat-20230328-bcf75c4f6-gn6nd:2338:2338 [2] NCCL INFO cudaDriverVersion 12020
h100-kickoff-llama-70b-chat-20230328-bcf75c4f6-gn6nd:2337:2337 [1] NCCL INFO Bootstrap : Using eth0:10.177.137.15<0>
h100-kickoff-llama-70b-chat-20230328-bcf75c4f6-gn6nd:2338:2338 [2] NCCL INFO Bootstrap : Using eth0:10.177.137.15<0>
h100-kickoff-llama-70b-chat-20230328-bcf75c4f6-gn6nd:2337:2337 [1] NCCL INFO NET/Plugin : dlerror=libnccl-net.so: cannot open shared object file: No such file or directory No plugin found (libnccl-net.so), using internal implementation
h100-kickoff-llama-70b-chat-20230328-bcf75c4f6-gn6nd:2338:2338 [2] NCCL INFO NET/Plugin : dlerror=libnccl-net.so: cannot open shared object file: No such file or directory No plugin found (libnccl-net.so), using internal implementation
h100-kickoff-llama-70b-chat-20230328-bcf75c4f6-gn6nd:2341:2341 [5] NCCL INFO cudaDriverVersion 12020
h100-kickoff-llama-70b-chat-20230328-bcf75c4f6-gn6nd:2343:2343 [7] NCCL INFO cudaDriverVersion 12020
h100-kickoff-llama-70b-chat-20230328-bcf75c4f6-gn6nd:2341:2341 [5] NCCL INFO Bootstrap : Using eth0:10.177.137.15<0>
h100-kickoff-llama-70b-chat-20230328-bcf75c4f6-gn6nd:2343:2343 [7] NCCL INFO Bootstrap : Using eth0:10.177.137.15<0>
h100-kickoff-llama-70b-chat-20230328-bcf75c4f6-gn6nd:2341:2341 [5] NCCL INFO NET/Plugin : dlerror=libnccl-net.so: cannot open shared object file: No such file or directory No plugin found (libnccl-net.so), using internal implementation
h100-kickoff-llama-70b-chat-20230328-bcf75c4f6-gn6nd:2343:2343 [7] NCCL INFO NET/Plugin : dlerror=libnccl-net.so: cannot open shared object file: No such file or directory No plugin found (libnccl-net.so), using internal implementation
h100-kickoff-llama-70b-chat-20230328-bcf75c4f6-gn6nd:2342:2342 [6] NCCL INFO cudaDriverVersion 12020
h100-kickoff-llama-70b-chat-20230328-bcf75c4f6-gn6nd:2339:2339 [3] NCCL INFO cudaDriverVersion 12020
h100-kickoff-llama-70b-chat-20230328-bcf75c4f6-gn6nd:2342:2342 [6] NCCL INFO Bootstrap : Using eth0:10.177.137.15<0>
h100-kickoff-llama-70b-chat-20230328-bcf75c4f6-gn6nd:2339:2339 [3] NCCL INFO Bootstrap : Using eth0:10.177.137.15<0>
h100-kickoff-llama-70b-chat-20230328-bcf75c4f6-gn6nd:2340:2340 [4] NCCL INFO cudaDriverVersion 12020
h100-kickoff-llama-70b-chat-20230328-bcf75c4f6-gn6nd:2340:2340 [4] NCCL INFO Bootstrap : Using eth0:10.177.137.15<0>
h100-kickoff-llama-70b-chat-20230328-bcf75c4f6-gn6nd:2342:2342 [6] NCCL INFO NET/Plugin : dlerror=libnccl-net.so: cannot open shared object file: No such file or directory No plugin found (libnccl-net.so), using internal implementation
h100-kickoff-llama-70b-chat-20230328-bcf75c4f6-gn6nd:2339:2339 [3] NCCL INFO NET/Plugin : dlerror=libnccl-net.so: cannot open shared object file: No such file or directory No plugin found (libnccl-net.so), using internal implementation
h100-kickoff-llama-70b-chat-20230328-bcf75c4f6-gn6nd:2340:2340 [4] NCCL INFO NET/Plugin : dlerror=libnccl-net.so: cannot open shared object file: No such file or directory No plugin found (libnccl-net.so), using internal implementation
h100-kickoff-llama-70b-chat-20230328-bcf75c4f6-gn6nd:2336:2377 [0] NCCL INFO Failed to open libibverbs.so[.1]
h100-kickoff-llama-70b-chat-20230328-bcf75c4f6-gn6nd:2336:2377 [0] NCCL INFO NET/Socket : Using [0]eth0:10.177.137.15<0>
h100-kickoff-llama-70b-chat-20230328-bcf75c4f6-gn6nd:2336:2377 [0] NCCL INFO Using non-device net plugin version 0
h100-kickoff-llama-70b-chat-20230328-bcf75c4f6-gn6nd:2336:2377 [0] NCCL INFO Using network Socket
h100-kickoff-llama-70b-chat-20230328-bcf75c4f6-gn6nd:2341:2380 [5] NCCL INFO Failed to open libibverbs.so[.1]
h100-kickoff-llama-70b-chat-20230328-bcf75c4f6-gn6nd:2341:2380 [5] NCCL INFO NET/Socket : Using [0]eth0:10.177.137.15<0>
h100-kickoff-llama-70b-chat-20230328-bcf75c4f6-gn6nd:2341:2380 [5] NCCL INFO Using non-device net plugin version 0
h100-kickoff-llama-70b-chat-20230328-bcf75c4f6-gn6nd:2341:2380 [5] NCCL INFO Using network Socket
h100-kickoff-llama-70b-chat-20230328-bcf75c4f6-gn6nd:2343:2381 [7] NCCL INFO Failed to open libibverbs.so[.1]
h100-kickoff-llama-70b-chat-20230328-bcf75c4f6-gn6nd:2343:2381 [7] NCCL INFO NET/Socket : Using [0]eth0:10.177.137.15<0>
h100-kickoff-llama-70b-chat-20230328-bcf75c4f6-gn6nd:2343:2381 [7] NCCL INFO Using non-device net plugin version 0
h100-kickoff-llama-70b-chat-20230328-bcf75c4f6-gn6nd:2343:2381 [7] NCCL INFO Using network Socket
h100-kickoff-llama-70b-chat-20230328-bcf75c4f6-gn6nd:2339:2383 [3] NCCL INFO Failed to open libibverbs.so[.1]
h100-kickoff-llama-70b-chat-20230328-bcf75c4f6-gn6nd:2340:2384 [4] NCCL INFO Failed to open libibverbs.so[.1]
h100-kickoff-llama-70b-chat-20230328-bcf75c4f6-gn6nd:2337:2378 [1] NCCL INFO Failed to open libibverbs.so[.1]
h100-kickoff-llama-70b-chat-20230328-bcf75c4f6-gn6nd:2338:2379 [2] NCCL INFO Failed to open libibverbs.so[.1]
h100-kickoff-llama-70b-chat-20230328-bcf75c4f6-gn6nd:2340:2384 [4] NCCL INFO NET/Socket : Using [0]eth0:10.177.137.15<0>
h100-kickoff-llama-70b-chat-20230328-bcf75c4f6-gn6nd:2338:2379 [2] NCCL INFO NET/Socket : Using [0]eth0:10.177.137.15<0>
h100-kickoff-llama-70b-chat-20230328-bcf75c4f6-gn6nd:2339:2383 [3] NCCL INFO NET/Socket : Using [0]eth0:10.177.137.15<0>
h100-kickoff-llama-70b-chat-20230328-bcf75c4f6-gn6nd:2337:2378 [1] NCCL INFO NET/Socket : Using [0]eth0:10.177.137.15<0>
h100-kickoff-llama-70b-chat-20230328-bcf75c4f6-gn6nd:2340:2384 [4] NCCL INFO Using non-device net plugin version 0
h100-kickoff-llama-70b-chat-20230328-bcf75c4f6-gn6nd:2340:2384 [4] NCCL INFO Using network Socket
h100-kickoff-llama-70b-chat-20230328-bcf75c4f6-gn6nd:2338:2379 [2] NCCL INFO Using non-device net plugin version 0
h100-kickoff-llama-70b-chat-20230328-bcf75c4f6-gn6nd:2339:2383 [3] NCCL INFO Using non-device net plugin version 0
h100-kickoff-llama-70b-chat-20230328-bcf75c4f6-gn6nd:2338:2379 [2] NCCL INFO Using network Socket
h100-kickoff-llama-70b-chat-20230328-bcf75c4f6-gn6nd:2339:2383 [3] NCCL INFO Using network Socket
h100-kickoff-llama-70b-chat-20230328-bcf75c4f6-gn6nd:2337:2378 [1] NCCL INFO Using non-device net plugin version 0
h100-kickoff-llama-70b-chat-20230328-bcf75c4f6-gn6nd:2337:2378 [1] NCCL INFO Using network Socket
h100-kickoff-llama-70b-chat-20230328-bcf75c4f6-gn6nd:2342:2382 [6] NCCL INFO Failed to open libibverbs.so[.1]
h100-kickoff-llama-70b-chat-20230328-bcf75c4f6-gn6nd:2342:2382 [6] NCCL INFO NET/Socket : Using [0]eth0:10.177.137.15<0>
h100-kickoff-llama-70b-chat-20230328-bcf75c4f6-gn6nd:2342:2382 [6] NCCL INFO Using non-device net plugin version 0
h100-kickoff-llama-70b-chat-20230328-bcf75c4f6-gn6nd:2342:2382 [6] NCCL INFO Using network Socket
h100-kickoff-llama-70b-chat-20230328-bcf75c4f6-gn6nd:2338:2379 [2] NCCL INFO comm 0x7c429b0 rank 2 nranks 8 cudaDev 2 nvmlDev 2 busId 3a000 commId 0x50e06a01eda5a139 - Init START
h100-kickoff-llama-70b-chat-20230328-bcf75c4f6-gn6nd:2339:2383 [3] NCCL INFO comm 0x7d01420 rank 3 nranks 8 cudaDev 3 nvmlDev 3 busId 5d000 commId 0x50e06a01eda5a139 - Init START
h100-kickoff-llama-70b-chat-20230328-bcf75c4f6-gn6nd:2342:2382 [6] NCCL INFO comm 0x87952f0 rank 6 nranks 8 cudaDev 6 nvmlDev 6 busId ba000 commId 0x50e06a01eda5a139 - Init START
h100-kickoff-llama-70b-chat-20230328-bcf75c4f6-gn6nd:2343:2381 [7] NCCL INFO comm 0x7687540 rank 7 nranks 8 cudaDev 7 nvmlDev 7 busId db000 commId 0x50e06a01eda5a139 - Init START
h100-kickoff-llama-70b-chat-20230328-bcf75c4f6-gn6nd:2337:2378 [1] NCCL INFO comm 0x7e3e850 rank 1 nranks 8 cudaDev 1 nvmlDev 1 busId 2a000 commId 0x50e06a01eda5a139 - Init START
h100-kickoff-llama-70b-chat-20230328-bcf75c4f6-gn6nd:2340:2384 [4] NCCL INFO comm 0x8d3a240 rank 4 nranks 8 cudaDev 4 nvmlDev 4 busId 9a000 commId 0x50e06a01eda5a139 - Init START
h100-kickoff-llama-70b-chat-20230328-bcf75c4f6-gn6nd:2336:2377 [0] NCCL INFO comm 0x7a20aa0 rank 0 nranks 8 cudaDev 0 nvmlDev 0 busId 18000 commId 0x50e06a01eda5a139 - Init START
h100-kickoff-llama-70b-chat-20230328-bcf75c4f6-gn6nd:2341:2380 [5] NCCL INFO comm 0x8df0a00 rank 5 nranks 8 cudaDev 5 nvmlDev 5 busId ab000 commId 0x50e06a01eda5a139 - Init START
h100-kickoff-llama-70b-chat-20230328-bcf75c4f6-gn6nd:2339:2383 [3] NCCL INFO Setting affinity for GPU 3 to 0fff,ffff0000,00000000,00000000,0fffffff
h100-kickoff-llama-70b-chat-20230328-bcf75c4f6-gn6nd:2339:2383 [3] NCCL INFO NVLS multicast support is available on dev 3
h100-kickoff-llama-70b-chat-20230328-bcf75c4f6-gn6nd:2342:2382 [6] NCCL INFO Setting affinity for GPU 6 to fffffff0,00000000,00000000,0000ffff,fff00000,00000000,00000000
h100-kickoff-llama-70b-chat-20230328-bcf75c4f6-gn6nd:2342:2382 [6] NCCL INFO NVLS multicast support is available on dev 6
h100-kickoff-llama-70b-chat-20230328-bcf75c4f6-gn6nd:2340:2384 [4] NCCL INFO Setting affinity for GPU 4 to 0f,ffffff00,00000000,00000000,000fffff,ff000000,00000000
h100-kickoff-llama-70b-chat-20230328-bcf75c4f6-gn6nd:2340:2384 [4] NCCL INFO NVLS multicast support is available on dev 4
h100-kickoff-llama-70b-chat-20230328-bcf75c4f6-gn6nd:2343:2381 [7] NCCL INFO Setting affinity for GPU 7 to 0f,ffffff00,00000000,00000000,000fffff,ff000000,00000000
h100-kickoff-llama-70b-chat-20230328-bcf75c4f6-gn6nd:2343:2381 [7] NCCL INFO NVLS multicast support is available on dev 7
h100-kickoff-llama-70b-chat-20230328-bcf75c4f6-gn6nd:2338:2379 [2] NCCL INFO Setting affinity for GPU 2 to ff,fffff000,00000000,00000000,00ffffff,f0000000
h100-kickoff-llama-70b-chat-20230328-bcf75c4f6-gn6nd:2338:2379 [2] NCCL INFO NVLS multicast support is available on dev 2
h100-kickoff-llama-70b-chat-20230328-bcf75c4f6-gn6nd:2336:2377 [0] NCCL INFO Setting affinity for GPU 0 to 0fff,ffff0000,00000000,00000000,0fffffff
h100-kickoff-llama-70b-chat-20230328-bcf75c4f6-gn6nd:2336:2377 [0] NCCL INFO NVLS multicast support is available on dev 0
h100-kickoff-llama-70b-chat-20230328-bcf75c4f6-gn6nd:2337:2378 [1] NCCL INFO Setting affinity for GPU 1 to ff,fffff000,00000000,00000000,00ffffff,f0000000
h100-kickoff-llama-70b-chat-20230328-bcf75c4f6-gn6nd:2337:2378 [1] NCCL INFO NVLS multicast support is available on dev 1
h100-kickoff-llama-70b-chat-20230328-bcf75c4f6-gn6nd:2341:2380 [5] NCCL INFO Setting affinity for GPU 5 to fffffff0,00000000,00000000,0000ffff,fff00000,00000000,00000000
h100-kickoff-llama-70b-chat-20230328-bcf75c4f6-gn6nd:2341:2380 [5] NCCL INFO NVLS multicast support is available on dev 5
h100-kickoff-llama-70b-chat-20230328-bcf75c4f6-gn6nd:2342:2382 [6] NCCL INFO comm 0x87952f0 rank 6 nRanks 8 nNodes 1 localRanks 8 localRank 6 MNNVL 0
h100-kickoff-llama-70b-chat-20230328-bcf75c4f6-gn6nd:2340:2384 [4] NCCL INFO comm 0x8d3a240 rank 4 nRanks 8 nNodes 1 localRanks 8 localRank 4 MNNVL 0
h100-kickoff-llama-70b-chat-20230328-bcf75c4f6-gn6nd:2339:2383 [3] NCCL INFO comm 0x7d01420 rank 3 nRanks 8 nNodes 1 localRanks 8 localRank 3 MNNVL 0
h100-kickoff-llama-70b-chat-20230328-bcf75c4f6-gn6nd:2337:2378 [1] NCCL INFO comm 0x7e3e850 rank 1 nRanks 8 nNodes 1 localRanks 8 localRank 1 MNNVL 0
h100-kickoff-llama-70b-chat-20230328-bcf75c4f6-gn6nd:2342:2382 [6] NCCL INFO Trees [0] -1/-1/-1->6->5 [1] -1/-1/-1->6->5 [2] -1/-1/-1->6->5 [3] -1/-1/-1->6->5 [4] -1/-1/-1->6->5 [5] -1/-1/-1->6->5 [6] -1/-1/-1->6->5 [7] -1/-1/-1->6->5 [8] -1/-1/-1->6->5 [9] -1/-1/-1->6->5 [10] -1/-1/-1->6->5 [11] -1/-1/-1->6->5 [12] -1/-1/-1->6->5 [13] -1/-1/-1->6->5 [14] -1/-1/-1->6->5 [15] -1/-1/-1->6->5 [16] -1/-1/-1->6->5 [17] -1/-1/-1->6->5 [18] -1/-1/-1->6->5 [19] -1/-1/-1->6->5 [20] -1/-1/-1->6->5 [21] -1/-1/-1->6->5 [22] -1/-1/-1->6->5 [23] -1/-1/-1->6->5
h100-kickoff-llama-70b-chat-20230328-bcf75c4f6-gn6nd:2341:2380 [5] NCCL INFO comm 0x8df0a00 rank 5 nRanks 8 nNodes 1 localRanks 8 localRank 5 MNNVL 0
h100-kickoff-llama-70b-chat-20230328-bcf75c4f6-gn6nd:2342:2382 [6] NCCL INFO P2P Chunksize set to 524288
h100-kickoff-llama-70b-chat-20230328-bcf75c4f6-gn6nd:2340:2384 [4] NCCL INFO Trees [0] 7/-1/-1->4->2 [1] 7/-1/-1->4->2 [2] 7/-1/-1->4->2 [3] 7/-1/-1->4->2 [4] 7/-1/-1->4->2 [5] 7/-1/-1->4->2 [6] 7/-1/-1->4->2 [7] 7/-1/-1->4->2 [8] 7/-1/-1->4->2 [9] 7/-1/-1->4->2 [10] 7/-1/-1->4->2 [11] 7/-1/-1->4->2 [12] 7/-1/-1->4->2 [13] 7/-1/-1->4->2 [14] 7/-1/-1->4->2 [15] 7/-1/-1->4->2 [16] 7/-1/-1->4->2 [17] 7/-1/-1->4->2 [18] 7/-1/-1->4->2 [19] 7/-1/-1->4->2 [20] 7/-1/-1->4->2 [21] 7/-1/-1->4->2 [22] 7/-1/-1->4->2 [23] 7/-1/-1->4->2
h100-kickoff-llama-70b-chat-20230328-bcf75c4f6-gn6nd:2340:2384 [4] NCCL INFO P2P Chunksize set to 524288
h100-kickoff-llama-70b-chat-20230328-bcf75c4f6-gn6nd:2339:2383 [3] NCCL INFO Trees [0] 1/-1/-1->3->0 [1] 1/-1/-1->3->0 [2] 1/-1/-1->3->0 [3] 1/-1/-1->3->0 [4] 1/-1/-1->3->0 [5] 1/-1/-1->3->0 [6] 1/-1/-1->3->0 [7] 1/-1/-1->3->0 [8] 1/-1/-1->3->0 [9] 1/-1/-1->3->0 [10] 1/-1/-1->3->0 [11] 1/-1/-1->3->0 [12] 1/-1/-1->3->0 [13] 1/-1/-1->3->0 [14] 1/-1/-1->3->0 [15] 1/-1/-1->3->0 [16] 1/-1/-1->3->0 [17] 1/-1/-1->3->0 [18] 1/-1/-1->3->0 [19] 1/-1/-1->3->0 [20] 1/-1/-1->3->0 [21] 1/-1/-1->3->0 [22] 1/-1/-1->3->0 [23] 1/-1/-1->3->0
h100-kickoff-llama-70b-chat-20230328-bcf75c4f6-gn6nd:2341:2380 [5] NCCL INFO Trees [0] 6/-1/-1->5->7 [1] 6/-1/-1->5->7 [2] 6/-1/-1->5->7 [3] 6/-1/-1->5->7 [4] 6/-1/-1->5->7 [5] 6/-1/-1->5->7 [6] 6/-1/-1->5->7 [7] 6/-1/-1->5->7 [8] 6/-1/-1->5->7 [9] 6/-1/-1->5->7 [10] 6/-1/-1->5->7 [11] 6/-1/-1->5->7 [12] 6/-1/-1->5->7 [13] 6/-1/-1->5->7 [14] 6/-1/-1->5->7 [15] 6/-1/-1->5->7 [16] 6/-1/-1->5->7 [17] 6/-1/-1->5->7 [18] 6/-1/-1->5->7 [19] 6/-1/-1->5->7 [20] 6/-1/-1->5->7 [21] 6/-1/-1->5->7 [22] 6/-1/-1->5->7 [23] 6/-1/-1->5->7
h100-kickoff-llama-70b-chat-20230328-bcf75c4f6-gn6nd:2337:2378 [1] NCCL INFO Trees [0] 2/-1/-1->1->3 [1] 2/-1/-1->1->3 [2] 2/-1/-1->1->3 [3] 2/-1/-1->1->3 [4] 2/-1/-1->1->3 [5] 2/-1/-1->1->3 [6] 2/-1/-1->1->3 [7] 2/-1/-1->1->3 [8] 2/-1/-1->1->3 [9] 2/-1/-1->1->3 [10] 2/-1/-1->1->3 [11] 2/-1/-1->1->3 [12] 2/-1/-1->1->3 [13] 2/-1/-1->1->3 [14] 2/-1/-1->1->3 [15] 2/-1/-1->1->3 [16] 2/-1/-1->1->3 [17] 2/-1/-1->1->3 [18] 2/-1/-1->1->3 [19] 2/-1/-1->1->3 [20] 2/-1/-1->1->3 [21] 2/-1/-1->1->3 [22] 2/-1/-1->1->3 [23] 2/-1/-1->1->3
h100-kickoff-llama-70b-chat-20230328-bcf75c4f6-gn6nd:2341:2380 [5] NCCL INFO P2P Chunksize set to 524288
h100-kickoff-llama-70b-chat-20230328-bcf75c4f6-gn6nd:2337:2378 [1] NCCL INFO P2P Chunksize set to 524288
h100-kickoff-llama-70b-chat-20230328-bcf75c4f6-gn6nd:2338:2379 [2] NCCL INFO comm 0x7c429b0 rank 2 nRanks 8 nNodes 1 localRanks 8 localRank 2 MNNVL 0
h100-kickoff-llama-70b-chat-20230328-bcf75c4f6-gn6nd:2343:2381 [7] NCCL INFO comm 0x7687540 rank 7 nRanks 8 nNodes 1 localRanks 8 localRank 7 MNNVL 0
h100-kickoff-llama-70b-chat-20230328-bcf75c4f6-gn6nd:2336:2377 [0] NCCL INFO comm 0x7a20aa0 rank 0 nRanks 8 nNodes 1 localRanks 8 localRank 0 MNNVL 0
h100-kickoff-llama-70b-chat-20230328-bcf75c4f6-gn6nd:2339:2383 [3] NCCL INFO P2P Chunksize set to 524288
h100-kickoff-llama-70b-chat-20230328-bcf75c4f6-gn6nd:2336:2377 [0] NCCL INFO Channel 00/24 :    0   3   1   2   4   7   5   6
h100-kickoff-llama-70b-chat-20230328-bcf75c4f6-gn6nd:2336:2377 [0] NCCL INFO Channel 01/24 :    0   3   1   2   4   7   5   6
h100-kickoff-llama-70b-chat-20230328-bcf75c4f6-gn6nd:2343:2381 [7] NCCL INFO Trees [0] 5/-1/-1->7->4 [1] 5/-1/-1->7->4 [2] 5/-1/-1->7->4 [3] 5/-1/-1->7->4 [4] 5/-1/-1->7->4 [5] 5/-1/-1->7->4 [6] 5/-1/-1->7->4 [7] 5/-1/-1->7->4 [8] 5/-1/-1->7->4 [9] 5/-1/-1->7->4 [10] 5/-1/-1->7->4 [11] 5/-1/-1->7->4 [12] 5/-1/-1->7->4 [13] 5/-1/-1->7->4 [14] 5/-1/-1->7->4 [15] 5/-1/-1->7->4 [16] 5/-1/-1->7->4 [17] 5/-1/-1->7->4 [18] 5/-1/-1->7->4 [19] 5/-1/-1->7->4 [20] 5/-1/-1->7->4 [21] 5/-1/-1->7->4 [22] 5/-1/-1->7->4 [23] 5/-1/-1->7->4
h100-kickoff-llama-70b-chat-20230328-bcf75c4f6-gn6nd:2338:2379 [2] NCCL INFO Trees [0] 4/-1/-1->2->1 [1] 4/-1/-1->2->1 [2] 4/-1/-1->2->1 [3] 4/-1/-1->2->1 [4] 4/-1/-1->2->1 [5] 4/-1/-1->2->1 [6] 4/-1/-1->2->1 [7] 4/-1/-1->2->1 [8] 4/-1/-1->2->1 [9] 4/-1/-1->2->1 [10] 4/-1/-1->2->1 [11] 4/-1/-1->2->1 [12] 4/-1/-1->2->1 [13] 4/-1/-1->2->1 [14] 4/-1/-1->2->1 [15] 4/-1/-1->2->1 [16] 4/-1/-1->2->1 [17] 4/-1/-1->2->1 [18] 4/-1/-1->2->1 [19] 4/-1/-1->2->1 [20] 4/-1/-1->2->1 [21] 4/-1/-1->2->1 [22] 4/-1/-1->2->1 [23] 4/-1/-1->2->1
h100-kickoff-llama-70b-chat-20230328-bcf75c4f6-gn6nd:2336:2377 [0] NCCL INFO Channel 02/24 :    0   3   1   2   4   7   5   6
h100-kickoff-llama-70b-chat-20230328-bcf75c4f6-gn6nd:2343:2381 [7] NCCL INFO P2P Chunksize set to 524288
h100-kickoff-llama-70b-chat-20230328-bcf75c4f6-gn6nd:2338:2379 [2] NCCL INFO P2P Chunksize set to 524288
h100-kickoff-llama-70b-chat-20230328-bcf75c4f6-gn6nd:2336:2377 [0] NCCL INFO Channel 03/24 :    0   3   1   2   4   7   5   6
h100-kickoff-llama-70b-chat-20230328-bcf75c4f6-gn6nd:2336:2377 [0] NCCL INFO Channel 04/24 :    0   3   1   2   4   7   5   6
h100-kickoff-llama-70b-chat-20230328-bcf75c4f6-gn6nd:2336:2377 [0] NCCL INFO Channel 05/24 :    0   3   1   2   4   7   5   6
h100-kickoff-llama-70b-chat-20230328-bcf75c4f6-gn6nd:2336:2377 [0] NCCL INFO Channel 06/24 :    0   3   1   2   4   7   5   6
h100-kickoff-llama-70b-chat-20230328-bcf75c4f6-gn6nd:2336:2377 [0] NCCL INFO Channel 07/24 :    0   3   1   2   4   7   5   6
h100-kickoff-llama-70b-chat-20230328-bcf75c4f6-gn6nd:2336:2377 [0] NCCL INFO Channel 08/24 :    0   3   1   2   4   7   5   6
h100-kickoff-llama-70b-chat-20230328-bcf75c4f6-gn6nd:2336:2377 [0] NCCL INFO Channel 09/24 :    0   3   1   2   4   7   5   6
h100-kickoff-llama-70b-chat-20230328-bcf75c4f6-gn6nd:2336:2377 [0] NCCL INFO Channel 10/24 :    0   3   1   2   4   7   5   6
h100-kickoff-llama-70b-chat-20230328-bcf75c4f6-gn6nd:2336:2377 [0] NCCL INFO Channel 11/24 :    0   3   1   2   4   7   5   6
h100-kickoff-llama-70b-chat-20230328-bcf75c4f6-gn6nd:2336:2377 [0] NCCL INFO Channel 12/24 :    0   3   1   2   4   7   5   6
h100-kickoff-llama-70b-chat-20230328-bcf75c4f6-gn6nd:2336:2377 [0] NCCL INFO Channel 13/24 :    0   3   1   2   4   7   5   6
h100-kickoff-llama-70b-chat-20230328-bcf75c4f6-gn6nd:2336:2377 [0] NCCL INFO Channel 14/24 :    0   3   1   2   4   7   5   6
h100-kickoff-llama-70b-chat-20230328-bcf75c4f6-gn6nd:2336:2377 [0] NCCL INFO Channel 15/24 :    0   3   1   2   4   7   5   6
h100-kickoff-llama-70b-chat-20230328-bcf75c4f6-gn6nd:2336:2377 [0] NCCL INFO Channel 16/24 :    0   3   1   2   4   7   5   6
h100-kickoff-llama-70b-chat-20230328-bcf75c4f6-gn6nd:2336:2377 [0] NCCL INFO Channel 17/24 :    0   3   1   2   4   7   5   6
h100-kickoff-llama-70b-chat-20230328-bcf75c4f6-gn6nd:2336:2377 [0] NCCL INFO Channel 18/24 :    0   3   1   2   4   7   5   6
h100-kickoff-llama-70b-chat-20230328-bcf75c4f6-gn6nd:2336:2377 [0] NCCL INFO Channel 19/24 :    0   3   1   2   4   7   5   6
h100-kickoff-llama-70b-chat-20230328-bcf75c4f6-gn6nd:2336:2377 [0] NCCL INFO Channel 20/24 :    0   3   1   2   4   7   5   6
h100-kickoff-llama-70b-chat-20230328-bcf75c4f6-gn6nd:2336:2377 [0] NCCL INFO Channel 21/24 :    0   3   1   2   4   7   5   6
h100-kickoff-llama-70b-chat-20230328-bcf75c4f6-gn6nd:2336:2377 [0] NCCL INFO Channel 22/24 :    0   3   1   2   4   7   5   6
h100-kickoff-llama-70b-chat-20230328-bcf75c4f6-gn6nd:2336:2377 [0] NCCL INFO Channel 23/24 :    0   3   1   2   4   7   5   6
h100-kickoff-llama-70b-chat-20230328-bcf75c4f6-gn6nd:2336:2377 [0] NCCL INFO Trees [0] 3/-1/-1->0->-1 [1] 3/-1/-1->0->-1 [2] 3/-1/-1->0->-1 [3] 3/-1/-1->0->-1 [4] 3/-1/-1->0->-1 [5] 3/-1/-1->0->-1 [6] 3/-1/-1->0->-1 [7] 3/-1/-1->0->-1 [8] 3/-1/-1->0->-1 [9] 3/-1/-1->0->-1 [10] 3/-1/-1->0->-1 [11] 3/-1/-1->0->-1 [12] 3/-1/-1->0->-1 [13] 3/-1/-1->0->-1 [14] 3/-1/-1->0->-1 [15] 3/-1/-1->0->-1 [16] 3/-1/-1->0->-1 [17] 3/-1/-1->0->-1 [18] 3/-1/-1->0->-1 [19] 3/-1/-1->0->-1 [20] 3/-1/-1->0->-1 [21] 3/-1/-1->0->-1 [22] 3/-1/-1->0->-1 [23] 3/-1/-1->0->-1
h100-kickoff-llama-70b-chat-20230328-bcf75c4f6-gn6nd:2336:2377 [0] NCCL INFO P2P Chunksize set to 524288
h100-kickoff-llama-70b-chat-20230328-bcf75c4f6-gn6nd:2341:2380 [5] NCCL INFO Channel 00/0 : 5[5] -> 6[6] via P2P/CUMEM
h100-kickoff-llama-70b-chat-20230328-bcf75c4f6-gn6nd:2337:2378 [1] NCCL INFO Channel 00/0 : 1[1] -> 2[2] via P2P/CUMEM
h100-kickoff-llama-70b-chat-20230328-bcf75c4f6-gn6nd:2342:2382 [6] NCCL INFO Channel 00/0 : 6[6] -> 0[0] via P2P/CUMEM
h100-kickoff-llama-70b-chat-20230328-bcf75c4f6-gn6nd:2338:2379 [2] NCCL INFO Channel 00/0 : 2[2] -> 4[4] via P2P/CUMEM
h100-kickoff-llama-70b-chat-20230328-bcf75c4f6-gn6nd:2336:2377 [0] NCCL INFO Channel 00/0 : 0[0] -> 3[3] via P2P/CUMEM
h100-kickoff-llama-70b-chat-20230328-bcf75c4f6-gn6nd:2340:2384 [4] NCCL INFO Channel 00/0 : 4[4] -> 7[7] via P2P/CUMEM
h100-kickoff-llama-70b-chat-20230328-bcf75c4f6-gn6nd:2339:2383 [3] NCCL INFO Channel 00/0 : 3[3] -> 1[1] via P2P/CUMEM
h100-kickoff-llama-70b-chat-20230328-bcf75c4f6-gn6nd:2343:2381 [7] NCCL INFO Channel 00/0 : 7[7] -> 5[5] via P2P/CUMEM
[... channels 01/0 through 23/0 repeat identically for each of the eight ring connections above, elided ...]
h100-kickoff-llama-70b-chat-20230328-bcf75c4f6-gn6nd:2338:2379 [2] NCCL INFO Connected all rings
h100-kickoff-llama-70b-chat-20230328-bcf75c4f6-gn6nd:2340:2384 [4] NCCL INFO Connected all rings
h100-kickoff-llama-70b-chat-20230328-bcf75c4f6-gn6nd:2343:2381 [7] NCCL INFO Connected all rings
h100-kickoff-llama-70b-chat-20230328-bcf75c4f6-gn6nd:2341:2380 [5] NCCL INFO Connected all rings
h100-kickoff-llama-70b-chat-20230328-bcf75c4f6-gn6nd:2342:2382 [6] NCCL INFO Connected all rings
h100-kickoff-llama-70b-chat-20230328-bcf75c4f6-gn6nd:2341:2380 [5] NCCL INFO Channel 00/0 : 5[5] -> 7[7] via P2P/CUMEM
h100-kickoff-llama-70b-chat-20230328-bcf75c4f6-gn6nd:2342:2382 [6] NCCL INFO Channel 00/0 : 6[6] -> 5[5] via P2P/CUMEM
h100-kickoff-llama-70b-chat-20230328-bcf75c4f6-gn6nd:2336:2377 [0] NCCL INFO Connected all rings
h100-kickoff-llama-70b-chat-20230328-bcf75c4f6-gn6nd:2339:2383 [3] NCCL INFO Connected all rings
h100-kickoff-llama-70b-chat-20230328-bcf75c4f6-gn6nd:2337:2378 [1] NCCL INFO Connected all rings
h100-kickoff-llama-70b-chat-20230328-bcf75c4f6-gn6nd:2337:2378 [1] NCCL INFO Channel 00/0 : 1[1] -> 3[3] via P2P/CUMEM
[... channels 01/0 through 23/0 repeat identically for 5[5] -> 7[7], 6[6] -> 5[5], and 1[1] -> 3[3], elided ...]
h100-kickoff-llama-70b-chat-20230328-bcf75c4f6-gn6nd:2343:2381 [7] NCCL INFO Channel 00/0 : 7[7] -> 4[4] via P2P/CUMEM
h100-kickoff-llama-70b-chat-20230328-bcf75c4f6-gn6nd:2339:2383 [3] NCCL INFO Channel 00/0 : 3[3] -> 0[0] via P2P/CUMEM
[... channels 01/0 through 13/0 repeat identically for 7[7] -> 4[4] and 3[3] -> 0[0], elided ...]
h100-kickoff-llama-70b-chat-20230328-bcf75c4f6-gn6nd:2343:2381 [7] NCCL INFO Channel 14/0 : 7[7] -> 4[4] via P2P/CUMEM
h100-kickoff-llama-70b-chat-20230328-bcf75c4f6-gn6nd:2339:2383 [3] NCCL INFO Channel 14/0 : 3[3] -> 0[0] via P2P/CUMEM
h100-kickoff-llama-70b-chat-20230328-bcf75c4f6-gn6nd:2343:2381 [7] NCCL INFO Channel 15/0 : 7[7] -> 4[4] via P2P/CUMEM
h100-kickoff-llama-70b-chat-20230328-bcf75c4f6-gn6nd:2339:2383 [3] NCCL INFO Channel 15/0 : 3[3] -> 0[0] via P2P/CUMEM
h100-kickoff-llama-70b-chat-20230328-bcf75c4f6-gn6nd:2343:2381 [7] NCCL INFO Channel 16/0 : 7[7] -> 4[4] via P2P/CUMEM
h100-kickoff-llama-70b-chat-20230328-bcf75c4f6-gn6nd:2339:2383 [3] NCCL INFO Channel 16/0 : 3[3] -> 0[0] via P2P/CUMEM
h100-kickoff-llama-70b-chat-20230328-bcf75c4f6-gn6nd:2343:2381 [7] NCCL INFO Channel 17/0 : 7[7] -> 4[4] via P2P/CUMEM
h100-kickoff-llama-70b-chat-20230328-bcf75c4f6-gn6nd:2339:2383 [3] NCCL INFO Channel 17/0 : 3[3] -> 0[0] via P2P/CUMEM
h100-kickoff-llama-70b-chat-20230328-bcf75c4f6-gn6nd:2343:2381 [7] NCCL INFO Channel 18/0 : 7[7] -> 4[4] via P2P/CUMEM
h100-kickoff-llama-70b-chat-20230328-bcf75c4f6-gn6nd:2339:2383 [3] NCCL INFO Channel 18/0 : 3[3] -> 0[0] via P2P/CUMEM
h100-kickoff-llama-70b-chat-20230328-bcf75c4f6-gn6nd:2343:2381 [7] NCCL INFO Channel 19/0 : 7[7] -> 4[4] via P2P/CUMEM
h100-kickoff-llama-70b-chat-20230328-bcf75c4f6-gn6nd:2339:2383 [3] NCCL INFO Channel 19/0 : 3[3] -> 0[0] via P2P/CUMEM
h100-kickoff-llama-70b-chat-20230328-bcf75c4f6-gn6nd:2343:2381 [7] NCCL INFO Channel 20/0 : 7[7] -> 4[4] via P2P/CUMEM
h100-kickoff-llama-70b-chat-20230328-bcf75c4f6-gn6nd:2339:2383 [3] NCCL INFO Channel 20/0 : 3[3] -> 0[0] via P2P/CUMEM
h100-kickoff-llama-70b-chat-20230328-bcf75c4f6-gn6nd:2343:2381 [7] NCCL INFO Channel 21/0 : 7[7] -> 4[4] via P2P/CUMEM
h100-kickoff-llama-70b-chat-20230328-bcf75c4f6-gn6nd:2339:2383 [3] NCCL INFO Channel 21/0 : 3[3] -> 0[0] via P2P/CUMEM
h100-kickoff-llama-70b-chat-20230328-bcf75c4f6-gn6nd:2343:2381 [7] NCCL INFO Channel 22/0 : 7[7] -> 4[4] via P2P/CUMEM
h100-kickoff-llama-70b-chat-20230328-bcf75c4f6-gn6nd:2339:2383 [3] NCCL INFO Channel 22/0 : 3[3] -> 0[0] via P2P/CUMEM
h100-kickoff-llama-70b-chat-20230328-bcf75c4f6-gn6nd:2343:2381 [7] NCCL INFO Channel 23/0 : 7[7] -> 4[4] via P2P/CUMEM
h100-kickoff-llama-70b-chat-20230328-bcf75c4f6-gn6nd:2339:2383 [3] NCCL INFO Channel 23/0 : 3[3] -> 0[0] via P2P/CUMEM
h100-kickoff-llama-70b-chat-20230328-bcf75c4f6-gn6nd:2340:2384 [4] NCCL INFO Channel 00/0 : 4[4] -> 2[2] via P2P/CUMEM
h100-kickoff-llama-70b-chat-20230328-bcf75c4f6-gn6nd:2340:2384 [4] NCCL INFO Channel 01/0 : 4[4] -> 2[2] via P2P/CUMEM
h100-kickoff-llama-70b-chat-20230328-bcf75c4f6-gn6nd:2340:2384 [4] NCCL INFO Channel 02/0 : 4[4] -> 2[2] via P2P/CUMEM
h100-kickoff-llama-70b-chat-20230328-bcf75c4f6-gn6nd:2340:2384 [4] NCCL INFO Channel 03/0 : 4[4] -> 2[2] via P2P/CUMEM
h100-kickoff-llama-70b-chat-20230328-bcf75c4f6-gn6nd:2340:2384 [4] NCCL INFO Channel 04/0 : 4[4] -> 2[2] via P2P/CUMEM
h100-kickoff-llama-70b-chat-20230328-bcf75c4f6-gn6nd:2340:2384 [4] NCCL INFO Channel 05/0 : 4[4] -> 2[2] via P2P/CUMEM
h100-kickoff-llama-70b-chat-20230328-bcf75c4f6-gn6nd:2340:2384 [4] NCCL INFO Channel 06/0 : 4[4] -> 2[2] via P2P/CUMEM
h100-kickoff-llama-70b-chat-20230328-bcf75c4f6-gn6nd:2340:2384 [4] NCCL INFO Channel 07/0 : 4[4] -> 2[2] via P2P/CUMEM
h100-kickoff-llama-70b-chat-20230328-bcf75c4f6-gn6nd:2340:2384 [4] NCCL INFO Channel 08/0 : 4[4] -> 2[2] via P2P/CUMEM
h100-kickoff-llama-70b-chat-20230328-bcf75c4f6-gn6nd:2340:2384 [4] NCCL INFO Channel 09/0 : 4[4] -> 2[2] via P2P/CUMEM
h100-kickoff-llama-70b-chat-20230328-bcf75c4f6-gn6nd:2340:2384 [4] NCCL INFO Channel 10/0 : 4[4] -> 2[2] via P2P/CUMEM
h100-kickoff-llama-70b-chat-20230328-bcf75c4f6-gn6nd:2340:2384 [4] NCCL INFO Channel 11/0 : 4[4] -> 2[2] via P2P/CUMEM
h100-kickoff-llama-70b-chat-20230328-bcf75c4f6-gn6nd:2340:2384 [4] NCCL INFO Channel 12/0 : 4[4] -> 2[2] via P2P/CUMEM
h100-kickoff-llama-70b-chat-20230328-bcf75c4f6-gn6nd:2340:2384 [4] NCCL INFO Channel 13/0 : 4[4] -> 2[2] via P2P/CUMEM
h100-kickoff-llama-70b-chat-20230328-bcf75c4f6-gn6nd:2340:2384 [4] NCCL INFO Channel 14/0 : 4[4] -> 2[2] via P2P/CUMEM
h100-kickoff-llama-70b-chat-20230328-bcf75c4f6-gn6nd:2340:2384 [4] NCCL INFO Channel 15/0 : 4[4] -> 2[2] via P2P/CUMEM
h100-kickoff-llama-70b-chat-20230328-bcf75c4f6-gn6nd:2340:2384 [4] NCCL INFO Channel 16/0 : 4[4] -> 2[2] via P2P/CUMEM
h100-kickoff-llama-70b-chat-20230328-bcf75c4f6-gn6nd:2340:2384 [4] NCCL INFO Channel 17/0 : 4[4] -> 2[2] via P2P/CUMEM
h100-kickoff-llama-70b-chat-20230328-bcf75c4f6-gn6nd:2340:2384 [4] NCCL INFO Channel 18/0 : 4[4] -> 2[2] via P2P/CUMEM
h100-kickoff-llama-70b-chat-20230328-bcf75c4f6-gn6nd:2340:2384 [4] NCCL INFO Channel 19/0 : 4[4] -> 2[2] via P2P/CUMEM
h100-kickoff-llama-70b-chat-20230328-bcf75c4f6-gn6nd:2340:2384 [4] NCCL INFO Channel 20/0 : 4[4] -> 2[2] via P2P/CUMEM
h100-kickoff-llama-70b-chat-20230328-bcf75c4f6-gn6nd:2340:2384 [4] NCCL INFO Channel 21/0 : 4[4] -> 2[2] via P2P/CUMEM
h100-kickoff-llama-70b-chat-20230328-bcf75c4f6-gn6nd:2340:2384 [4] NCCL INFO Channel 22/0 : 4[4] -> 2[2] via P2P/CUMEM
h100-kickoff-llama-70b-chat-20230328-bcf75c4f6-gn6nd:2340:2384 [4] NCCL INFO Channel 23/0 : 4[4] -> 2[2] via P2P/CUMEM
h100-kickoff-llama-70b-chat-20230328-bcf75c4f6-gn6nd:2338:2379 [2] NCCL INFO Channel 00/0 : 2[2] -> 1[1] via P2P/CUMEM
h100-kickoff-llama-70b-chat-20230328-bcf75c4f6-gn6nd:2338:2379 [2] NCCL INFO Channel 01/0 : 2[2] -> 1[1] via P2P/CUMEM
h100-kickoff-llama-70b-chat-20230328-bcf75c4f6-gn6nd:2338:2379 [2] NCCL INFO Channel 02/0 : 2[2] -> 1[1] via P2P/CUMEM
h100-kickoff-llama-70b-chat-20230328-bcf75c4f6-gn6nd:2338:2379 [2] NCCL INFO Channel 03/0 : 2[2] -> 1[1] via P2P/CUMEM
h100-kickoff-llama-70b-chat-20230328-bcf75c4f6-gn6nd:2338:2379 [2] NCCL INFO Channel 04/0 : 2[2] -> 1[1] via P2P/CUMEM
h100-kickoff-llama-70b-chat-20230328-bcf75c4f6-gn6nd:2338:2379 [2] NCCL INFO Channel 05/0 : 2[2] -> 1[1] via P2P/CUMEM
h100-kickoff-llama-70b-chat-20230328-bcf75c4f6-gn6nd:2338:2379 [2] NCCL INFO Channel 06/0 : 2[2] -> 1[1] via P2P/CUMEM
h100-kickoff-llama-70b-chat-20230328-bcf75c4f6-gn6nd:2338:2379 [2] NCCL INFO Channel 07/0 : 2[2] -> 1[1] via P2P/CUMEM
h100-kickoff-llama-70b-chat-20230328-bcf75c4f6-gn6nd:2338:2379 [2] NCCL INFO Channel 08/0 : 2[2] -> 1[1] via P2P/CUMEM
h100-kickoff-llama-70b-chat-20230328-bcf75c4f6-gn6nd:2338:2379 [2] NCCL INFO Channel 09/0 : 2[2] -> 1[1] via P2P/CUMEM
h100-kickoff-llama-70b-chat-20230328-bcf75c4f6-gn6nd:2338:2379 [2] NCCL INFO Channel 10/0 : 2[2] -> 1[1] via P2P/CUMEM
h100-kickoff-llama-70b-chat-20230328-bcf75c4f6-gn6nd:2338:2379 [2] NCCL INFO Channel 11/0 : 2[2] -> 1[1] via P2P/CUMEM
h100-kickoff-llama-70b-chat-20230328-bcf75c4f6-gn6nd:2338:2379 [2] NCCL INFO Channel 12/0 : 2[2] -> 1[1] via P2P/CUMEM
h100-kickoff-llama-70b-chat-20230328-bcf75c4f6-gn6nd:2338:2379 [2] NCCL INFO Channel 13/0 : 2[2] -> 1[1] via P2P/CUMEM
h100-kickoff-llama-70b-chat-20230328-bcf75c4f6-gn6nd:2338:2379 [2] NCCL INFO Channel 14/0 : 2[2] -> 1[1] via P2P/CUMEM
h100-kickoff-llama-70b-chat-20230328-bcf75c4f6-gn6nd:2338:2379 [2] NCCL INFO Channel 15/0 : 2[2] -> 1[1] via P2P/CUMEM
h100-kickoff-llama-70b-chat-20230328-bcf75c4f6-gn6nd:2338:2379 [2] NCCL INFO Channel 16/0 : 2[2] -> 1[1] via P2P/CUMEM
h100-kickoff-llama-70b-chat-20230328-bcf75c4f6-gn6nd:2338:2379 [2] NCCL INFO Channel 17/0 : 2[2] -> 1[1] via P2P/CUMEM
h100-kickoff-llama-70b-chat-20230328-bcf75c4f6-gn6nd:2338:2379 [2] NCCL INFO Channel 18/0 : 2[2] -> 1[1] via P2P/CUMEM
h100-kickoff-llama-70b-chat-20230328-bcf75c4f6-gn6nd:2338:2379 [2] NCCL INFO Channel 19/0 : 2[2] -> 1[1] via P2P/CUMEM
h100-kickoff-llama-70b-chat-20230328-bcf75c4f6-gn6nd:2338:2379 [2] NCCL INFO Channel 20/0 : 2[2] -> 1[1] via P2P/CUMEM
h100-kickoff-llama-70b-chat-20230328-bcf75c4f6-gn6nd:2338:2379 [2] NCCL INFO Channel 21/0 : 2[2] -> 1[1] via P2P/CUMEM
h100-kickoff-llama-70b-chat-20230328-bcf75c4f6-gn6nd:2338:2379 [2] NCCL INFO Channel 22/0 : 2[2] -> 1[1] via P2P/CUMEM
h100-kickoff-llama-70b-chat-20230328-bcf75c4f6-gn6nd:2338:2379 [2] NCCL INFO Channel 23/0 : 2[2] -> 1[1] via P2P/CUMEM
h100-kickoff-llama-70b-chat-20230328-bcf75c4f6-gn6nd:2342:2382 [6] NCCL INFO Connected all trees
h100-kickoff-llama-70b-chat-20230328-bcf75c4f6-gn6nd:2341:2380 [5] NCCL INFO Connected all trees
h100-kickoff-llama-70b-chat-20230328-bcf75c4f6-gn6nd:2336:2377 [0] NCCL INFO Connected all trees
h100-kickoff-llama-70b-chat-20230328-bcf75c4f6-gn6nd:2339:2383 [3] NCCL INFO Connected all trees
h100-kickoff-llama-70b-chat-20230328-bcf75c4f6-gn6nd:2342:2382 [6] NCCL INFO NVLS comm 0x87952f0 headRank 7 nHeads 8 buffSize 4194304 memSize 2097152 nvlsPerRankSize 301989888 nvlsTotalSize 2415919104
h100-kickoff-llama-70b-chat-20230328-bcf75c4f6-gn6nd:2341:2380 [5] NCCL INFO NVLS comm 0x8df0a00 headRank 6 nHeads 8 buffSize 4194304 memSize 2097152 nvlsPerRankSize 301989888 nvlsTotalSize 2415919104
h100-kickoff-llama-70b-chat-20230328-bcf75c4f6-gn6nd:2343:2381 [7] NCCL INFO Connected all trees
h100-kickoff-llama-70b-chat-20230328-bcf75c4f6-gn6nd:2340:2384 [4] NCCL INFO Connected all trees
h100-kickoff-llama-70b-chat-20230328-bcf75c4f6-gn6nd:2338:2379 [2] NCCL INFO Connected all trees
h100-kickoff-llama-70b-chat-20230328-bcf75c4f6-gn6nd:2337:2378 [1] NCCL INFO Connected all trees
h100-kickoff-llama-70b-chat-20230328-bcf75c4f6-gn6nd:2339:2383 [3] NCCL INFO NVLS comm 0x7d01420 headRank 1 nHeads 8 buffSize 4194304 memSize 2097152 nvlsPerRankSize 301989888 nvlsTotalSize 2415919104
h100-kickoff-llama-70b-chat-20230328-bcf75c4f6-gn6nd:2336:2377 [0] NCCL INFO NVLS comm 0x7a20aa0 headRank 0 nHeads 8 buffSize 4194304 memSize 2097152 nvlsPerRankSize 301989888 nvlsTotalSize 2415919104
h100-kickoff-llama-70b-chat-20230328-bcf75c4f6-gn6nd:2340:2384 [4] NCCL INFO NVLS comm 0x8d3a240 headRank 4 nHeads 8 buffSize 4194304 memSize 2097152 nvlsPerRankSize 301989888 nvlsTotalSize 2415919104
h100-kickoff-llama-70b-chat-20230328-bcf75c4f6-gn6nd:2337:2378 [1] NCCL INFO NVLS comm 0x7e3e850 headRank 2 nHeads 8 buffSize 4194304 memSize 2097152 nvlsPerRankSize 301989888 nvlsTotalSize 2415919104
h100-kickoff-llama-70b-chat-20230328-bcf75c4f6-gn6nd:2338:2379 [2] NCCL INFO NVLS comm 0x7c429b0 headRank 3 nHeads 8 buffSize 4194304 memSize 2097152 nvlsPerRankSize 301989888 nvlsTotalSize 2415919104
h100-kickoff-llama-70b-chat-20230328-bcf75c4f6-gn6nd:2343:2381 [7] NCCL INFO NVLS comm 0x7687540 headRank 5 nHeads 8 buffSize 4194304 memSize 2097152 nvlsPerRankSize 301989888 nvlsTotalSize 2415919104

h100-kickoff-llama-70b-chat-20230328-bcf75c4f6-gn6nd:2340:2384 [4] transport/nvls.cc:157 NCCL WARN Cuda failure 1 'invalid argument'
h100-kickoff-llama-70b-chat-20230328-bcf75c4f6-gn6nd:2340:2384 [4] NCCL INFO transport/nvls.cc:328 -> 1

h100-kickoff-llama-70b-chat-20230328-bcf75c4f6-gn6nd:2338:2379 [2] transport/nvls.cc:157 NCCL WARN Cuda failure 1 'invalid argument'
h100-kickoff-llama-70b-chat-20230328-bcf75c4f6-gn6nd:2340:2384 [4] NCCL INFO init.cc:1236 -> 1

h100-kickoff-llama-70b-chat-20230328-bcf75c4f6-gn6nd:2341:2380 [5] transport/nvls.cc:157 NCCL WARN Cuda failure 1 'invalid argument'

h100-kickoff-llama-70b-chat-20230328-bcf75c4f6-gn6nd:2337:2378 [1] transport/nvls.cc:157 NCCL WARN Cuda failure 1 'invalid argument'

h100-kickoff-llama-70b-chat-20230328-bcf75c4f6-gn6nd:2339:2383 [3] transport/nvls.cc:157 NCCL WARN Cuda failure 1 'invalid argument'
h100-kickoff-llama-70b-chat-20230328-bcf75c4f6-gn6nd:2338:2379 [2] NCCL INFO transport/nvls.cc:328 -> 1

h100-kickoff-llama-70b-chat-20230328-bcf75c4f6-gn6nd:2336:2377 [0] transport/nvls.cc:157 NCCL WARN Cuda failure 1 'invalid argument'

h100-kickoff-llama-70b-chat-20230328-bcf75c4f6-gn6nd:2342:2382 [6] transport/nvls.cc:157 NCCL WARN Cuda failure 1 'invalid argument'
h100-kickoff-llama-70b-chat-20230328-bcf75c4f6-gn6nd:2340:2384 [4] NCCL INFO init.cc:1501 -> 1
h100-kickoff-llama-70b-chat-20230328-bcf75c4f6-gn6nd:2341:2380 [5] NCCL INFO transport/nvls.cc:328 -> 1

h100-kickoff-llama-70b-chat-20230328-bcf75c4f6-gn6nd:2343:2381 [7] transport/nvls.cc:157 NCCL WARN Cuda failure 1 'invalid argument'
h100-kickoff-llama-70b-chat-20230328-bcf75c4f6-gn6nd:2337:2378 [1] NCCL INFO transport/nvls.cc:328 -> 1
h100-kickoff-llama-70b-chat-20230328-bcf75c4f6-gn6nd:2339:2383 [3] NCCL INFO transport/nvls.cc:328 -> 1
h100-kickoff-llama-70b-chat-20230328-bcf75c4f6-gn6nd:2338:2379 [2] NCCL INFO init.cc:1236 -> 1
h100-kickoff-llama-70b-chat-20230328-bcf75c4f6-gn6nd:2336:2377 [0] NCCL INFO transport/nvls.cc:328 -> 1
h100-kickoff-llama-70b-chat-20230328-bcf75c4f6-gn6nd:2342:2382 [6] NCCL INFO transport/nvls.cc:328 -> 1
h100-kickoff-llama-70b-chat-20230328-bcf75c4f6-gn6nd:2340:2384 [4] NCCL INFO group.cc:64 -> 1 [Async thread]
h100-kickoff-llama-70b-chat-20230328-bcf75c4f6-gn6nd:2341:2380 [5] NCCL INFO init.cc:1236 -> 1
h100-kickoff-llama-70b-chat-20230328-bcf75c4f6-gn6nd:2338:2379 [2] NCCL INFO init.cc:1501 -> 1
h100-kickoff-llama-70b-chat-20230328-bcf75c4f6-gn6nd:2343:2381 [7] NCCL INFO transport/nvls.cc:328 -> 1
h100-kickoff-llama-70b-chat-20230328-bcf75c4f6-gn6nd:2337:2378 [1] NCCL INFO init.cc:1236 -> 1
h100-kickoff-llama-70b-chat-20230328-bcf75c4f6-gn6nd:2339:2383 [3] NCCL INFO init.cc:1236 -> 1
h100-kickoff-llama-70b-chat-20230328-bcf75c4f6-gn6nd:2336:2377 [0] NCCL INFO init.cc:1236 -> 1
h100-kickoff-llama-70b-chat-20230328-bcf75c4f6-gn6nd:2342:2382 [6] NCCL INFO init.cc:1236 -> 1
h100-kickoff-llama-70b-chat-20230328-bcf75c4f6-gn6nd:2338:2379 [2] NCCL INFO group.cc:64 -> 1 [Async thread]
h100-kickoff-llama-70b-chat-20230328-bcf75c4f6-gn6nd:2343:2381 [7] NCCL INFO init.cc:1236 -> 1
h100-kickoff-llama-70b-chat-20230328-bcf75c4f6-gn6nd:2341:2380 [5] NCCL INFO init.cc:1501 -> 1
h100-kickoff-llama-70b-chat-20230328-bcf75c4f6-gn6nd:2337:2378 [1] NCCL INFO init.cc:1501 -> 1
h100-kickoff-llama-70b-chat-20230328-bcf75c4f6-gn6nd:2339:2383 [3] NCCL INFO init.cc:1501 -> 1
h100-kickoff-llama-70b-chat-20230328-bcf75c4f6-gn6nd:2336:2377 [0] NCCL INFO init.cc:1501 -> 1
h100-kickoff-llama-70b-chat-20230328-bcf75c4f6-gn6nd:2342:2382 [6] NCCL INFO init.cc:1501 -> 1
h100-kickoff-llama-70b-chat-20230328-bcf75c4f6-gn6nd:2339:2383 [3] NCCL INFO group.cc:64 -> 1 [Async thread]
h100-kickoff-llama-70b-chat-20230328-bcf75c4f6-gn6nd:2343:2381 [7] NCCL INFO init.cc:1501 -> 1
h100-kickoff-llama-70b-chat-20230328-bcf75c4f6-gn6nd:2341:2380 [5] NCCL INFO group.cc:64 -> 1 [Async thread]
h100-kickoff-llama-70b-chat-20230328-bcf75c4f6-gn6nd:2337:2378 [1] NCCL INFO group.cc:64 -> 1 [Async thread]
h100-kickoff-llama-70b-chat-20230328-bcf75c4f6-gn6nd:2336:2377 [0] NCCL INFO group.cc:64 -> 1 [Async thread]
h100-kickoff-llama-70b-chat-20230328-bcf75c4f6-gn6nd:2342:2382 [6] NCCL INFO group.cc:64 -> 1 [Async thread]
h100-kickoff-llama-70b-chat-20230328-bcf75c4f6-gn6nd:2343:2381 [7] NCCL INFO group.cc:64 -> 1 [Async thread]
h100-kickoff-llama-70b-chat-20230328-bcf75c4f6-gn6nd:2340:2340 [4] NCCL INFO group.cc:418 -> 1
h100-kickoff-llama-70b-chat-20230328-bcf75c4f6-gn6nd:2340:2340 [4] NCCL INFO init.cc:1876 -> 1
h100-kickoff-llama-70b-chat-20230328-bcf75c4f6-gn6nd:2338:2338 [2] NCCL INFO group.cc:418 -> 1
h100-kickoff-llama-70b-chat-20230328-bcf75c4f6-gn6nd:2338:2338 [2] NCCL INFO init.cc:1876 -> 1
h100-kickoff-llama-70b-chat-20230328-bcf75c4f6-gn6nd:2342:2342 [6] NCCL INFO group.cc:418 -> 1
h100-kickoff-llama-70b-chat-20230328-bcf75c4f6-gn6nd:2341:2341 [5] NCCL INFO group.cc:418 -> 1
h100-kickoff-llama-70b-chat-20230328-bcf75c4f6-gn6nd:2342:2342 [6] NCCL INFO init.cc:1876 -> 1
h100-kickoff-llama-70b-chat-20230328-bcf75c4f6-gn6nd:2341:2341 [5] NCCL INFO init.cc:1876 -> 1
h100-kickoff-llama-70b-chat-20230328-bcf75c4f6-gn6nd:2343:2343 [7] NCCL INFO group.cc:418 -> 1
h100-kickoff-llama-70b-chat-20230328-bcf75c4f6-gn6nd:2336:2336 [0] NCCL INFO group.cc:418 -> 1
h100-kickoff-llama-70b-chat-20230328-bcf75c4f6-gn6nd:2337:2337 [1] NCCL INFO group.cc:418 -> 1
h100-kickoff-llama-70b-chat-20230328-bcf75c4f6-gn6nd:2343:2343 [7] NCCL INFO init.cc:1876 -> 1
h100-kickoff-llama-70b-chat-20230328-bcf75c4f6-gn6nd:2336:2336 [0] NCCL INFO init.cc:1876 -> 1
h100-kickoff-llama-70b-chat-20230328-bcf75c4f6-gn6nd:2337:2337 [1] NCCL INFO init.cc:1876 -> 1
h100-kickoff-llama-70b-chat-20230328-bcf75c4f6-gn6nd:2339:2339 [3] NCCL INFO group.cc:418 -> 1
h100-kickoff-llama-70b-chat-20230328-bcf75c4f6-gn6nd:2339:2339 [3] NCCL INFO init.cc:1876 -> 1
h100-kickoff-llama-70b-chat-20230328-bcf75c4f6-gn6nd:2340:2395 [4] NCCL INFO [Service thread] Connection closed by localRank 4
h100-kickoff-llama-70b-chat-20230328-bcf75c4f6-gn6nd:2342:2391 [6] NCCL INFO [Service thread] Connection closed by localRank 6
h100-kickoff-llama-70b-chat-20230328-bcf75c4f6-gn6nd:2337:2397 [1] NCCL INFO [Service thread] Connection closed by localRank 1
h100-kickoff-llama-70b-chat-20230328-bcf75c4f6-gn6nd:2338:2400 [2] NCCL INFO [Service thread] Connection closed by localRank 2
h100-kickoff-llama-70b-chat-20230328-bcf75c4f6-gn6nd:2339:2399 [3] NCCL INFO [Service thread] Connection closed by localRank 3
h100-kickoff-llama-70b-chat-20230328-bcf75c4f6-gn6nd:2336:2405 [0] NCCL INFO [Service thread] Connection closed by localRank 0
h100-kickoff-llama-70b-chat-20230328-bcf75c4f6-gn6nd:2343:2403 [7] NCCL INFO [Service thread] Connection closed by localRank 7
h100-kickoff-llama-70b-chat-20230328-bcf75c4f6-gn6nd:2341:2392 [5] NCCL INFO [Service thread] Connection closed by localRank 5
h100-kickoff-llama-70b-chat-20230328-bcf75c4f6-gn6nd:2342:2342 [6] NCCL INFO comm 0x87952f0 rank 6 nranks 8 cudaDev 6 busId ba000 - Abort COMPLETE
h100-kickoff-llama-70b-chat-20230328-bcf75c4f6-gn6nd:2336:2336 [0] NCCL INFO comm 0x7a20aa0 rank 0 nranks 8 cudaDev 0 busId 18000 - Abort COMPLETE
h100-kickoff-llama-70b-chat-20230328-bcf75c4f6-gn6nd:2343:2343 [7] NCCL INFO comm 0x7687540 rank 7 nranks 8 cudaDev 7 busId db000 - Abort COMPLETE
h100-kickoff-llama-70b-chat-20230328-bcf75c4f6-gn6nd:2341:2341 [5] NCCL INFO comm 0x8df0a00 rank 5 nranks 8 cudaDev 5 busId ab000 - Abort COMPLETE
h100-kickoff-llama-70b-chat-20230328-bcf75c4f6-gn6nd:2339:2339 [3] NCCL INFO comm 0x7d01420 rank 3 nranks 8 cudaDev 3 busId 5d000 - Abort COMPLETE
h100-kickoff-llama-70b-chat-20230328-bcf75c4f6-gn6nd:2337:2337 [1] NCCL INFO comm 0x7e3e850 rank 1 nranks 8 cudaDev 1 busId 2a000 - Abort COMPLETE
h100-kickoff-llama-70b-chat-20230328-bcf75c4f6-gn6nd:2338:2338 [2] NCCL INFO comm 0x7c429b0 rank 2 nranks 8 cudaDev 2 busId 3a000 - Abort COMPLETE
h100-kickoff-llama-70b-chat-20230328-bcf75c4f6-gn6nd:2340:2340 [4] NCCL INFO comm 0x8d3a240 rank 4 nranks 8 cudaDev 4 busId 9a000 - Abort COMPLETE
[rank6]: Traceback (most recent call last):
[rank6]:   File "/tmp/test.py", line 8, in <module>
[rank6]:     dist.all_reduce(data, op=dist.ReduceOp.SUM)
[rank6]:   File "/opt/conda/lib/python3.10/site-packages/torch/distributed/c10d_logger.py", line 79, in wrapper
[rank6]:     return func(*args, **kwargs)
[rank6]:   File "/opt/conda/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py", line 2288, in all_reduce
[rank6]:     work = group.allreduce([tensor], opts)
[rank6]: torch.distributed.DistBackendError: NCCL error in: ../torch/csrc/distributed/c10d/NCCLUtils.hpp:275, unhandled cuda error (run with NCCL_DEBUG=INFO for details), NCCL version 2.20.5
[rank6]: ncclUnhandledCudaError: Call to CUDA function failed.
[rank6]: Last error:
[rank6]: Cuda failure 1 'invalid argument'
[rank5]: Traceback (most recent call last):
[rank5]:   File "/tmp/test.py", line 8, in <module>
[rank5]:     dist.all_reduce(data, op=dist.ReduceOp.SUM)
[rank5]:   File "/opt/conda/lib/python3.10/site-packages/torch/distributed/c10d_logger.py", line 79, in wrapper
[rank5]:     return func(*args, **kwargs)
[rank5]:   File "/opt/conda/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py", line 2288, in all_reduce
[rank5]:     work = group.allreduce([tensor], opts)
[rank5]: torch.distributed.DistBackendError: NCCL error in: ../torch/csrc/distributed/c10d/NCCLUtils.hpp:275, unhandled cuda error (run with NCCL_DEBUG=INFO for details), NCCL version 2.20.5
[rank5]: ncclUnhandledCudaError: Call to CUDA function failed.
[rank5]: Last error:
[rank5]: Cuda failure 1 'invalid argument'
[rank1]: Traceback (most recent call last):
[rank1]:   File "/tmp/test.py", line 8, in <module>
[rank1]:     dist.all_reduce(data, op=dist.ReduceOp.SUM)
[rank1]:   File "/opt/conda/lib/python3.10/site-packages/torch/distributed/c10d_logger.py", line 79, in wrapper
[rank1]:     return func(*args, **kwargs)
[rank1]:   File "/opt/conda/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py", line 2288, in all_reduce
[rank1]:     work = group.allreduce([tensor], opts)
[rank1]: torch.distributed.DistBackendError: NCCL error in: ../torch/csrc/distributed/c10d/NCCLUtils.hpp:275, unhandled cuda error (run with NCCL_DEBUG=INFO for details), NCCL version 2.20.5
[rank1]: ncclUnhandledCudaError: Call to CUDA function failed.
[rank1]: Last error:
[rank1]: Cuda failure 1 'invalid argument'
[rank0]: Traceback (most recent call last):
[rank0]:   File "/tmp/test.py", line 8, in <module>
[rank0]:     dist.all_reduce(data, op=dist.ReduceOp.SUM)
[rank0]:   File "/opt/conda/lib/python3.10/site-packages/torch/distributed/c10d_logger.py", line 79, in wrapper
[rank0]:     return func(*args, **kwargs)
[rank0]:   File "/opt/conda/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py", line 2288, in all_reduce
[rank0]:     work = group.allreduce([tensor], opts)
[rank0]: torch.distributed.DistBackendError: NCCL error in: ../torch/csrc/distributed/c10d/NCCLUtils.hpp:275, unhandled cuda error (run with NCCL_DEBUG=INFO for details), NCCL version 2.20.5
[rank0]: ncclUnhandledCudaError: Call to CUDA function failed.
[rank0]: Last error:
[rank0]: Cuda failure 1 'invalid argument'
[rank4]: Traceback (most recent call last):
[rank4]:   File "/tmp/test.py", line 8, in <module>
[rank4]:     dist.all_reduce(data, op=dist.ReduceOp.SUM)
[rank4]:   File "/opt/conda/lib/python3.10/site-packages/torch/distributed/c10d_logger.py", line 79, in wrapper
[rank4]:     return func(*args, **kwargs)
[rank4]:   File "/opt/conda/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py", line 2288, in all_reduce
[rank4]:     work = group.allreduce([tensor], opts)
[rank4]: torch.distributed.DistBackendError: NCCL error in: ../torch/csrc/distributed/c10d/NCCLUtils.hpp:275, unhandled cuda error (run with NCCL_DEBUG=INFO for details), NCCL version 2.20.5
[rank4]: ncclUnhandledCudaError: Call to CUDA function failed.
[rank4]: Last error:
[rank4]: Cuda failure 1 'invalid argument'
[rank2]: Traceback (most recent call last):
[rank2]:   File "/tmp/test.py", line 8, in <module>
[rank2]:     dist.all_reduce(data, op=dist.ReduceOp.SUM)
[rank2]:   File "/opt/conda/lib/python3.10/site-packages/torch/distributed/c10d_logger.py", line 79, in wrapper
[rank2]:     return func(*args, **kwargs)
[rank2]:   File "/opt/conda/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py", line 2288, in all_reduce
[rank2]:     work = group.allreduce([tensor], opts)
[rank2]: torch.distributed.DistBackendError: NCCL error in: ../torch/csrc/distributed/c10d/NCCLUtils.hpp:275, unhandled cuda error (run with NCCL_DEBUG=INFO for details), NCCL version 2.20.5
[rank2]: ncclUnhandledCudaError: Call to CUDA function failed.
[rank2]: Last error:
[rank2]: Cuda failure 1 'invalid argument'
[rank7]: Traceback (most recent call last):
[rank7]:   File "/tmp/test.py", line 8, in <module>
[rank7]:     dist.all_reduce(data, op=dist.ReduceOp.SUM)
[rank7]:   File "/opt/conda/lib/python3.10/site-packages/torch/distributed/c10d_logger.py", line 79, in wrapper
[rank7]:     return func(*args, **kwargs)
[rank7]:   File "/opt/conda/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py", line 2288, in all_reduce
[rank7]:     work = group.allreduce([tensor], opts)
[rank7]: torch.distributed.DistBackendError: NCCL error in: ../torch/csrc/distributed/c10d/NCCLUtils.hpp:275, unhandled cuda error (run with NCCL_DEBUG=INFO for details), NCCL version 2.20.5
[rank7]: ncclUnhandledCudaError: Call to CUDA function failed.
[rank7]: Last error:
[rank7]: Cuda failure 1 'invalid argument'
[rank3]: Traceback (most recent call last):
[rank3]:   File "/tmp/test.py", line 8, in <module>
[rank3]:     dist.all_reduce(data, op=dist.ReduceOp.SUM)
[rank3]:   File "/opt/conda/lib/python3.10/site-packages/torch/distributed/c10d_logger.py", line 79, in wrapper
[rank3]:     return func(*args, **kwargs)
[rank3]:   File "/opt/conda/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py", line 2288, in all_reduce
[rank3]:     work = group.allreduce([tensor], opts)
[rank3]: torch.distributed.DistBackendError: NCCL error in: ../torch/csrc/distributed/c10d/NCCLUtils.hpp:275, unhandled cuda error (run with NCCL_DEBUG=INFO for details), NCCL version 2.20.5
[rank3]: ncclUnhandledCudaError: Call to CUDA function failed.
[rank3]: Last error:
[rank3]: Cuda failure 1 'invalid argument'
[rank0]:[W819 19:29:00.828498674 ProcessGroupNCCL.cpp:1168] Warning: WARNING: process group has NOT been destroyed before we destruct ProcessGroupNCCL. On normal program exit, the application should call destroy_process_group to ensure that any pending NCCL operations have finished in this process. In rare cases this process can exit before this point and block the progress of another member of the process group. This constraint has always been present,  but this warning has only been added since PyTorch 2.4 (function operator())
W0819 19:29:00.785000 140406652626752 torch/distributed/elastic/multiprocessing/api.py:858] Sending process 2336 closing signal SIGTERM
W0819 19:29:00.786000 140406652626752 torch/distributed/elastic/multiprocessing/api.py:858] Sending process 2337 closing signal SIGTERM
W0819 19:29:00.786000 140406652626752 torch/distributed/elastic/multiprocessing/api.py:858] Sending process 2338 closing signal SIGTERM
W0819 19:29:00.786000 140406652626752 torch/distributed/elastic/multiprocessing/api.py:858] Sending process 2339 closing signal SIGTERM
W0819 19:29:00.786000 140406652626752 torch/distributed/elastic/multiprocessing/api.py:858] Sending process 2340 closing signal SIGTERM
W0819 19:29:00.786000 140406652626752 torch/distributed/elastic/multiprocessing/api.py:858] Sending process 2343 closing signal SIGTERM
E0819 19:29:01.095000 140406652626752 torch/distributed/elastic/multiprocessing/api.py:833] failed (exitcode: 1) local_rank: 5 (pid: 2341) of binary: /opt/conda/bin/python
Traceback (most recent call last):
  File "/opt/conda/bin/torchrun", line 8, in <module>
    sys.exit(main())
  File "/opt/conda/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 348, in wrapper
    return f(*args, **kwargs)
  File "/opt/conda/lib/python3.10/site-packages/torch/distributed/run.py", line 901, in main
    run(args)
  File "/opt/conda/lib/python3.10/site-packages/torch/distributed/run.py", line 892, in run
    elastic_launch(
  File "/opt/conda/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 133, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/opt/conda/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 264, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError: 
============================================================
test.py FAILED
------------------------------------------------------------
Failures:
[1]:
  time      : 2024-08-19_19:29:00
  host      : h100-kickoff-llama-70b-chat-20230328-bcf75c4f6-gn6nd
  rank      : 6 (local_rank: 6)
  exitcode  : 1 (pid: 2342)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
------------------------------------------------------------
Root Cause (first observed failure):
[0]:
  time      : 2024-08-19_19:29:00
  host      : h100-kickoff-llama-70b-chat-20230328-bcf75c4f6-gn6nd
  rank      : 5 (local_rank: 5)
  exitcode  : 1 (pid: 2341)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
============================================================
AddyLaddy commented 3 weeks ago

This error:

h100-kickoff-llama-70b-chat-20230328-bcf75c4f6-gn6nd:2340:2384 [4] transport/nvls.cc:157 NCCL WARN Cuda failure 1 'invalid argument'
h100-kickoff-llama-70b-chat-20230328-bcf75c4f6-gn6nd:2340:2384 [4] NCCL INFO transport/nvls.cc:328 -> 1

is usually caused by a GPU being reset without first stopping and then restarting the NVIDIA Fabric Manager service. See Section 2.2 of the Fabric Manager User Guide.
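On a systemd-based host, the sequence described above would look roughly like this (a sketch; the exact service name and reset flags can vary by driver package, and these commands need root):

```shell
# Stop the Fabric Manager before touching the GPUs
sudo systemctl stop nvidia-fabricmanager

# Reset the affected GPU(s); -i selects a GPU index, omit it to reset all
sudo nvidia-smi --gpu-reset -i 4

# Restart the Fabric Manager and confirm it came back healthy
sudo systemctl start nvidia-fabricmanager
systemctl status nvidia-fabricmanager
```

Since the failure is in `transport/nvls.cc`, setting `NCCL_NVLS_ENABLE=0` may also work around the crash by disabling NVLink SHARP, at the cost of losing NVLS reductions.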

xiejibing commented 3 weeks ago

@AddyLaddy Thanks a lot for the information! We will check the GPU and Fabric Manager status.
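For anyone else hitting this, a quick way to check the fabric state on an HGX node (assuming a recent driver that exposes the Fabric section in `nvidia-smi -q`):

```shell
# Fabric state per GPU; on a healthy node every GPU reports
# State: Completed, Status: Success
nvidia-smi -q | grep -A 2 -i 'Fabric'

# Verify the Fabric Manager service is running
systemctl is-active nvidia-fabricmanager
```

If any GPU reports an in-progress or failed fabric state, NVLS setup will fail with exactly this `Cuda failure 1 'invalid argument'` pattern.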