xiejibing opened 3 weeks ago
Here is the log after running `NCCL_DEBUG=TRACE torchrun --nproc-per-node=8 test.py`:
vllm: 0.5.4
cuda: 12.2
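Since the trace below is long, here is a rough sketch of a helper that may make it easier to skim. It was not part of the original run, and the function name `summarize_nccl_log` and both regular expressions are my own assumptions, based only on the line shapes visible in this log (`Channel NN/0 : a[x] -> b[y] via ...` and `Using network ...`). It counts how many channels connect each pair of ranks over each transport, and records which network each device index reports.

```python
import re
from collections import Counter

# Assumed line shape, taken from this log:
#   host:2341:2380 [5] NCCL INFO Channel 00/0 : 5[5] -> 6[6] via P2P/CUMEM
P2P_LINE = re.compile(
    r"NCCL INFO Channel (\d+)/\d+ : (\d+)\[\d+\] -> (\d+)\[\d+\] via (\S+)"
)

# Assumed line shape, taken from this log:
#   host:2336:2377 [0] NCCL INFO Using network Socket
NET_LINE = re.compile(r"\[(\d+)\] NCCL INFO Using network (\S+)")


def summarize_nccl_log(lines):
    """Return (links, networks) for an iterable of NCCL log lines.

    links    -- Counter mapping (src_rank, dst_rank, transport) to the
                number of channels set up for that connection
    networks -- dict mapping the bracketed device index to the network
                name that process reported
    """
    links = Counter()
    networks = {}
    for line in lines:
        match = P2P_LINE.search(line)
        if match:
            _channel, src, dst, transport = match.groups()
            links[(int(src), int(dst), transport)] += 1
            continue
        match = NET_LINE.search(line)
        if match:
            networks[int(match.group(1))] = match.group(2)
    return links, networks
```

To use it, save the output to a file (the name is up to you) and pass the open file object to `summarize_nccl_log`. Ring-order lines such as `Channel 00/24 : 0 3 1 2 4 7 5 6` are intentionally not matched by the first pattern.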
-----
W0819 19:28:50.060000 140406652626752 torch/distributed/run.py:779]
W0819 19:28:50.060000 140406652626752 torch/distributed/run.py:779] *****************************************
W0819 19:28:50.060000 140406652626752 torch/distributed/run.py:779] Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed.
W0819 19:28:50.060000 140406652626752 torch/distributed/run.py:779] *****************************************
h100-kickoff-llama-70b-chat-20230328-bcf75c4f6-gn6nd:2336:2336 [0] NCCL INFO Bootstrap : Using eth0:10.177.137.15<0>
h100-kickoff-llama-70b-chat-20230328-bcf75c4f6-gn6nd:2336:2336 [0] NCCL INFO NET/Plugin : dlerror=libnccl-net.so: cannot open shared object file: No such file or directory No plugin found (libnccl-net.so), using internal implementation
h100-kickoff-llama-70b-chat-20230328-bcf75c4f6-gn6nd:2336:2336 [0] NCCL INFO cudaDriverVersion 12020
NCCL version 2.20.5+cuda12.4
h100-kickoff-llama-70b-chat-20230328-bcf75c4f6-gn6nd:2337:2337 [1] NCCL INFO cudaDriverVersion 12020
h100-kickoff-llama-70b-chat-20230328-bcf75c4f6-gn6nd:2338:2338 [2] NCCL INFO cudaDriverVersion 12020
h100-kickoff-llama-70b-chat-20230328-bcf75c4f6-gn6nd:2337:2337 [1] NCCL INFO Bootstrap : Using eth0:10.177.137.15<0>
h100-kickoff-llama-70b-chat-20230328-bcf75c4f6-gn6nd:2338:2338 [2] NCCL INFO Bootstrap : Using eth0:10.177.137.15<0>
h100-kickoff-llama-70b-chat-20230328-bcf75c4f6-gn6nd:2337:2337 [1] NCCL INFO NET/Plugin : dlerror=libnccl-net.so: cannot open shared object file: No such file or directory No plugin found (libnccl-net.so), using internal implementation
h100-kickoff-llama-70b-chat-20230328-bcf75c4f6-gn6nd:2338:2338 [2] NCCL INFO NET/Plugin : dlerror=libnccl-net.so: cannot open shared object file: No such file or directory No plugin found (libnccl-net.so), using internal implementation
h100-kickoff-llama-70b-chat-20230328-bcf75c4f6-gn6nd:2341:2341 [5] NCCL INFO cudaDriverVersion 12020
h100-kickoff-llama-70b-chat-20230328-bcf75c4f6-gn6nd:2343:2343 [7] NCCL INFO cudaDriverVersion 12020
h100-kickoff-llama-70b-chat-20230328-bcf75c4f6-gn6nd:2341:2341 [5] NCCL INFO Bootstrap : Using eth0:10.177.137.15<0>
h100-kickoff-llama-70b-chat-20230328-bcf75c4f6-gn6nd:2343:2343 [7] NCCL INFO Bootstrap : Using eth0:10.177.137.15<0>
h100-kickoff-llama-70b-chat-20230328-bcf75c4f6-gn6nd:2341:2341 [5] NCCL INFO NET/Plugin : dlerror=libnccl-net.so: cannot open shared object file: No such file or directory No plugin found (libnccl-net.so), using internal implementation
h100-kickoff-llama-70b-chat-20230328-bcf75c4f6-gn6nd:2343:2343 [7] NCCL INFO NET/Plugin : dlerror=libnccl-net.so: cannot open shared object file: No such file or directory No plugin found (libnccl-net.so), using internal implementation
h100-kickoff-llama-70b-chat-20230328-bcf75c4f6-gn6nd:2342:2342 [6] NCCL INFO cudaDriverVersion 12020
h100-kickoff-llama-70b-chat-20230328-bcf75c4f6-gn6nd:2339:2339 [3] NCCL INFO cudaDriverVersion 12020
h100-kickoff-llama-70b-chat-20230328-bcf75c4f6-gn6nd:2342:2342 [6] NCCL INFO Bootstrap : Using eth0:10.177.137.15<0>
h100-kickoff-llama-70b-chat-20230328-bcf75c4f6-gn6nd:2339:2339 [3] NCCL INFO Bootstrap : Using eth0:10.177.137.15<0>
h100-kickoff-llama-70b-chat-20230328-bcf75c4f6-gn6nd:2340:2340 [4] NCCL INFO cudaDriverVersion 12020
h100-kickoff-llama-70b-chat-20230328-bcf75c4f6-gn6nd:2340:2340 [4] NCCL INFO Bootstrap : Using eth0:10.177.137.15<0>
h100-kickoff-llama-70b-chat-20230328-bcf75c4f6-gn6nd:2342:2342 [6] NCCL INFO NET/Plugin : dlerror=libnccl-net.so: cannot open shared object file: No such file or directory No plugin found (libnccl-net.so), using internal implementation
h100-kickoff-llama-70b-chat-20230328-bcf75c4f6-gn6nd:2339:2339 [3] NCCL INFO NET/Plugin : dlerror=libnccl-net.so: cannot open shared object file: No such file or directory No plugin found (libnccl-net.so), using internal implementation
h100-kickoff-llama-70b-chat-20230328-bcf75c4f6-gn6nd:2340:2340 [4] NCCL INFO NET/Plugin : dlerror=libnccl-net.so: cannot open shared object file: No such file or directory No plugin found (libnccl-net.so), using internal implementation
h100-kickoff-llama-70b-chat-20230328-bcf75c4f6-gn6nd:2336:2377 [0] NCCL INFO Failed to open libibverbs.so[.1]
h100-kickoff-llama-70b-chat-20230328-bcf75c4f6-gn6nd:2336:2377 [0] NCCL INFO NET/Socket : Using [0]eth0:10.177.137.15<0>
h100-kickoff-llama-70b-chat-20230328-bcf75c4f6-gn6nd:2336:2377 [0] NCCL INFO Using non-device net plugin version 0
h100-kickoff-llama-70b-chat-20230328-bcf75c4f6-gn6nd:2336:2377 [0] NCCL INFO Using network Socket
h100-kickoff-llama-70b-chat-20230328-bcf75c4f6-gn6nd:2341:2380 [5] NCCL INFO Failed to open libibverbs.so[.1]
h100-kickoff-llama-70b-chat-20230328-bcf75c4f6-gn6nd:2341:2380 [5] NCCL INFO NET/Socket : Using [0]eth0:10.177.137.15<0>
h100-kickoff-llama-70b-chat-20230328-bcf75c4f6-gn6nd:2341:2380 [5] NCCL INFO Using non-device net plugin version 0
h100-kickoff-llama-70b-chat-20230328-bcf75c4f6-gn6nd:2341:2380 [5] NCCL INFO Using network Socket
h100-kickoff-llama-70b-chat-20230328-bcf75c4f6-gn6nd:2343:2381 [7] NCCL INFO Failed to open libibverbs.so[.1]
h100-kickoff-llama-70b-chat-20230328-bcf75c4f6-gn6nd:2343:2381 [7] NCCL INFO NET/Socket : Using [0]eth0:10.177.137.15<0>
h100-kickoff-llama-70b-chat-20230328-bcf75c4f6-gn6nd:2343:2381 [7] NCCL INFO Using non-device net plugin version 0
h100-kickoff-llama-70b-chat-20230328-bcf75c4f6-gn6nd:2343:2381 [7] NCCL INFO Using network Socket
h100-kickoff-llama-70b-chat-20230328-bcf75c4f6-gn6nd:2339:2383 [3] NCCL INFO Failed to open libibverbs.so[.1]
h100-kickoff-llama-70b-chat-20230328-bcf75c4f6-gn6nd:2340:2384 [4] NCCL INFO Failed to open libibverbs.so[.1]
h100-kickoff-llama-70b-chat-20230328-bcf75c4f6-gn6nd:2337:2378 [1] NCCL INFO Failed to open libibverbs.so[.1]
h100-kickoff-llama-70b-chat-20230328-bcf75c4f6-gn6nd:2338:2379 [2] NCCL INFO Failed to open libibverbs.so[.1]
h100-kickoff-llama-70b-chat-20230328-bcf75c4f6-gn6nd:2340:2384 [4] NCCL INFO NET/Socket : Using [0]eth0:10.177.137.15<0>
h100-kickoff-llama-70b-chat-20230328-bcf75c4f6-gn6nd:2338:2379 [2] NCCL INFO NET/Socket : Using [0]eth0:10.177.137.15<0>
h100-kickoff-llama-70b-chat-20230328-bcf75c4f6-gn6nd:2339:2383 [3] NCCL INFO NET/Socket : Using [0]eth0:10.177.137.15<0>
h100-kickoff-llama-70b-chat-20230328-bcf75c4f6-gn6nd:2337:2378 [1] NCCL INFO NET/Socket : Using [0]eth0:10.177.137.15<0>
h100-kickoff-llama-70b-chat-20230328-bcf75c4f6-gn6nd:2340:2384 [4] NCCL INFO Using non-device net plugin version 0
h100-kickoff-llama-70b-chat-20230328-bcf75c4f6-gn6nd:2340:2384 [4] NCCL INFO Using network Socket
h100-kickoff-llama-70b-chat-20230328-bcf75c4f6-gn6nd:2338:2379 [2] NCCL INFO Using non-device net plugin version 0
h100-kickoff-llama-70b-chat-20230328-bcf75c4f6-gn6nd:2339:2383 [3] NCCL INFO Using non-device net plugin version 0
h100-kickoff-llama-70b-chat-20230328-bcf75c4f6-gn6nd:2338:2379 [2] NCCL INFO Using network Socket
h100-kickoff-llama-70b-chat-20230328-bcf75c4f6-gn6nd:2339:2383 [3] NCCL INFO Using network Socket
h100-kickoff-llama-70b-chat-20230328-bcf75c4f6-gn6nd:2337:2378 [1] NCCL INFO Using non-device net plugin version 0
h100-kickoff-llama-70b-chat-20230328-bcf75c4f6-gn6nd:2337:2378 [1] NCCL INFO Using network Socket
h100-kickoff-llama-70b-chat-20230328-bcf75c4f6-gn6nd:2342:2382 [6] NCCL INFO Failed to open libibverbs.so[.1]
h100-kickoff-llama-70b-chat-20230328-bcf75c4f6-gn6nd:2342:2382 [6] NCCL INFO NET/Socket : Using [0]eth0:10.177.137.15<0>
h100-kickoff-llama-70b-chat-20230328-bcf75c4f6-gn6nd:2342:2382 [6] NCCL INFO Using non-device net plugin version 0
h100-kickoff-llama-70b-chat-20230328-bcf75c4f6-gn6nd:2342:2382 [6] NCCL INFO Using network Socket
h100-kickoff-llama-70b-chat-20230328-bcf75c4f6-gn6nd:2338:2379 [2] NCCL INFO comm 0x7c429b0 rank 2 nranks 8 cudaDev 2 nvmlDev 2 busId 3a000 commId 0x50e06a01eda5a139 - Init START
h100-kickoff-llama-70b-chat-20230328-bcf75c4f6-gn6nd:2339:2383 [3] NCCL INFO comm 0x7d01420 rank 3 nranks 8 cudaDev 3 nvmlDev 3 busId 5d000 commId 0x50e06a01eda5a139 - Init START
h100-kickoff-llama-70b-chat-20230328-bcf75c4f6-gn6nd:2342:2382 [6] NCCL INFO comm 0x87952f0 rank 6 nranks 8 cudaDev 6 nvmlDev 6 busId ba000 commId 0x50e06a01eda5a139 - Init START
h100-kickoff-llama-70b-chat-20230328-bcf75c4f6-gn6nd:2343:2381 [7] NCCL INFO comm 0x7687540 rank 7 nranks 8 cudaDev 7 nvmlDev 7 busId db000 commId 0x50e06a01eda5a139 - Init START
h100-kickoff-llama-70b-chat-20230328-bcf75c4f6-gn6nd:2337:2378 [1] NCCL INFO comm 0x7e3e850 rank 1 nranks 8 cudaDev 1 nvmlDev 1 busId 2a000 commId 0x50e06a01eda5a139 - Init START
h100-kickoff-llama-70b-chat-20230328-bcf75c4f6-gn6nd:2340:2384 [4] NCCL INFO comm 0x8d3a240 rank 4 nranks 8 cudaDev 4 nvmlDev 4 busId 9a000 commId 0x50e06a01eda5a139 - Init START
h100-kickoff-llama-70b-chat-20230328-bcf75c4f6-gn6nd:2336:2377 [0] NCCL INFO comm 0x7a20aa0 rank 0 nranks 8 cudaDev 0 nvmlDev 0 busId 18000 commId 0x50e06a01eda5a139 - Init START
h100-kickoff-llama-70b-chat-20230328-bcf75c4f6-gn6nd:2341:2380 [5] NCCL INFO comm 0x8df0a00 rank 5 nranks 8 cudaDev 5 nvmlDev 5 busId ab000 commId 0x50e06a01eda5a139 - Init START
h100-kickoff-llama-70b-chat-20230328-bcf75c4f6-gn6nd:2339:2383 [3] NCCL INFO Setting affinity for GPU 3 to 0fff,ffff0000,00000000,00000000,0fffffff
h100-kickoff-llama-70b-chat-20230328-bcf75c4f6-gn6nd:2339:2383 [3] NCCL INFO NVLS multicast support is available on dev 3
h100-kickoff-llama-70b-chat-20230328-bcf75c4f6-gn6nd:2342:2382 [6] NCCL INFO Setting affinity for GPU 6 to fffffff0,00000000,00000000,0000ffff,fff00000,00000000,00000000
h100-kickoff-llama-70b-chat-20230328-bcf75c4f6-gn6nd:2342:2382 [6] NCCL INFO NVLS multicast support is available on dev 6
h100-kickoff-llama-70b-chat-20230328-bcf75c4f6-gn6nd:2340:2384 [4] NCCL INFO Setting affinity for GPU 4 to 0f,ffffff00,00000000,00000000,000fffff,ff000000,00000000
h100-kickoff-llama-70b-chat-20230328-bcf75c4f6-gn6nd:2340:2384 [4] NCCL INFO NVLS multicast support is available on dev 4
h100-kickoff-llama-70b-chat-20230328-bcf75c4f6-gn6nd:2343:2381 [7] NCCL INFO Setting affinity for GPU 7 to 0f,ffffff00,00000000,00000000,000fffff,ff000000,00000000
h100-kickoff-llama-70b-chat-20230328-bcf75c4f6-gn6nd:2343:2381 [7] NCCL INFO NVLS multicast support is available on dev 7
h100-kickoff-llama-70b-chat-20230328-bcf75c4f6-gn6nd:2338:2379 [2] NCCL INFO Setting affinity for GPU 2 to ff,fffff000,00000000,00000000,00ffffff,f0000000
h100-kickoff-llama-70b-chat-20230328-bcf75c4f6-gn6nd:2338:2379 [2] NCCL INFO NVLS multicast support is available on dev 2
h100-kickoff-llama-70b-chat-20230328-bcf75c4f6-gn6nd:2336:2377 [0] NCCL INFO Setting affinity for GPU 0 to 0fff,ffff0000,00000000,00000000,0fffffff
h100-kickoff-llama-70b-chat-20230328-bcf75c4f6-gn6nd:2336:2377 [0] NCCL INFO NVLS multicast support is available on dev 0
h100-kickoff-llama-70b-chat-20230328-bcf75c4f6-gn6nd:2337:2378 [1] NCCL INFO Setting affinity for GPU 1 to ff,fffff000,00000000,00000000,00ffffff,f0000000
h100-kickoff-llama-70b-chat-20230328-bcf75c4f6-gn6nd:2337:2378 [1] NCCL INFO NVLS multicast support is available on dev 1
h100-kickoff-llama-70b-chat-20230328-bcf75c4f6-gn6nd:2341:2380 [5] NCCL INFO Setting affinity for GPU 5 to fffffff0,00000000,00000000,0000ffff,fff00000,00000000,00000000
h100-kickoff-llama-70b-chat-20230328-bcf75c4f6-gn6nd:2341:2380 [5] NCCL INFO NVLS multicast support is available on dev 5
h100-kickoff-llama-70b-chat-20230328-bcf75c4f6-gn6nd:2342:2382 [6] NCCL INFO comm 0x87952f0 rank 6 nRanks 8 nNodes 1 localRanks 8 localRank 6 MNNVL 0
h100-kickoff-llama-70b-chat-20230328-bcf75c4f6-gn6nd:2340:2384 [4] NCCL INFO comm 0x8d3a240 rank 4 nRanks 8 nNodes 1 localRanks 8 localRank 4 MNNVL 0
h100-kickoff-llama-70b-chat-20230328-bcf75c4f6-gn6nd:2339:2383 [3] NCCL INFO comm 0x7d01420 rank 3 nRanks 8 nNodes 1 localRanks 8 localRank 3 MNNVL 0
h100-kickoff-llama-70b-chat-20230328-bcf75c4f6-gn6nd:2337:2378 [1] NCCL INFO comm 0x7e3e850 rank 1 nRanks 8 nNodes 1 localRanks 8 localRank 1 MNNVL 0
h100-kickoff-llama-70b-chat-20230328-bcf75c4f6-gn6nd:2342:2382 [6] NCCL INFO Trees [0] -1/-1/-1->6->5 [1] -1/-1/-1->6->5 [2] -1/-1/-1->6->5 [3] -1/-1/-1->6->5 [4] -1/-1/-1->6->5 [5] -1/-1/-1->6->5 [6] -1/-1/-1->6->5 [7] -1/-1/-1->6->5 [8] -1/-1/-1->6->5 [9] -1/-1/-1->6->5 [10] -1/-1/-1->6->5 [11] -1/-1/-1->6->5 [12] -1/-1/-1->6->5 [13] -1/-1/-1->6->5 [14] -1/-1/-1->6->5 [15] -1/-1/-1->6->5 [16] -1/-1/-1->6->5 [17] -1/-1/-1->6->5 [18] -1/-1/-1->6->5 [19] -1/-1/-1->6->5 [20] -1/-1/-1->6->5 [21] -1/-1/-1->6->5 [22] -1/-1/-1->6->5 [23] -1/-1/-1->6->5
h100-kickoff-llama-70b-chat-20230328-bcf75c4f6-gn6nd:2341:2380 [5] NCCL INFO comm 0x8df0a00 rank 5 nRanks 8 nNodes 1 localRanks 8 localRank 5 MNNVL 0
h100-kickoff-llama-70b-chat-20230328-bcf75c4f6-gn6nd:2342:2382 [6] NCCL INFO P2P Chunksize set to 524288
h100-kickoff-llama-70b-chat-20230328-bcf75c4f6-gn6nd:2340:2384 [4] NCCL INFO Trees [0] 7/-1/-1->4->2 [1] 7/-1/-1->4->2 [2] 7/-1/-1->4->2 [3] 7/-1/-1->4->2 [4] 7/-1/-1->4->2 [5] 7/-1/-1->4->2 [6] 7/-1/-1->4->2 [7] 7/-1/-1->4->2 [8] 7/-1/-1->4->2 [9] 7/-1/-1->4->2 [10] 7/-1/-1->4->2 [11] 7/-1/-1->4->2 [12] 7/-1/-1->4->2 [13] 7/-1/-1->4->2 [14] 7/-1/-1->4->2 [15] 7/-1/-1->4->2 [16] 7/-1/-1->4->2 [17] 7/-1/-1->4->2 [18] 7/-1/-1->4->2 [19] 7/-1/-1->4->2 [20] 7/-1/-1->4->2 [21] 7/-1/-1->4->2 [22] 7/-1/-1->4->2 [23] 7/-1/-1->4->2
h100-kickoff-llama-70b-chat-20230328-bcf75c4f6-gn6nd:2340:2384 [4] NCCL INFO P2P Chunksize set to 524288
h100-kickoff-llama-70b-chat-20230328-bcf75c4f6-gn6nd:2339:2383 [3] NCCL INFO Trees [0] 1/-1/-1->3->0 [1] 1/-1/-1->3->0 [2] 1/-1/-1->3->0 [3] 1/-1/-1->3->0 [4] 1/-1/-1->3->0 [5] 1/-1/-1->3->0 [6] 1/-1/-1->3->0 [7] 1/-1/-1->3->0 [8] 1/-1/-1->3->0 [9] 1/-1/-1->3->0 [10] 1/-1/-1->3->0 [11] 1/-1/-1->3->0 [12] 1/-1/-1->3->0 [13] 1/-1/-1->3->0 [14] 1/-1/-1->3->0 [15] 1/-1/-1->3->0 [16] 1/-1/-1->3->0 [17] 1/-1/-1->3->0 [18] 1/-1/-1->3->0 [19] 1/-1/-1->3->0 [20] 1/-1/-1->3->0 [21] 1/-1/-1->3->0 [22] 1/-1/-1->3->0 [23] 1/-1/-1->3->0
h100-kickoff-llama-70b-chat-20230328-bcf75c4f6-gn6nd:2341:2380 [5] NCCL INFO Trees [0] 6/-1/-1->5->7 [1] 6/-1/-1->5->7 [2] 6/-1/-1->5->7 [3] 6/-1/-1->5->7 [4] 6/-1/-1->5->7 [5] 6/-1/-1->5->7 [6] 6/-1/-1->5->7 [7] 6/-1/-1->5->7 [8] 6/-1/-1->5->7 [9] 6/-1/-1->5->7 [10] 6/-1/-1->5->7 [11] 6/-1/-1->5->7 [12] 6/-1/-1->5->7 [13] 6/-1/-1->5->7 [14] 6/-1/-1->5->7 [15] 6/-1/-1->5->7 [16] 6/-1/-1->5->7 [17] 6/-1/-1->5->7 [18] 6/-1/-1->5->7 [19] 6/-1/-1->5->7 [20] 6/-1/-1->5->7 [21] 6/-1/-1->5->7 [22] 6/-1/-1->5->7 [23] 6/-1/-1->5->7
h100-kickoff-llama-70b-chat-20230328-bcf75c4f6-gn6nd:2337:2378 [1] NCCL INFO Trees [0] 2/-1/-1->1->3 [1] 2/-1/-1->1->3 [2] 2/-1/-1->1->3 [3] 2/-1/-1->1->3 [4] 2/-1/-1->1->3 [5] 2/-1/-1->1->3 [6] 2/-1/-1->1->3 [7] 2/-1/-1->1->3 [8] 2/-1/-1->1->3 [9] 2/-1/-1->1->3 [10] 2/-1/-1->1->3 [11] 2/-1/-1->1->3 [12] 2/-1/-1->1->3 [13] 2/-1/-1->1->3 [14] 2/-1/-1->1->3 [15] 2/-1/-1->1->3 [16] 2/-1/-1->1->3 [17] 2/-1/-1->1->3 [18] 2/-1/-1->1->3 [19] 2/-1/-1->1->3 [20] 2/-1/-1->1->3 [21] 2/-1/-1->1->3 [22] 2/-1/-1->1->3 [23] 2/-1/-1->1->3
h100-kickoff-llama-70b-chat-20230328-bcf75c4f6-gn6nd:2341:2380 [5] NCCL INFO P2P Chunksize set to 524288
h100-kickoff-llama-70b-chat-20230328-bcf75c4f6-gn6nd:2337:2378 [1] NCCL INFO P2P Chunksize set to 524288
h100-kickoff-llama-70b-chat-20230328-bcf75c4f6-gn6nd:2338:2379 [2] NCCL INFO comm 0x7c429b0 rank 2 nRanks 8 nNodes 1 localRanks 8 localRank 2 MNNVL 0
h100-kickoff-llama-70b-chat-20230328-bcf75c4f6-gn6nd:2343:2381 [7] NCCL INFO comm 0x7687540 rank 7 nRanks 8 nNodes 1 localRanks 8 localRank 7 MNNVL 0
h100-kickoff-llama-70b-chat-20230328-bcf75c4f6-gn6nd:2336:2377 [0] NCCL INFO comm 0x7a20aa0 rank 0 nRanks 8 nNodes 1 localRanks 8 localRank 0 MNNVL 0
h100-kickoff-llama-70b-chat-20230328-bcf75c4f6-gn6nd:2339:2383 [3] NCCL INFO P2P Chunksize set to 524288
h100-kickoff-llama-70b-chat-20230328-bcf75c4f6-gn6nd:2336:2377 [0] NCCL INFO Channel 00/24 : 0 3 1 2 4 7 5 6
h100-kickoff-llama-70b-chat-20230328-bcf75c4f6-gn6nd:2336:2377 [0] NCCL INFO Channel 01/24 : 0 3 1 2 4 7 5 6
h100-kickoff-llama-70b-chat-20230328-bcf75c4f6-gn6nd:2343:2381 [7] NCCL INFO Trees [0] 5/-1/-1->7->4 [1] 5/-1/-1->7->4 [2] 5/-1/-1->7->4 [3] 5/-1/-1->7->4 [4] 5/-1/-1->7->4 [5] 5/-1/-1->7->4 [6] 5/-1/-1->7->4 [7] 5/-1/-1->7->4 [8] 5/-1/-1->7->4 [9] 5/-1/-1->7->4 [10] 5/-1/-1->7->4 [11] 5/-1/-1->7->4 [12] 5/-1/-1->7->4 [13] 5/-1/-1->7->4 [14] 5/-1/-1->7->4 [15] 5/-1/-1->7->4 [16] 5/-1/-1->7->4 [17] 5/-1/-1->7->4 [18] 5/-1/-1->7->4 [19] 5/-1/-1->7->4 [20] 5/-1/-1->7->4 [21] 5/-1/-1->7->4 [22] 5/-1/-1->7->4 [23] 5/-1/-1->7->4
h100-kickoff-llama-70b-chat-20230328-bcf75c4f6-gn6nd:2338:2379 [2] NCCL INFO Trees [0] 4/-1/-1->2->1 [1] 4/-1/-1->2->1 [2] 4/-1/-1->2->1 [3] 4/-1/-1->2->1 [4] 4/-1/-1->2->1 [5] 4/-1/-1->2->1 [6] 4/-1/-1->2->1 [7] 4/-1/-1->2->1 [8] 4/-1/-1->2->1 [9] 4/-1/-1->2->1 [10] 4/-1/-1->2->1 [11] 4/-1/-1->2->1 [12] 4/-1/-1->2->1 [13] 4/-1/-1->2->1 [14] 4/-1/-1->2->1 [15] 4/-1/-1->2->1 [16] 4/-1/-1->2->1 [17] 4/-1/-1->2->1 [18] 4/-1/-1->2->1 [19] 4/-1/-1->2->1 [20] 4/-1/-1->2->1 [21] 4/-1/-1->2->1 [22] 4/-1/-1->2->1 [23] 4/-1/-1->2->1
h100-kickoff-llama-70b-chat-20230328-bcf75c4f6-gn6nd:2336:2377 [0] NCCL INFO Channel 02/24 : 0 3 1 2 4 7 5 6
h100-kickoff-llama-70b-chat-20230328-bcf75c4f6-gn6nd:2343:2381 [7] NCCL INFO P2P Chunksize set to 524288
h100-kickoff-llama-70b-chat-20230328-bcf75c4f6-gn6nd:2338:2379 [2] NCCL INFO P2P Chunksize set to 524288
h100-kickoff-llama-70b-chat-20230328-bcf75c4f6-gn6nd:2336:2377 [0] NCCL INFO Channel 03/24 : 0 3 1 2 4 7 5 6
h100-kickoff-llama-70b-chat-20230328-bcf75c4f6-gn6nd:2336:2377 [0] NCCL INFO Channel 04/24 : 0 3 1 2 4 7 5 6
h100-kickoff-llama-70b-chat-20230328-bcf75c4f6-gn6nd:2336:2377 [0] NCCL INFO Channel 05/24 : 0 3 1 2 4 7 5 6
h100-kickoff-llama-70b-chat-20230328-bcf75c4f6-gn6nd:2336:2377 [0] NCCL INFO Channel 06/24 : 0 3 1 2 4 7 5 6
h100-kickoff-llama-70b-chat-20230328-bcf75c4f6-gn6nd:2336:2377 [0] NCCL INFO Channel 07/24 : 0 3 1 2 4 7 5 6
h100-kickoff-llama-70b-chat-20230328-bcf75c4f6-gn6nd:2336:2377 [0] NCCL INFO Channel 08/24 : 0 3 1 2 4 7 5 6
h100-kickoff-llama-70b-chat-20230328-bcf75c4f6-gn6nd:2336:2377 [0] NCCL INFO Channel 09/24 : 0 3 1 2 4 7 5 6
h100-kickoff-llama-70b-chat-20230328-bcf75c4f6-gn6nd:2336:2377 [0] NCCL INFO Channel 10/24 : 0 3 1 2 4 7 5 6
h100-kickoff-llama-70b-chat-20230328-bcf75c4f6-gn6nd:2336:2377 [0] NCCL INFO Channel 11/24 : 0 3 1 2 4 7 5 6
h100-kickoff-llama-70b-chat-20230328-bcf75c4f6-gn6nd:2336:2377 [0] NCCL INFO Channel 12/24 : 0 3 1 2 4 7 5 6
h100-kickoff-llama-70b-chat-20230328-bcf75c4f6-gn6nd:2336:2377 [0] NCCL INFO Channel 13/24 : 0 3 1 2 4 7 5 6
h100-kickoff-llama-70b-chat-20230328-bcf75c4f6-gn6nd:2336:2377 [0] NCCL INFO Channel 14/24 : 0 3 1 2 4 7 5 6
h100-kickoff-llama-70b-chat-20230328-bcf75c4f6-gn6nd:2336:2377 [0] NCCL INFO Channel 15/24 : 0 3 1 2 4 7 5 6
h100-kickoff-llama-70b-chat-20230328-bcf75c4f6-gn6nd:2336:2377 [0] NCCL INFO Channel 16/24 : 0 3 1 2 4 7 5 6
h100-kickoff-llama-70b-chat-20230328-bcf75c4f6-gn6nd:2336:2377 [0] NCCL INFO Channel 17/24 : 0 3 1 2 4 7 5 6
h100-kickoff-llama-70b-chat-20230328-bcf75c4f6-gn6nd:2336:2377 [0] NCCL INFO Channel 18/24 : 0 3 1 2 4 7 5 6
h100-kickoff-llama-70b-chat-20230328-bcf75c4f6-gn6nd:2336:2377 [0] NCCL INFO Channel 19/24 : 0 3 1 2 4 7 5 6
h100-kickoff-llama-70b-chat-20230328-bcf75c4f6-gn6nd:2336:2377 [0] NCCL INFO Channel 20/24 : 0 3 1 2 4 7 5 6
h100-kickoff-llama-70b-chat-20230328-bcf75c4f6-gn6nd:2336:2377 [0] NCCL INFO Channel 21/24 : 0 3 1 2 4 7 5 6
h100-kickoff-llama-70b-chat-20230328-bcf75c4f6-gn6nd:2336:2377 [0] NCCL INFO Channel 22/24 : 0 3 1 2 4 7 5 6
h100-kickoff-llama-70b-chat-20230328-bcf75c4f6-gn6nd:2336:2377 [0] NCCL INFO Channel 23/24 : 0 3 1 2 4 7 5 6
h100-kickoff-llama-70b-chat-20230328-bcf75c4f6-gn6nd:2336:2377 [0] NCCL INFO Trees [0] 3/-1/-1->0->-1 [1] 3/-1/-1->0->-1 [2] 3/-1/-1->0->-1 [3] 3/-1/-1->0->-1 [4] 3/-1/-1->0->-1 [5] 3/-1/-1->0->-1 [6] 3/-1/-1->0->-1 [7] 3/-1/-1->0->-1 [8] 3/-1/-1->0->-1 [9] 3/-1/-1->0->-1 [10] 3/-1/-1->0->-1 [11] 3/-1/-1->0->-1 [12] 3/-1/-1->0->-1 [13] 3/-1/-1->0->-1 [14] 3/-1/-1->0->-1 [15] 3/-1/-1->0->-1 [16] 3/-1/-1->0->-1 [17] 3/-1/-1->0->-1 [18] 3/-1/-1->0->-1 [19] 3/-1/-1->0->-1 [20] 3/-1/-1->0->-1 [21] 3/-1/-1->0->-1 [22] 3/-1/-1->0->-1 [23] 3/-1/-1->0->-1
h100-kickoff-llama-70b-chat-20230328-bcf75c4f6-gn6nd:2336:2377 [0] NCCL INFO P2P Chunksize set to 524288
h100-kickoff-llama-70b-chat-20230328-bcf75c4f6-gn6nd:2341:2380 [5] NCCL INFO Channel 00/0 : 5[5] -> 6[6] via P2P/CUMEM
h100-kickoff-llama-70b-chat-20230328-bcf75c4f6-gn6nd:2341:2380 [5] NCCL INFO Channel 01/0 : 5[5] -> 6[6] via P2P/CUMEM
h100-kickoff-llama-70b-chat-20230328-bcf75c4f6-gn6nd:2341:2380 [5] NCCL INFO Channel 02/0 : 5[5] -> 6[6] via P2P/CUMEM
h100-kickoff-llama-70b-chat-20230328-bcf75c4f6-gn6nd:2341:2380 [5] NCCL INFO Channel 03/0 : 5[5] -> 6[6] via P2P/CUMEM
h100-kickoff-llama-70b-chat-20230328-bcf75c4f6-gn6nd:2341:2380 [5] NCCL INFO Channel 04/0 : 5[5] -> 6[6] via P2P/CUMEM
h100-kickoff-llama-70b-chat-20230328-bcf75c4f6-gn6nd:2341:2380 [5] NCCL INFO Channel 05/0 : 5[5] -> 6[6] via P2P/CUMEM
h100-kickoff-llama-70b-chat-20230328-bcf75c4f6-gn6nd:2337:2378 [1] NCCL INFO Channel 00/0 : 1[1] -> 2[2] via P2P/CUMEM
h100-kickoff-llama-70b-chat-20230328-bcf75c4f6-gn6nd:2341:2380 [5] NCCL INFO Channel 06/0 : 5[5] -> 6[6] via P2P/CUMEM
h100-kickoff-llama-70b-chat-20230328-bcf75c4f6-gn6nd:2337:2378 [1] NCCL INFO Channel 01/0 : 1[1] -> 2[2] via P2P/CUMEM
h100-kickoff-llama-70b-chat-20230328-bcf75c4f6-gn6nd:2341:2380 [5] NCCL INFO Channel 07/0 : 5[5] -> 6[6] via P2P/CUMEM
h100-kickoff-llama-70b-chat-20230328-bcf75c4f6-gn6nd:2337:2378 [1] NCCL INFO Channel 02/0 : 1[1] -> 2[2] via P2P/CUMEM
h100-kickoff-llama-70b-chat-20230328-bcf75c4f6-gn6nd:2341:2380 [5] NCCL INFO Channel 08/0 : 5[5] -> 6[6] via P2P/CUMEM
h100-kickoff-llama-70b-chat-20230328-bcf75c4f6-gn6nd:2337:2378 [1] NCCL INFO Channel 03/0 : 1[1] -> 2[2] via P2P/CUMEM
h100-kickoff-llama-70b-chat-20230328-bcf75c4f6-gn6nd:2337:2378 [1] NCCL INFO Channel 04/0 : 1[1] -> 2[2] via P2P/CUMEM
h100-kickoff-llama-70b-chat-20230328-bcf75c4f6-gn6nd:2341:2380 [5] NCCL INFO Channel 09/0 : 5[5] -> 6[6] via P2P/CUMEM
h100-kickoff-llama-70b-chat-20230328-bcf75c4f6-gn6nd:2337:2378 [1] NCCL INFO Channel 05/0 : 1[1] -> 2[2] via P2P/CUMEM
h100-kickoff-llama-70b-chat-20230328-bcf75c4f6-gn6nd:2341:2380 [5] NCCL INFO Channel 10/0 : 5[5] -> 6[6] via P2P/CUMEM
h100-kickoff-llama-70b-chat-20230328-bcf75c4f6-gn6nd:2337:2378 [1] NCCL INFO Channel 06/0 : 1[1] -> 2[2] via P2P/CUMEM
h100-kickoff-llama-70b-chat-20230328-bcf75c4f6-gn6nd:2341:2380 [5] NCCL INFO Channel 11/0 : 5[5] -> 6[6] via P2P/CUMEM
h100-kickoff-llama-70b-chat-20230328-bcf75c4f6-gn6nd:2337:2378 [1] NCCL INFO Channel 07/0 : 1[1] -> 2[2] via P2P/CUMEM
h100-kickoff-llama-70b-chat-20230328-bcf75c4f6-gn6nd:2341:2380 [5] NCCL INFO Channel 12/0 : 5[5] -> 6[6] via P2P/CUMEM
h100-kickoff-llama-70b-chat-20230328-bcf75c4f6-gn6nd:2337:2378 [1] NCCL INFO Channel 08/0 : 1[1] -> 2[2] via P2P/CUMEM
h100-kickoff-llama-70b-chat-20230328-bcf75c4f6-gn6nd:2341:2380 [5] NCCL INFO Channel 13/0 : 5[5] -> 6[6] via P2P/CUMEM
h100-kickoff-llama-70b-chat-20230328-bcf75c4f6-gn6nd:2337:2378 [1] NCCL INFO Channel 09/0 : 1[1] -> 2[2] via P2P/CUMEM
h100-kickoff-llama-70b-chat-20230328-bcf75c4f6-gn6nd:2341:2380 [5] NCCL INFO Channel 14/0 : 5[5] -> 6[6] via P2P/CUMEM
h100-kickoff-llama-70b-chat-20230328-bcf75c4f6-gn6nd:2337:2378 [1] NCCL INFO Channel 10/0 : 1[1] -> 2[2] via P2P/CUMEM
h100-kickoff-llama-70b-chat-20230328-bcf75c4f6-gn6nd:2341:2380 [5] NCCL INFO Channel 15/0 : 5[5] -> 6[6] via P2P/CUMEM
h100-kickoff-llama-70b-chat-20230328-bcf75c4f6-gn6nd:2337:2378 [1] NCCL INFO Channel 11/0 : 1[1] -> 2[2] via P2P/CUMEM
h100-kickoff-llama-70b-chat-20230328-bcf75c4f6-gn6nd:2341:2380 [5] NCCL INFO Channel 16/0 : 5[5] -> 6[6] via P2P/CUMEM
h100-kickoff-llama-70b-chat-20230328-bcf75c4f6-gn6nd:2337:2378 [1] NCCL INFO Channel 12/0 : 1[1] -> 2[2] via P2P/CUMEM
h100-kickoff-llama-70b-chat-20230328-bcf75c4f6-gn6nd:2341:2380 [5] NCCL INFO Channel 17/0 : 5[5] -> 6[6] via P2P/CUMEM
h100-kickoff-llama-70b-chat-20230328-bcf75c4f6-gn6nd:2337:2378 [1] NCCL INFO Channel 13/0 : 1[1] -> 2[2] via P2P/CUMEM
h100-kickoff-llama-70b-chat-20230328-bcf75c4f6-gn6nd:2341:2380 [5] NCCL INFO Channel 18/0 : 5[5] -> 6[6] via P2P/CUMEM
h100-kickoff-llama-70b-chat-20230328-bcf75c4f6-gn6nd:2337:2378 [1] NCCL INFO Channel 14/0 : 1[1] -> 2[2] via P2P/CUMEM
h100-kickoff-llama-70b-chat-20230328-bcf75c4f6-gn6nd:2341:2380 [5] NCCL INFO Channel 19/0 : 5[5] -> 6[6] via P2P/CUMEM
h100-kickoff-llama-70b-chat-20230328-bcf75c4f6-gn6nd:2337:2378 [1] NCCL INFO Channel 15/0 : 1[1] -> 2[2] via P2P/CUMEM
h100-kickoff-llama-70b-chat-20230328-bcf75c4f6-gn6nd:2341:2380 [5] NCCL INFO Channel 20/0 : 5[5] -> 6[6] via P2P/CUMEM
h100-kickoff-llama-70b-chat-20230328-bcf75c4f6-gn6nd:2337:2378 [1] NCCL INFO Channel 16/0 : 1[1] -> 2[2] via P2P/CUMEM
h100-kickoff-llama-70b-chat-20230328-bcf75c4f6-gn6nd:2341:2380 [5] NCCL INFO Channel 21/0 : 5[5] -> 6[6] via P2P/CUMEM
h100-kickoff-llama-70b-chat-20230328-bcf75c4f6-gn6nd:2337:2378 [1] NCCL INFO Channel 17/0 : 1[1] -> 2[2] via P2P/CUMEM
h100-kickoff-llama-70b-chat-20230328-bcf75c4f6-gn6nd:2341:2380 [5] NCCL INFO Channel 22/0 : 5[5] -> 6[6] via P2P/CUMEM
h100-kickoff-llama-70b-chat-20230328-bcf75c4f6-gn6nd:2337:2378 [1] NCCL INFO Channel 18/0 : 1[1] -> 2[2] via P2P/CUMEM
h100-kickoff-llama-70b-chat-20230328-bcf75c4f6-gn6nd:2341:2380 [5] NCCL INFO Channel 23/0 : 5[5] -> 6[6] via P2P/CUMEM
h100-kickoff-llama-70b-chat-20230328-bcf75c4f6-gn6nd:2337:2378 [1] NCCL INFO Channel 19/0 : 1[1] -> 2[2] via P2P/CUMEM
h100-kickoff-llama-70b-chat-20230328-bcf75c4f6-gn6nd:2337:2378 [1] NCCL INFO Channel 20/0 : 1[1] -> 2[2] via P2P/CUMEM
h100-kickoff-llama-70b-chat-20230328-bcf75c4f6-gn6nd:2342:2382 [6] NCCL INFO Channel 00/0 : 6[6] -> 0[0] via P2P/CUMEM
h100-kickoff-llama-70b-chat-20230328-bcf75c4f6-gn6nd:2337:2378 [1] NCCL INFO Channel 21/0 : 1[1] -> 2[2] via P2P/CUMEM
h100-kickoff-llama-70b-chat-20230328-bcf75c4f6-gn6nd:2342:2382 [6] NCCL INFO Channel 01/0 : 6[6] -> 0[0] via P2P/CUMEM
h100-kickoff-llama-70b-chat-20230328-bcf75c4f6-gn6nd:2337:2378 [1] NCCL INFO Channel 22/0 : 1[1] -> 2[2] via P2P/CUMEM
h100-kickoff-llama-70b-chat-20230328-bcf75c4f6-gn6nd:2342:2382 [6] NCCL INFO Channel 02/0 : 6[6] -> 0[0] via P2P/CUMEM
h100-kickoff-llama-70b-chat-20230328-bcf75c4f6-gn6nd:2337:2378 [1] NCCL INFO Channel 23/0 : 1[1] -> 2[2] via P2P/CUMEM
h100-kickoff-llama-70b-chat-20230328-bcf75c4f6-gn6nd:2342:2382 [6] NCCL INFO Channel 03/0 : 6[6] -> 0[0] via P2P/CUMEM
h100-kickoff-llama-70b-chat-20230328-bcf75c4f6-gn6nd:2338:2379 [2] NCCL INFO Channel 00/0 : 2[2] -> 4[4] via P2P/CUMEM
h100-kickoff-llama-70b-chat-20230328-bcf75c4f6-gn6nd:2342:2382 [6] NCCL INFO Channel 04/0 : 6[6] -> 0[0] via P2P/CUMEM
h100-kickoff-llama-70b-chat-20230328-bcf75c4f6-gn6nd:2338:2379 [2] NCCL INFO Channel 01/0 : 2[2] -> 4[4] via P2P/CUMEM
h100-kickoff-llama-70b-chat-20230328-bcf75c4f6-gn6nd:2342:2382 [6] NCCL INFO Channel 05/0 : 6[6] -> 0[0] via P2P/CUMEM
h100-kickoff-llama-70b-chat-20230328-bcf75c4f6-gn6nd:2338:2379 [2] NCCL INFO Channel 02/0 : 2[2] -> 4[4] via P2P/CUMEM
h100-kickoff-llama-70b-chat-20230328-bcf75c4f6-gn6nd:2342:2382 [6] NCCL INFO Channel 06/0 : 6[6] -> 0[0] via P2P/CUMEM
h100-kickoff-llama-70b-chat-20230328-bcf75c4f6-gn6nd:2338:2379 [2] NCCL INFO Channel 03/0 : 2[2] -> 4[4] via P2P/CUMEM
h100-kickoff-llama-70b-chat-20230328-bcf75c4f6-gn6nd:2342:2382 [6] NCCL INFO Channel 07/0 : 6[6] -> 0[0] via P2P/CUMEM
h100-kickoff-llama-70b-chat-20230328-bcf75c4f6-gn6nd:2338:2379 [2] NCCL INFO Channel 04/0 : 2[2] -> 4[4] via P2P/CUMEM
h100-kickoff-llama-70b-chat-20230328-bcf75c4f6-gn6nd:2342:2382 [6] NCCL INFO Channel 08/0 : 6[6] -> 0[0] via P2P/CUMEM
h100-kickoff-llama-70b-chat-20230328-bcf75c4f6-gn6nd:2338:2379 [2] NCCL INFO Channel 05/0 : 2[2] -> 4[4] via P2P/CUMEM
h100-kickoff-llama-70b-chat-20230328-bcf75c4f6-gn6nd:2342:2382 [6] NCCL INFO Channel 09/0 : 6[6] -> 0[0] via P2P/CUMEM
h100-kickoff-llama-70b-chat-20230328-bcf75c4f6-gn6nd:2338:2379 [2] NCCL INFO Channel 06/0 : 2[2] -> 4[4] via P2P/CUMEM
h100-kickoff-llama-70b-chat-20230328-bcf75c4f6-gn6nd:2342:2382 [6] NCCL INFO Channel 10/0 : 6[6] -> 0[0] via P2P/CUMEM
h100-kickoff-llama-70b-chat-20230328-bcf75c4f6-gn6nd:2338:2379 [2] NCCL INFO Channel 07/0 : 2[2] -> 4[4] via P2P/CUMEM
h100-kickoff-llama-70b-chat-20230328-bcf75c4f6-gn6nd:2342:2382 [6] NCCL INFO Channel 11/0 : 6[6] -> 0[0] via P2P/CUMEM
h100-kickoff-llama-70b-chat-20230328-bcf75c4f6-gn6nd:2338:2379 [2] NCCL INFO Channel 08/0 : 2[2] -> 4[4] via P2P/CUMEM
h100-kickoff-llama-70b-chat-20230328-bcf75c4f6-gn6nd:2342:2382 [6] NCCL INFO Channel 12/0 : 6[6] -> 0[0] via P2P/CUMEM
h100-kickoff-llama-70b-chat-20230328-bcf75c4f6-gn6nd:2338:2379 [2] NCCL INFO Channel 09/0 : 2[2] -> 4[4] via P2P/CUMEM
h100-kickoff-llama-70b-chat-20230328-bcf75c4f6-gn6nd:2342:2382 [6] NCCL INFO Channel 13/0 : 6[6] -> 0[0] via P2P/CUMEM
h100-kickoff-llama-70b-chat-20230328-bcf75c4f6-gn6nd:2338:2379 [2] NCCL INFO Channel 10/0 : 2[2] -> 4[4] via P2P/CUMEM
h100-kickoff-llama-70b-chat-20230328-bcf75c4f6-gn6nd:2342:2382 [6] NCCL INFO Channel 14/0 : 6[6] -> 0[0] via P2P/CUMEM
h100-kickoff-llama-70b-chat-20230328-bcf75c4f6-gn6nd:2338:2379 [2] NCCL INFO Channel 11/0 : 2[2] -> 4[4] via P2P/CUMEM
h100-kickoff-llama-70b-chat-20230328-bcf75c4f6-gn6nd:2342:2382 [6] NCCL INFO Channel 15/0 : 6[6] -> 0[0] via P2P/CUMEM
h100-kickoff-llama-70b-chat-20230328-bcf75c4f6-gn6nd:2338:2379 [2] NCCL INFO Channel 12/0 : 2[2] -> 4[4] via P2P/CUMEM
h100-kickoff-llama-70b-chat-20230328-bcf75c4f6-gn6nd:2342:2382 [6] NCCL INFO Channel 16/0 : 6[6] -> 0[0] via P2P/CUMEM
h100-kickoff-llama-70b-chat-20230328-bcf75c4f6-gn6nd:2338:2379 [2] NCCL INFO Channel 13/0 : 2[2] -> 4[4] via P2P/CUMEM
h100-kickoff-llama-70b-chat-20230328-bcf75c4f6-gn6nd:2342:2382 [6] NCCL INFO Channel 17/0 : 6[6] -> 0[0] via P2P/CUMEM
h100-kickoff-llama-70b-chat-20230328-bcf75c4f6-gn6nd:2338:2379 [2] NCCL INFO Channel 14/0 : 2[2] -> 4[4] via P2P/CUMEM
h100-kickoff-llama-70b-chat-20230328-bcf75c4f6-gn6nd:2342:2382 [6] NCCL INFO Channel 18/0 : 6[6] -> 0[0] via P2P/CUMEM
h100-kickoff-llama-70b-chat-20230328-bcf75c4f6-gn6nd:2338:2379 [2] NCCL INFO Channel 15/0 : 2[2] -> 4[4] via P2P/CUMEM
h100-kickoff-llama-70b-chat-20230328-bcf75c4f6-gn6nd:2342:2382 [6] NCCL INFO Channel 19/0 : 6[6] -> 0[0] via P2P/CUMEM
h100-kickoff-llama-70b-chat-20230328-bcf75c4f6-gn6nd:2338:2379 [2] NCCL INFO Channel 16/0 : 2[2] -> 4[4] via P2P/CUMEM
h100-kickoff-llama-70b-chat-20230328-bcf75c4f6-gn6nd:2342:2382 [6] NCCL INFO Channel 20/0 : 6[6] -> 0[0] via P2P/CUMEM
h100-kickoff-llama-70b-chat-20230328-bcf75c4f6-gn6nd:2338:2379 [2] NCCL INFO Channel 17/0 : 2[2] -> 4[4] via P2P/CUMEM
h100-kickoff-llama-70b-chat-20230328-bcf75c4f6-gn6nd:2342:2382 [6] NCCL INFO Channel 21/0 : 6[6] -> 0[0] via P2P/CUMEM
h100-kickoff-llama-70b-chat-20230328-bcf75c4f6-gn6nd:2338:2379 [2] NCCL INFO Channel 18/0 : 2[2] -> 4[4] via P2P/CUMEM
h100-kickoff-llama-70b-chat-20230328-bcf75c4f6-gn6nd:2342:2382 [6] NCCL INFO Channel 22/0 : 6[6] -> 0[0] via P2P/CUMEM
h100-kickoff-llama-70b-chat-20230328-bcf75c4f6-gn6nd:2338:2379 [2] NCCL INFO Channel 19/0 : 2[2] -> 4[4] via P2P/CUMEM
h100-kickoff-llama-70b-chat-20230328-bcf75c4f6-gn6nd:2342:2382 [6] NCCL INFO Channel 23/0 : 6[6] -> 0[0] via P2P/CUMEM
h100-kickoff-llama-70b-chat-20230328-bcf75c4f6-gn6nd:2338:2379 [2] NCCL INFO Channel 20/0 : 2[2] -> 4[4] via P2P/CUMEM
h100-kickoff-llama-70b-chat-20230328-bcf75c4f6-gn6nd:2336:2377 [0] NCCL INFO Channel 00/0 : 0[0] -> 3[3] via P2P/CUMEM
h100-kickoff-llama-70b-chat-20230328-bcf75c4f6-gn6nd:2338:2379 [2] NCCL INFO Channel 21/0 : 2[2] -> 4[4] via P2P/CUMEM
h100-kickoff-llama-70b-chat-20230328-bcf75c4f6-gn6nd:2336:2377 [0] NCCL INFO Channel 01/0 : 0[0] -> 3[3] via P2P/CUMEM
h100-kickoff-llama-70b-chat-20230328-bcf75c4f6-gn6nd:2338:2379 [2] NCCL INFO Channel 22/0 : 2[2] -> 4[4] via P2P/CUMEM
h100-kickoff-llama-70b-chat-20230328-bcf75c4f6-gn6nd:2336:2377 [0] NCCL INFO Channel 02/0 : 0[0] -> 3[3] via P2P/CUMEM
h100-kickoff-llama-70b-chat-20230328-bcf75c4f6-gn6nd:2338:2379 [2] NCCL INFO Channel 23/0 : 2[2] -> 4[4] via P2P/CUMEM
h100-kickoff-llama-70b-chat-20230328-bcf75c4f6-gn6nd:2336:2377 [0] NCCL INFO Channel 03/0 : 0[0] -> 3[3] via P2P/CUMEM
h100-kickoff-llama-70b-chat-20230328-bcf75c4f6-gn6nd:2340:2384 [4] NCCL INFO Channel 00/0 : 4[4] -> 7[7] via P2P/CUMEM
h100-kickoff-llama-70b-chat-20230328-bcf75c4f6-gn6nd:2336:2377 [0] NCCL INFO Channel 04/0 : 0[0] -> 3[3] via P2P/CUMEM
h100-kickoff-llama-70b-chat-20230328-bcf75c4f6-gn6nd:2340:2384 [4] NCCL INFO Channel 01/0 : 4[4] -> 7[7] via P2P/CUMEM
h100-kickoff-llama-70b-chat-20230328-bcf75c4f6-gn6nd:2336:2377 [0] NCCL INFO Channel 05/0 : 0[0] -> 3[3] via P2P/CUMEM
h100-kickoff-llama-70b-chat-20230328-bcf75c4f6-gn6nd:2340:2384 [4] NCCL INFO Channel 02/0 : 4[4] -> 7[7] via P2P/CUMEM
h100-kickoff-llama-70b-chat-20230328-bcf75c4f6-gn6nd:2336:2377 [0] NCCL INFO Channel 06/0 : 0[0] -> 3[3] via P2P/CUMEM
h100-kickoff-llama-70b-chat-20230328-bcf75c4f6-gn6nd:2340:2384 [4] NCCL INFO Channel 03/0 : 4[4] -> 7[7] via P2P/CUMEM
h100-kickoff-llama-70b-chat-20230328-bcf75c4f6-gn6nd:2336:2377 [0] NCCL INFO Channel 07/0 : 0[0] -> 3[3] via P2P/CUMEM
h100-kickoff-llama-70b-chat-20230328-bcf75c4f6-gn6nd:2340:2384 [4] NCCL INFO Channel 04/0 : 4[4] -> 7[7] via P2P/CUMEM
h100-kickoff-llama-70b-chat-20230328-bcf75c4f6-gn6nd:2336:2377 [0] NCCL INFO Channel 08/0 : 0[0] -> 3[3] via P2P/CUMEM
h100-kickoff-llama-70b-chat-20230328-bcf75c4f6-gn6nd:2340:2384 [4] NCCL INFO Channel 05/0 : 4[4] -> 7[7] via P2P/CUMEM
h100-kickoff-llama-70b-chat-20230328-bcf75c4f6-gn6nd:2336:2377 [0] NCCL INFO Channel 09/0 : 0[0] -> 3[3] via P2P/CUMEM
h100-kickoff-llama-70b-chat-20230328-bcf75c4f6-gn6nd:2340:2384 [4] NCCL INFO Channel 06/0 : 4[4] -> 7[7] via P2P/CUMEM
h100-kickoff-llama-70b-chat-20230328-bcf75c4f6-gn6nd:2336:2377 [0] NCCL INFO Channel 10/0 : 0[0] -> 3[3] via P2P/CUMEM
h100-kickoff-llama-70b-chat-20230328-bcf75c4f6-gn6nd:2340:2384 [4] NCCL INFO Channel 07/0 : 4[4] -> 7[7] via P2P/CUMEM
h100-kickoff-llama-70b-chat-20230328-bcf75c4f6-gn6nd:2336:2377 [0] NCCL INFO Channel 11/0 : 0[0] -> 3[3] via P2P/CUMEM
h100-kickoff-llama-70b-chat-20230328-bcf75c4f6-gn6nd:2340:2384 [4] NCCL INFO Channel 08/0 : 4[4] -> 7[7] via P2P/CUMEM
h100-kickoff-llama-70b-chat-20230328-bcf75c4f6-gn6nd:2336:2377 [0] NCCL INFO Channel 12/0 : 0[0] -> 3[3] via P2P/CUMEM
h100-kickoff-llama-70b-chat-20230328-bcf75c4f6-gn6nd:2340:2384 [4] NCCL INFO Channel 09/0 : 4[4] -> 7[7] via P2P/CUMEM
h100-kickoff-llama-70b-chat-20230328-bcf75c4f6-gn6nd:2336:2377 [0] NCCL INFO Channel 13/0 : 0[0] -> 3[3] via P2P/CUMEM
h100-kickoff-llama-70b-chat-20230328-bcf75c4f6-gn6nd:2340:2384 [4] NCCL INFO Channel 10/0 : 4[4] -> 7[7] via P2P/CUMEM
h100-kickoff-llama-70b-chat-20230328-bcf75c4f6-gn6nd:2336:2377 [0] NCCL INFO Channel 14/0 : 0[0] -> 3[3] via P2P/CUMEM
h100-kickoff-llama-70b-chat-20230328-bcf75c4f6-gn6nd:2340:2384 [4] NCCL INFO Channel 11/0 : 4[4] -> 7[7] via P2P/CUMEM
h100-kickoff-llama-70b-chat-20230328-bcf75c4f6-gn6nd:2336:2377 [0] NCCL INFO Channel 15/0 : 0[0] -> 3[3] via P2P/CUMEM
h100-kickoff-llama-70b-chat-20230328-bcf75c4f6-gn6nd:2340:2384 [4] NCCL INFO Channel 12/0 : 4[4] -> 7[7] via P2P/CUMEM
h100-kickoff-llama-70b-chat-20230328-bcf75c4f6-gn6nd:2336:2377 [0] NCCL INFO Channel 16/0 : 0[0] -> 3[3] via P2P/CUMEM
h100-kickoff-llama-70b-chat-20230328-bcf75c4f6-gn6nd:2340:2384 [4] NCCL INFO Channel 13/0 : 4[4] -> 7[7] via P2P/CUMEM
h100-kickoff-llama-70b-chat-20230328-bcf75c4f6-gn6nd:2336:2377 [0] NCCL INFO Channel 17/0 : 0[0] -> 3[3] via P2P/CUMEM
h100-kickoff-llama-70b-chat-20230328-bcf75c4f6-gn6nd:2340:2384 [4] NCCL INFO Channel 14/0 : 4[4] -> 7[7] via P2P/CUMEM
h100-kickoff-llama-70b-chat-20230328-bcf75c4f6-gn6nd:2336:2377 [0] NCCL INFO Channel 18/0 : 0[0] -> 3[3] via P2P/CUMEM
h100-kickoff-llama-70b-chat-20230328-bcf75c4f6-gn6nd:2340:2384 [4] NCCL INFO Channel 15/0 : 4[4] -> 7[7] via P2P/CUMEM
h100-kickoff-llama-70b-chat-20230328-bcf75c4f6-gn6nd:2336:2377 [0] NCCL INFO Channel 19/0 : 0[0] -> 3[3] via P2P/CUMEM
h100-kickoff-llama-70b-chat-20230328-bcf75c4f6-gn6nd:2340:2384 [4] NCCL INFO Channel 16/0 : 4[4] -> 7[7] via P2P/CUMEM
h100-kickoff-llama-70b-chat-20230328-bcf75c4f6-gn6nd:2340:2384 [4] NCCL INFO Channel 17/0 : 4[4] -> 7[7] via P2P/CUMEM
h100-kickoff-llama-70b-chat-20230328-bcf75c4f6-gn6nd:2336:2377 [0] NCCL INFO Channel 20/0 : 0[0] -> 3[3] via P2P/CUMEM
h100-kickoff-llama-70b-chat-20230328-bcf75c4f6-gn6nd:2340:2384 [4] NCCL INFO Channel 18/0 : 4[4] -> 7[7] via P2P/CUMEM
h100-kickoff-llama-70b-chat-20230328-bcf75c4f6-gn6nd:2336:2377 [0] NCCL INFO Channel 21/0 : 0[0] -> 3[3] via P2P/CUMEM
h100-kickoff-llama-70b-chat-20230328-bcf75c4f6-gn6nd:2340:2384 [4] NCCL INFO Channel 19/0 : 4[4] -> 7[7] via P2P/CUMEM
h100-kickoff-llama-70b-chat-20230328-bcf75c4f6-gn6nd:2336:2377 [0] NCCL INFO Channel 22/0 : 0[0] -> 3[3] via P2P/CUMEM
h100-kickoff-llama-70b-chat-20230328-bcf75c4f6-gn6nd:2340:2384 [4] NCCL INFO Channel 20/0 : 4[4] -> 7[7] via P2P/CUMEM
h100-kickoff-llama-70b-chat-20230328-bcf75c4f6-gn6nd:2336:2377 [0] NCCL INFO Channel 23/0 : 0[0] -> 3[3] via P2P/CUMEM
h100-kickoff-llama-70b-chat-20230328-bcf75c4f6-gn6nd:2340:2384 [4] NCCL INFO Channel 21/0 : 4[4] -> 7[7] via P2P/CUMEM
h100-kickoff-llama-70b-chat-20230328-bcf75c4f6-gn6nd:2339:2383 [3] NCCL INFO Channel 00/0 : 3[3] -> 1[1] via P2P/CUMEM
h100-kickoff-llama-70b-chat-20230328-bcf75c4f6-gn6nd:2340:2384 [4] NCCL INFO Channel 22/0 : 4[4] -> 7[7] via P2P/CUMEM
h100-kickoff-llama-70b-chat-20230328-bcf75c4f6-gn6nd:2339:2383 [3] NCCL INFO Channel 01/0 : 3[3] -> 1[1] via P2P/CUMEM
h100-kickoff-llama-70b-chat-20230328-bcf75c4f6-gn6nd:2340:2384 [4] NCCL INFO Channel 23/0 : 4[4] -> 7[7] via P2P/CUMEM
h100-kickoff-llama-70b-chat-20230328-bcf75c4f6-gn6nd:2339:2383 [3] NCCL INFO Channel 02/0 : 3[3] -> 1[1] via P2P/CUMEM
h100-kickoff-llama-70b-chat-20230328-bcf75c4f6-gn6nd:2339:2383 [3] NCCL INFO Channel 03/0 : 3[3] -> 1[1] via P2P/CUMEM
h100-kickoff-llama-70b-chat-20230328-bcf75c4f6-gn6nd:2343:2381 [7] NCCL INFO Channel 00/0 : 7[7] -> 5[5] via P2P/CUMEM
h100-kickoff-llama-70b-chat-20230328-bcf75c4f6-gn6nd:2339:2383 [3] NCCL INFO Channel 04/0 : 3[3] -> 1[1] via P2P/CUMEM
h100-kickoff-llama-70b-chat-20230328-bcf75c4f6-gn6nd:2343:2381 [7] NCCL INFO Channel 01/0 : 7[7] -> 5[5] via P2P/CUMEM
h100-kickoff-llama-70b-chat-20230328-bcf75c4f6-gn6nd:2339:2383 [3] NCCL INFO Channel 05/0 : 3[3] -> 1[1] via P2P/CUMEM
h100-kickoff-llama-70b-chat-20230328-bcf75c4f6-gn6nd:2343:2381 [7] NCCL INFO Channel 02/0 : 7[7] -> 5[5] via P2P/CUMEM
h100-kickoff-llama-70b-chat-20230328-bcf75c4f6-gn6nd:2339:2383 [3] NCCL INFO Channel 06/0 : 3[3] -> 1[1] via P2P/CUMEM
h100-kickoff-llama-70b-chat-20230328-bcf75c4f6-gn6nd:2343:2381 [7] NCCL INFO Channel 03/0 : 7[7] -> 5[5] via P2P/CUMEM
h100-kickoff-llama-70b-chat-20230328-bcf75c4f6-gn6nd:2339:2383 [3] NCCL INFO Channel 07/0 : 3[3] -> 1[1] via P2P/CUMEM
h100-kickoff-llama-70b-chat-20230328-bcf75c4f6-gn6nd:2339:2383 [3] NCCL INFO Channel 08/0 : 3[3] -> 1[1] via P2P/CUMEM
h100-kickoff-llama-70b-chat-20230328-bcf75c4f6-gn6nd:2339:2383 [3] NCCL INFO Channel 09/0 : 3[3] -> 1[1] via P2P/CUMEM
h100-kickoff-llama-70b-chat-20230328-bcf75c4f6-gn6nd:2339:2383 [3] NCCL INFO Channel 10/0 : 3[3] -> 1[1] via P2P/CUMEM
h100-kickoff-llama-70b-chat-20230328-bcf75c4f6-gn6nd:2339:2383 [3] NCCL INFO Channel 11/0 : 3[3] -> 1[1] via P2P/CUMEM
h100-kickoff-llama-70b-chat-20230328-bcf75c4f6-gn6nd:2339:2383 [3] NCCL INFO Channel 12/0 : 3[3] -> 1[1] via P2P/CUMEM
h100-kickoff-llama-70b-chat-20230328-bcf75c4f6-gn6nd:2343:2381 [7] NCCL INFO Channel 04/0 : 7[7] -> 5[5] via P2P/CUMEM
h100-kickoff-llama-70b-chat-20230328-bcf75c4f6-gn6nd:2339:2383 [3] NCCL INFO Channel 13/0 : 3[3] -> 1[1] via P2P/CUMEM
h100-kickoff-llama-70b-chat-20230328-bcf75c4f6-gn6nd:2343:2381 [7] NCCL INFO Channel 05/0 : 7[7] -> 5[5] via P2P/CUMEM
h100-kickoff-llama-70b-chat-20230328-bcf75c4f6-gn6nd:2339:2383 [3] NCCL INFO Channel 14/0 : 3[3] -> 1[1] via P2P/CUMEM
h100-kickoff-llama-70b-chat-20230328-bcf75c4f6-gn6nd:2343:2381 [7] NCCL INFO Channel 06/0 : 7[7] -> 5[5] via P2P/CUMEM
h100-kickoff-llama-70b-chat-20230328-bcf75c4f6-gn6nd:2339:2383 [3] NCCL INFO Channel 15/0 : 3[3] -> 1[1] via P2P/CUMEM
h100-kickoff-llama-70b-chat-20230328-bcf75c4f6-gn6nd:2343:2381 [7] NCCL INFO Channel 07/0 : 7[7] -> 5[5] via P2P/CUMEM
h100-kickoff-llama-70b-chat-20230328-bcf75c4f6-gn6nd:2339:2383 [3] NCCL INFO Channel 16/0 : 3[3] -> 1[1] via P2P/CUMEM
h100-kickoff-llama-70b-chat-20230328-bcf75c4f6-gn6nd:2343:2381 [7] NCCL INFO Channel 08/0 : 7[7] -> 5[5] via P2P/CUMEM
h100-kickoff-llama-70b-chat-20230328-bcf75c4f6-gn6nd:2339:2383 [3] NCCL INFO Channel 17/0 : 3[3] -> 1[1] via P2P/CUMEM
h100-kickoff-llama-70b-chat-20230328-bcf75c4f6-gn6nd:2343:2381 [7] NCCL INFO Channel 09/0 : 7[7] -> 5[5] via P2P/CUMEM
h100-kickoff-llama-70b-chat-20230328-bcf75c4f6-gn6nd:2343:2381 [7] NCCL INFO Channel 10/0 : 7[7] -> 5[5] via P2P/CUMEM
h100-kickoff-llama-70b-chat-20230328-bcf75c4f6-gn6nd:2339:2383 [3] NCCL INFO Channel 18/0 : 3[3] -> 1[1] via P2P/CUMEM
h100-kickoff-llama-70b-chat-20230328-bcf75c4f6-gn6nd:2343:2381 [7] NCCL INFO Channel 11/0 : 7[7] -> 5[5] via P2P/CUMEM
h100-kickoff-llama-70b-chat-20230328-bcf75c4f6-gn6nd:2343:2381 [7] NCCL INFO Channel 12/0 : 7[7] -> 5[5] via P2P/CUMEM
h100-kickoff-llama-70b-chat-20230328-bcf75c4f6-gn6nd:2339:2383 [3] NCCL INFO Channel 19/0 : 3[3] -> 1[1] via P2P/CUMEM
h100-kickoff-llama-70b-chat-20230328-bcf75c4f6-gn6nd:2343:2381 [7] NCCL INFO Channel 13/0 : 7[7] -> 5[5] via P2P/CUMEM
h100-kickoff-llama-70b-chat-20230328-bcf75c4f6-gn6nd:2339:2383 [3] NCCL INFO Channel 20/0 : 3[3] -> 1[1] via P2P/CUMEM
h100-kickoff-llama-70b-chat-20230328-bcf75c4f6-gn6nd:2343:2381 [7] NCCL INFO Channel 14/0 : 7[7] -> 5[5] via P2P/CUMEM
h100-kickoff-llama-70b-chat-20230328-bcf75c4f6-gn6nd:2339:2383 [3] NCCL INFO Channel 21/0 : 3[3] -> 1[1] via P2P/CUMEM
h100-kickoff-llama-70b-chat-20230328-bcf75c4f6-gn6nd:2343:2381 [7] NCCL INFO Channel 15/0 : 7[7] -> 5[5] via P2P/CUMEM
h100-kickoff-llama-70b-chat-20230328-bcf75c4f6-gn6nd:2343:2381 [7] NCCL INFO Channel 16/0 : 7[7] -> 5[5] via P2P/CUMEM
h100-kickoff-llama-70b-chat-20230328-bcf75c4f6-gn6nd:2343:2381 [7] NCCL INFO Channel 17/0 : 7[7] -> 5[5] via P2P/CUMEM
h100-kickoff-llama-70b-chat-20230328-bcf75c4f6-gn6nd:2343:2381 [7] NCCL INFO Channel 18/0 : 7[7] -> 5[5] via P2P/CUMEM
h100-kickoff-llama-70b-chat-20230328-bcf75c4f6-gn6nd:2343:2381 [7] NCCL INFO Channel 19/0 : 7[7] -> 5[5] via P2P/CUMEM
h100-kickoff-llama-70b-chat-20230328-bcf75c4f6-gn6nd:2343:2381 [7] NCCL INFO Channel 20/0 : 7[7] -> 5[5] via P2P/CUMEM
h100-kickoff-llama-70b-chat-20230328-bcf75c4f6-gn6nd:2339:2383 [3] NCCL INFO Channel 22/0 : 3[3] -> 1[1] via P2P/CUMEM
h100-kickoff-llama-70b-chat-20230328-bcf75c4f6-gn6nd:2343:2381 [7] NCCL INFO Channel 21/0 : 7[7] -> 5[5] via P2P/CUMEM
h100-kickoff-llama-70b-chat-20230328-bcf75c4f6-gn6nd:2339:2383 [3] NCCL INFO Channel 23/0 : 3[3] -> 1[1] via P2P/CUMEM
h100-kickoff-llama-70b-chat-20230328-bcf75c4f6-gn6nd:2343:2381 [7] NCCL INFO Channel 22/0 : 7[7] -> 5[5] via P2P/CUMEM
h100-kickoff-llama-70b-chat-20230328-bcf75c4f6-gn6nd:2343:2381 [7] NCCL INFO Channel 23/0 : 7[7] -> 5[5] via P2P/CUMEM
h100-kickoff-llama-70b-chat-20230328-bcf75c4f6-gn6nd:2338:2379 [2] NCCL INFO Connected all rings
h100-kickoff-llama-70b-chat-20230328-bcf75c4f6-gn6nd:2340:2384 [4] NCCL INFO Connected all rings
h100-kickoff-llama-70b-chat-20230328-bcf75c4f6-gn6nd:2343:2381 [7] NCCL INFO Connected all rings
h100-kickoff-llama-70b-chat-20230328-bcf75c4f6-gn6nd:2341:2380 [5] NCCL INFO Connected all rings
h100-kickoff-llama-70b-chat-20230328-bcf75c4f6-gn6nd:2342:2382 [6] NCCL INFO Connected all rings
h100-kickoff-llama-70b-chat-20230328-bcf75c4f6-gn6nd:2341:2380 [5] NCCL INFO Channel 00/0 : 5[5] -> 7[7] via P2P/CUMEM
h100-kickoff-llama-70b-chat-20230328-bcf75c4f6-gn6nd:2342:2382 [6] NCCL INFO Channel 00/0 : 6[6] -> 5[5] via P2P/CUMEM
h100-kickoff-llama-70b-chat-20230328-bcf75c4f6-gn6nd:2336:2377 [0] NCCL INFO Connected all rings
h100-kickoff-llama-70b-chat-20230328-bcf75c4f6-gn6nd:2339:2383 [3] NCCL INFO Connected all rings
h100-kickoff-llama-70b-chat-20230328-bcf75c4f6-gn6nd:2337:2378 [1] NCCL INFO Connected all rings
h100-kickoff-llama-70b-chat-20230328-bcf75c4f6-gn6nd:2337:2378 [1] NCCL INFO Channel 00/0 : 1[1] -> 3[3] via P2P/CUMEM
h100-kickoff-llama-70b-chat-20230328-bcf75c4f6-gn6nd:2341:2380 [5] NCCL INFO Channel 01/0 : 5[5] -> 7[7] via P2P/CUMEM
h100-kickoff-llama-70b-chat-20230328-bcf75c4f6-gn6nd:2342:2382 [6] NCCL INFO Channel 01/0 : 6[6] -> 5[5] via P2P/CUMEM
h100-kickoff-llama-70b-chat-20230328-bcf75c4f6-gn6nd:2337:2378 [1] NCCL INFO Channel 01/0 : 1[1] -> 3[3] via P2P/CUMEM
h100-kickoff-llama-70b-chat-20230328-bcf75c4f6-gn6nd:2341:2380 [5] NCCL INFO Channel 02/0 : 5[5] -> 7[7] via P2P/CUMEM
h100-kickoff-llama-70b-chat-20230328-bcf75c4f6-gn6nd:2342:2382 [6] NCCL INFO Channel 02/0 : 6[6] -> 5[5] via P2P/CUMEM
h100-kickoff-llama-70b-chat-20230328-bcf75c4f6-gn6nd:2337:2378 [1] NCCL INFO Channel 02/0 : 1[1] -> 3[3] via P2P/CUMEM
h100-kickoff-llama-70b-chat-20230328-bcf75c4f6-gn6nd:2341:2380 [5] NCCL INFO Channel 03/0 : 5[5] -> 7[7] via P2P/CUMEM
h100-kickoff-llama-70b-chat-20230328-bcf75c4f6-gn6nd:2342:2382 [6] NCCL INFO Channel 03/0 : 6[6] -> 5[5] via P2P/CUMEM
h100-kickoff-llama-70b-chat-20230328-bcf75c4f6-gn6nd:2337:2378 [1] NCCL INFO Channel 03/0 : 1[1] -> 3[3] via P2P/CUMEM
h100-kickoff-llama-70b-chat-20230328-bcf75c4f6-gn6nd:2341:2380 [5] NCCL INFO Channel 04/0 : 5[5] -> 7[7] via P2P/CUMEM
h100-kickoff-llama-70b-chat-20230328-bcf75c4f6-gn6nd:2342:2382 [6] NCCL INFO Channel 04/0 : 6[6] -> 5[5] via P2P/CUMEM
h100-kickoff-llama-70b-chat-20230328-bcf75c4f6-gn6nd:2337:2378 [1] NCCL INFO Channel 04/0 : 1[1] -> 3[3] via P2P/CUMEM
h100-kickoff-llama-70b-chat-20230328-bcf75c4f6-gn6nd:2341:2380 [5] NCCL INFO Channel 05/0 : 5[5] -> 7[7] via P2P/CUMEM
h100-kickoff-llama-70b-chat-20230328-bcf75c4f6-gn6nd:2342:2382 [6] NCCL INFO Channel 05/0 : 6[6] -> 5[5] via P2P/CUMEM
h100-kickoff-llama-70b-chat-20230328-bcf75c4f6-gn6nd:2337:2378 [1] NCCL INFO Channel 05/0 : 1[1] -> 3[3] via P2P/CUMEM
h100-kickoff-llama-70b-chat-20230328-bcf75c4f6-gn6nd:2341:2380 [5] NCCL INFO Channel 06/0 : 5[5] -> 7[7] via P2P/CUMEM
h100-kickoff-llama-70b-chat-20230328-bcf75c4f6-gn6nd:2342:2382 [6] NCCL INFO Channel 06/0 : 6[6] -> 5[5] via P2P/CUMEM
h100-kickoff-llama-70b-chat-20230328-bcf75c4f6-gn6nd:2337:2378 [1] NCCL INFO Channel 06/0 : 1[1] -> 3[3] via P2P/CUMEM
h100-kickoff-llama-70b-chat-20230328-bcf75c4f6-gn6nd:2341:2380 [5] NCCL INFO Channel 07/0 : 5[5] -> 7[7] via P2P/CUMEM
h100-kickoff-llama-70b-chat-20230328-bcf75c4f6-gn6nd:2342:2382 [6] NCCL INFO Channel 07/0 : 6[6] -> 5[5] via P2P/CUMEM
h100-kickoff-llama-70b-chat-20230328-bcf75c4f6-gn6nd:2337:2378 [1] NCCL INFO Channel 07/0 : 1[1] -> 3[3] via P2P/CUMEM
h100-kickoff-llama-70b-chat-20230328-bcf75c4f6-gn6nd:2341:2380 [5] NCCL INFO Channel 08/0 : 5[5] -> 7[7] via P2P/CUMEM
h100-kickoff-llama-70b-chat-20230328-bcf75c4f6-gn6nd:2342:2382 [6] NCCL INFO Channel 08/0 : 6[6] -> 5[5] via P2P/CUMEM
h100-kickoff-llama-70b-chat-20230328-bcf75c4f6-gn6nd:2337:2378 [1] NCCL INFO Channel 08/0 : 1[1] -> 3[3] via P2P/CUMEM
h100-kickoff-llama-70b-chat-20230328-bcf75c4f6-gn6nd:2341:2380 [5] NCCL INFO Channel 09/0 : 5[5] -> 7[7] via P2P/CUMEM
h100-kickoff-llama-70b-chat-20230328-bcf75c4f6-gn6nd:2342:2382 [6] NCCL INFO Channel 09/0 : 6[6] -> 5[5] via P2P/CUMEM
h100-kickoff-llama-70b-chat-20230328-bcf75c4f6-gn6nd:2337:2378 [1] NCCL INFO Channel 09/0 : 1[1] -> 3[3] via P2P/CUMEM
h100-kickoff-llama-70b-chat-20230328-bcf75c4f6-gn6nd:2341:2380 [5] NCCL INFO Channel 10/0 : 5[5] -> 7[7] via P2P/CUMEM
h100-kickoff-llama-70b-chat-20230328-bcf75c4f6-gn6nd:2342:2382 [6] NCCL INFO Channel 10/0 : 6[6] -> 5[5] via P2P/CUMEM
h100-kickoff-llama-70b-chat-20230328-bcf75c4f6-gn6nd:2337:2378 [1] NCCL INFO Channel 10/0 : 1[1] -> 3[3] via P2P/CUMEM
h100-kickoff-llama-70b-chat-20230328-bcf75c4f6-gn6nd:2341:2380 [5] NCCL INFO Channel 11/0 : 5[5] -> 7[7] via P2P/CUMEM
h100-kickoff-llama-70b-chat-20230328-bcf75c4f6-gn6nd:2342:2382 [6] NCCL INFO Channel 11/0 : 6[6] -> 5[5] via P2P/CUMEM
h100-kickoff-llama-70b-chat-20230328-bcf75c4f6-gn6nd:2337:2378 [1] NCCL INFO Channel 11/0 : 1[1] -> 3[3] via P2P/CUMEM
h100-kickoff-llama-70b-chat-20230328-bcf75c4f6-gn6nd:2341:2380 [5] NCCL INFO Channel 12/0 : 5[5] -> 7[7] via P2P/CUMEM
h100-kickoff-llama-70b-chat-20230328-bcf75c4f6-gn6nd:2342:2382 [6] NCCL INFO Channel 12/0 : 6[6] -> 5[5] via P2P/CUMEM
h100-kickoff-llama-70b-chat-20230328-bcf75c4f6-gn6nd:2337:2378 [1] NCCL INFO Channel 12/0 : 1[1] -> 3[3] via P2P/CUMEM
h100-kickoff-llama-70b-chat-20230328-bcf75c4f6-gn6nd:2341:2380 [5] NCCL INFO Channel 13/0 : 5[5] -> 7[7] via P2P/CUMEM
h100-kickoff-llama-70b-chat-20230328-bcf75c4f6-gn6nd:2342:2382 [6] NCCL INFO Channel 13/0 : 6[6] -> 5[5] via P2P/CUMEM
h100-kickoff-llama-70b-chat-20230328-bcf75c4f6-gn6nd:2337:2378 [1] NCCL INFO Channel 13/0 : 1[1] -> 3[3] via P2P/CUMEM
h100-kickoff-llama-70b-chat-20230328-bcf75c4f6-gn6nd:2341:2380 [5] NCCL INFO Channel 14/0 : 5[5] -> 7[7] via P2P/CUMEM
h100-kickoff-llama-70b-chat-20230328-bcf75c4f6-gn6nd:2342:2382 [6] NCCL INFO Channel 14/0 : 6[6] -> 5[5] via P2P/CUMEM
h100-kickoff-llama-70b-chat-20230328-bcf75c4f6-gn6nd:2337:2378 [1] NCCL INFO Channel 14/0 : 1[1] -> 3[3] via P2P/CUMEM
h100-kickoff-llama-70b-chat-20230328-bcf75c4f6-gn6nd:2341:2380 [5] NCCL INFO Channel 15/0 : 5[5] -> 7[7] via P2P/CUMEM
h100-kickoff-llama-70b-chat-20230328-bcf75c4f6-gn6nd:2342:2382 [6] NCCL INFO Channel 15/0 : 6[6] -> 5[5] via P2P/CUMEM
h100-kickoff-llama-70b-chat-20230328-bcf75c4f6-gn6nd:2337:2378 [1] NCCL INFO Channel 15/0 : 1[1] -> 3[3] via P2P/CUMEM
h100-kickoff-llama-70b-chat-20230328-bcf75c4f6-gn6nd:2341:2380 [5] NCCL INFO Channel 16/0 : 5[5] -> 7[7] via P2P/CUMEM
h100-kickoff-llama-70b-chat-20230328-bcf75c4f6-gn6nd:2342:2382 [6] NCCL INFO Channel 16/0 : 6[6] -> 5[5] via P2P/CUMEM
h100-kickoff-llama-70b-chat-20230328-bcf75c4f6-gn6nd:2337:2378 [1] NCCL INFO Channel 16/0 : 1[1] -> 3[3] via P2P/CUMEM
h100-kickoff-llama-70b-chat-20230328-bcf75c4f6-gn6nd:2341:2380 [5] NCCL INFO Channel 17/0 : 5[5] -> 7[7] via P2P/CUMEM
h100-kickoff-llama-70b-chat-20230328-bcf75c4f6-gn6nd:2342:2382 [6] NCCL INFO Channel 17/0 : 6[6] -> 5[5] via P2P/CUMEM
h100-kickoff-llama-70b-chat-20230328-bcf75c4f6-gn6nd:2337:2378 [1] NCCL INFO Channel 17/0 : 1[1] -> 3[3] via P2P/CUMEM
h100-kickoff-llama-70b-chat-20230328-bcf75c4f6-gn6nd:2341:2380 [5] NCCL INFO Channel 18/0 : 5[5] -> 7[7] via P2P/CUMEM
h100-kickoff-llama-70b-chat-20230328-bcf75c4f6-gn6nd:2342:2382 [6] NCCL INFO Channel 18/0 : 6[6] -> 5[5] via P2P/CUMEM
h100-kickoff-llama-70b-chat-20230328-bcf75c4f6-gn6nd:2337:2378 [1] NCCL INFO Channel 18/0 : 1[1] -> 3[3] via P2P/CUMEM
h100-kickoff-llama-70b-chat-20230328-bcf75c4f6-gn6nd:2341:2380 [5] NCCL INFO Channel 19/0 : 5[5] -> 7[7] via P2P/CUMEM
h100-kickoff-llama-70b-chat-20230328-bcf75c4f6-gn6nd:2342:2382 [6] NCCL INFO Channel 19/0 : 6[6] -> 5[5] via P2P/CUMEM
h100-kickoff-llama-70b-chat-20230328-bcf75c4f6-gn6nd:2337:2378 [1] NCCL INFO Channel 19/0 : 1[1] -> 3[3] via P2P/CUMEM
h100-kickoff-llama-70b-chat-20230328-bcf75c4f6-gn6nd:2341:2380 [5] NCCL INFO Channel 20/0 : 5[5] -> 7[7] via P2P/CUMEM
h100-kickoff-llama-70b-chat-20230328-bcf75c4f6-gn6nd:2342:2382 [6] NCCL INFO Channel 20/0 : 6[6] -> 5[5] via P2P/CUMEM
h100-kickoff-llama-70b-chat-20230328-bcf75c4f6-gn6nd:2337:2378 [1] NCCL INFO Channel 20/0 : 1[1] -> 3[3] via P2P/CUMEM
h100-kickoff-llama-70b-chat-20230328-bcf75c4f6-gn6nd:2341:2380 [5] NCCL INFO Channel 21/0 : 5[5] -> 7[7] via P2P/CUMEM
h100-kickoff-llama-70b-chat-20230328-bcf75c4f6-gn6nd:2342:2382 [6] NCCL INFO Channel 21/0 : 6[6] -> 5[5] via P2P/CUMEM
h100-kickoff-llama-70b-chat-20230328-bcf75c4f6-gn6nd:2337:2378 [1] NCCL INFO Channel 21/0 : 1[1] -> 3[3] via P2P/CUMEM
h100-kickoff-llama-70b-chat-20230328-bcf75c4f6-gn6nd:2341:2380 [5] NCCL INFO Channel 22/0 : 5[5] -> 7[7] via P2P/CUMEM
h100-kickoff-llama-70b-chat-20230328-bcf75c4f6-gn6nd:2342:2382 [6] NCCL INFO Channel 22/0 : 6[6] -> 5[5] via P2P/CUMEM
h100-kickoff-llama-70b-chat-20230328-bcf75c4f6-gn6nd:2337:2378 [1] NCCL INFO Channel 22/0 : 1[1] -> 3[3] via P2P/CUMEM
h100-kickoff-llama-70b-chat-20230328-bcf75c4f6-gn6nd:2341:2380 [5] NCCL INFO Channel 23/0 : 5[5] -> 7[7] via P2P/CUMEM
h100-kickoff-llama-70b-chat-20230328-bcf75c4f6-gn6nd:2342:2382 [6] NCCL INFO Channel 23/0 : 6[6] -> 5[5] via P2P/CUMEM
h100-kickoff-llama-70b-chat-20230328-bcf75c4f6-gn6nd:2337:2378 [1] NCCL INFO Channel 23/0 : 1[1] -> 3[3] via P2P/CUMEM
h100-kickoff-llama-70b-chat-20230328-bcf75c4f6-gn6nd:2343:2381 [7] NCCL INFO Channel 00/0 : 7[7] -> 4[4] via P2P/CUMEM
h100-kickoff-llama-70b-chat-20230328-bcf75c4f6-gn6nd:2339:2383 [3] NCCL INFO Channel 00/0 : 3[3] -> 0[0] via P2P/CUMEM
h100-kickoff-llama-70b-chat-20230328-bcf75c4f6-gn6nd:2343:2381 [7] NCCL INFO Channel 01/0 : 7[7] -> 4[4] via P2P/CUMEM
h100-kickoff-llama-70b-chat-20230328-bcf75c4f6-gn6nd:2339:2383 [3] NCCL INFO Channel 01/0 : 3[3] -> 0[0] via P2P/CUMEM
h100-kickoff-llama-70b-chat-20230328-bcf75c4f6-gn6nd:2343:2381 [7] NCCL INFO Channel 02/0 : 7[7] -> 4[4] via P2P/CUMEM
h100-kickoff-llama-70b-chat-20230328-bcf75c4f6-gn6nd:2339:2383 [3] NCCL INFO Channel 02/0 : 3[3] -> 0[0] via P2P/CUMEM
h100-kickoff-llama-70b-chat-20230328-bcf75c4f6-gn6nd:2343:2381 [7] NCCL INFO Channel 03/0 : 7[7] -> 4[4] via P2P/CUMEM
h100-kickoff-llama-70b-chat-20230328-bcf75c4f6-gn6nd:2339:2383 [3] NCCL INFO Channel 03/0 : 3[3] -> 0[0] via P2P/CUMEM
h100-kickoff-llama-70b-chat-20230328-bcf75c4f6-gn6nd:2343:2381 [7] NCCL INFO Channel 04/0 : 7[7] -> 4[4] via P2P/CUMEM
h100-kickoff-llama-70b-chat-20230328-bcf75c4f6-gn6nd:2339:2383 [3] NCCL INFO Channel 04/0 : 3[3] -> 0[0] via P2P/CUMEM
h100-kickoff-llama-70b-chat-20230328-bcf75c4f6-gn6nd:2343:2381 [7] NCCL INFO Channel 05/0 : 7[7] -> 4[4] via P2P/CUMEM
h100-kickoff-llama-70b-chat-20230328-bcf75c4f6-gn6nd:2339:2383 [3] NCCL INFO Channel 05/0 : 3[3] -> 0[0] via P2P/CUMEM
h100-kickoff-llama-70b-chat-20230328-bcf75c4f6-gn6nd:2343:2381 [7] NCCL INFO Channel 06/0 : 7[7] -> 4[4] via P2P/CUMEM
h100-kickoff-llama-70b-chat-20230328-bcf75c4f6-gn6nd:2339:2383 [3] NCCL INFO Channel 06/0 : 3[3] -> 0[0] via P2P/CUMEM
h100-kickoff-llama-70b-chat-20230328-bcf75c4f6-gn6nd:2343:2381 [7] NCCL INFO Channel 07/0 : 7[7] -> 4[4] via P2P/CUMEM
h100-kickoff-llama-70b-chat-20230328-bcf75c4f6-gn6nd:2339:2383 [3] NCCL INFO Channel 07/0 : 3[3] -> 0[0] via P2P/CUMEM
h100-kickoff-llama-70b-chat-20230328-bcf75c4f6-gn6nd:2343:2381 [7] NCCL INFO Channel 08/0 : 7[7] -> 4[4] via P2P/CUMEM
h100-kickoff-llama-70b-chat-20230328-bcf75c4f6-gn6nd:2339:2383 [3] NCCL INFO Channel 08/0 : 3[3] -> 0[0] via P2P/CUMEM
h100-kickoff-llama-70b-chat-20230328-bcf75c4f6-gn6nd:2343:2381 [7] NCCL INFO Channel 09/0 : 7[7] -> 4[4] via P2P/CUMEM
h100-kickoff-llama-70b-chat-20230328-bcf75c4f6-gn6nd:2339:2383 [3] NCCL INFO Channel 09/0 : 3[3] -> 0[0] via P2P/CUMEM
h100-kickoff-llama-70b-chat-20230328-bcf75c4f6-gn6nd:2343:2381 [7] NCCL INFO Channel 10/0 : 7[7] -> 4[4] via P2P/CUMEM
h100-kickoff-llama-70b-chat-20230328-bcf75c4f6-gn6nd:2339:2383 [3] NCCL INFO Channel 10/0 : 3[3] -> 0[0] via P2P/CUMEM
h100-kickoff-llama-70b-chat-20230328-bcf75c4f6-gn6nd:2343:2381 [7] NCCL INFO Channel 11/0 : 7[7] -> 4[4] via P2P/CUMEM
h100-kickoff-llama-70b-chat-20230328-bcf75c4f6-gn6nd:2339:2383 [3] NCCL INFO Channel 11/0 : 3[3] -> 0[0] via P2P/CUMEM
h100-kickoff-llama-70b-chat-20230328-bcf75c4f6-gn6nd:2343:2381 [7] NCCL INFO Channel 12/0 : 7[7] -> 4[4] via P2P/CUMEM
h100-kickoff-llama-70b-chat-20230328-bcf75c4f6-gn6nd:2339:2383 [3] NCCL INFO Channel 12/0 : 3[3] -> 0[0] via P2P/CUMEM
h100-kickoff-llama-70b-chat-20230328-bcf75c4f6-gn6nd:2343:2381 [7] NCCL INFO Channel 13/0 : 7[7] -> 4[4] via P2P/CUMEM
h100-kickoff-llama-70b-chat-20230328-bcf75c4f6-gn6nd:2339:2383 [3] NCCL INFO Channel 13/0 : 3[3] -> 0[0] via P2P/CUMEM
h100-kickoff-llama-70b-chat-20230328-bcf75c4f6-gn6nd:2343:2381 [7] NCCL INFO Channel 14/0 : 7[7] -> 4[4] via P2P/CUMEM
h100-kickoff-llama-70b-chat-20230328-bcf75c4f6-gn6nd:2339:2383 [3] NCCL INFO Channel 14/0 : 3[3] -> 0[0] via P2P/CUMEM
h100-kickoff-llama-70b-chat-20230328-bcf75c4f6-gn6nd:2343:2381 [7] NCCL INFO Channel 15/0 : 7[7] -> 4[4] via P2P/CUMEM
h100-kickoff-llama-70b-chat-20230328-bcf75c4f6-gn6nd:2339:2383 [3] NCCL INFO Channel 15/0 : 3[3] -> 0[0] via P2P/CUMEM
h100-kickoff-llama-70b-chat-20230328-bcf75c4f6-gn6nd:2343:2381 [7] NCCL INFO Channel 16/0 : 7[7] -> 4[4] via P2P/CUMEM
h100-kickoff-llama-70b-chat-20230328-bcf75c4f6-gn6nd:2339:2383 [3] NCCL INFO Channel 16/0 : 3[3] -> 0[0] via P2P/CUMEM
h100-kickoff-llama-70b-chat-20230328-bcf75c4f6-gn6nd:2343:2381 [7] NCCL INFO Channel 17/0 : 7[7] -> 4[4] via P2P/CUMEM
h100-kickoff-llama-70b-chat-20230328-bcf75c4f6-gn6nd:2339:2383 [3] NCCL INFO Channel 17/0 : 3[3] -> 0[0] via P2P/CUMEM
h100-kickoff-llama-70b-chat-20230328-bcf75c4f6-gn6nd:2343:2381 [7] NCCL INFO Channel 18/0 : 7[7] -> 4[4] via P2P/CUMEM
h100-kickoff-llama-70b-chat-20230328-bcf75c4f6-gn6nd:2339:2383 [3] NCCL INFO Channel 18/0 : 3[3] -> 0[0] via P2P/CUMEM
h100-kickoff-llama-70b-chat-20230328-bcf75c4f6-gn6nd:2343:2381 [7] NCCL INFO Channel 19/0 : 7[7] -> 4[4] via P2P/CUMEM
h100-kickoff-llama-70b-chat-20230328-bcf75c4f6-gn6nd:2339:2383 [3] NCCL INFO Channel 19/0 : 3[3] -> 0[0] via P2P/CUMEM
h100-kickoff-llama-70b-chat-20230328-bcf75c4f6-gn6nd:2343:2381 [7] NCCL INFO Channel 20/0 : 7[7] -> 4[4] via P2P/CUMEM
h100-kickoff-llama-70b-chat-20230328-bcf75c4f6-gn6nd:2339:2383 [3] NCCL INFO Channel 20/0 : 3[3] -> 0[0] via P2P/CUMEM
h100-kickoff-llama-70b-chat-20230328-bcf75c4f6-gn6nd:2343:2381 [7] NCCL INFO Channel 21/0 : 7[7] -> 4[4] via P2P/CUMEM
h100-kickoff-llama-70b-chat-20230328-bcf75c4f6-gn6nd:2339:2383 [3] NCCL INFO Channel 21/0 : 3[3] -> 0[0] via P2P/CUMEM
h100-kickoff-llama-70b-chat-20230328-bcf75c4f6-gn6nd:2343:2381 [7] NCCL INFO Channel 22/0 : 7[7] -> 4[4] via P2P/CUMEM
h100-kickoff-llama-70b-chat-20230328-bcf75c4f6-gn6nd:2339:2383 [3] NCCL INFO Channel 22/0 : 3[3] -> 0[0] via P2P/CUMEM
h100-kickoff-llama-70b-chat-20230328-bcf75c4f6-gn6nd:2343:2381 [7] NCCL INFO Channel 23/0 : 7[7] -> 4[4] via P2P/CUMEM
h100-kickoff-llama-70b-chat-20230328-bcf75c4f6-gn6nd:2339:2383 [3] NCCL INFO Channel 23/0 : 3[3] -> 0[0] via P2P/CUMEM
h100-kickoff-llama-70b-chat-20230328-bcf75c4f6-gn6nd:2340:2384 [4] NCCL INFO Channel 00/0 : 4[4] -> 2[2] via P2P/CUMEM
h100-kickoff-llama-70b-chat-20230328-bcf75c4f6-gn6nd:2340:2384 [4] NCCL INFO Channel 01/0 : 4[4] -> 2[2] via P2P/CUMEM
h100-kickoff-llama-70b-chat-20230328-bcf75c4f6-gn6nd:2340:2384 [4] NCCL INFO Channel 02/0 : 4[4] -> 2[2] via P2P/CUMEM
h100-kickoff-llama-70b-chat-20230328-bcf75c4f6-gn6nd:2340:2384 [4] NCCL INFO Channel 03/0 : 4[4] -> 2[2] via P2P/CUMEM
h100-kickoff-llama-70b-chat-20230328-bcf75c4f6-gn6nd:2340:2384 [4] NCCL INFO Channel 04/0 : 4[4] -> 2[2] via P2P/CUMEM
h100-kickoff-llama-70b-chat-20230328-bcf75c4f6-gn6nd:2340:2384 [4] NCCL INFO Channel 05/0 : 4[4] -> 2[2] via P2P/CUMEM
h100-kickoff-llama-70b-chat-20230328-bcf75c4f6-gn6nd:2340:2384 [4] NCCL INFO Channel 06/0 : 4[4] -> 2[2] via P2P/CUMEM
h100-kickoff-llama-70b-chat-20230328-bcf75c4f6-gn6nd:2340:2384 [4] NCCL INFO Channel 07/0 : 4[4] -> 2[2] via P2P/CUMEM
h100-kickoff-llama-70b-chat-20230328-bcf75c4f6-gn6nd:2340:2384 [4] NCCL INFO Channel 08/0 : 4[4] -> 2[2] via P2P/CUMEM
h100-kickoff-llama-70b-chat-20230328-bcf75c4f6-gn6nd:2340:2384 [4] NCCL INFO Channel 09/0 : 4[4] -> 2[2] via P2P/CUMEM
h100-kickoff-llama-70b-chat-20230328-bcf75c4f6-gn6nd:2340:2384 [4] NCCL INFO Channel 10/0 : 4[4] -> 2[2] via P2P/CUMEM
h100-kickoff-llama-70b-chat-20230328-bcf75c4f6-gn6nd:2340:2384 [4] NCCL INFO Channel 11/0 : 4[4] -> 2[2] via P2P/CUMEM
h100-kickoff-llama-70b-chat-20230328-bcf75c4f6-gn6nd:2340:2384 [4] NCCL INFO Channel 12/0 : 4[4] -> 2[2] via P2P/CUMEM
h100-kickoff-llama-70b-chat-20230328-bcf75c4f6-gn6nd:2340:2384 [4] NCCL INFO Channel 13/0 : 4[4] -> 2[2] via P2P/CUMEM
h100-kickoff-llama-70b-chat-20230328-bcf75c4f6-gn6nd:2340:2384 [4] NCCL INFO Channel 14/0 : 4[4] -> 2[2] via P2P/CUMEM
h100-kickoff-llama-70b-chat-20230328-bcf75c4f6-gn6nd:2340:2384 [4] NCCL INFO Channel 15/0 : 4[4] -> 2[2] via P2P/CUMEM
h100-kickoff-llama-70b-chat-20230328-bcf75c4f6-gn6nd:2340:2384 [4] NCCL INFO Channel 16/0 : 4[4] -> 2[2] via P2P/CUMEM
h100-kickoff-llama-70b-chat-20230328-bcf75c4f6-gn6nd:2340:2384 [4] NCCL INFO Channel 17/0 : 4[4] -> 2[2] via P2P/CUMEM
h100-kickoff-llama-70b-chat-20230328-bcf75c4f6-gn6nd:2340:2384 [4] NCCL INFO Channel 18/0 : 4[4] -> 2[2] via P2P/CUMEM
h100-kickoff-llama-70b-chat-20230328-bcf75c4f6-gn6nd:2340:2384 [4] NCCL INFO Channel 19/0 : 4[4] -> 2[2] via P2P/CUMEM
h100-kickoff-llama-70b-chat-20230328-bcf75c4f6-gn6nd:2340:2384 [4] NCCL INFO Channel 20/0 : 4[4] -> 2[2] via P2P/CUMEM
h100-kickoff-llama-70b-chat-20230328-bcf75c4f6-gn6nd:2340:2384 [4] NCCL INFO Channel 21/0 : 4[4] -> 2[2] via P2P/CUMEM
h100-kickoff-llama-70b-chat-20230328-bcf75c4f6-gn6nd:2340:2384 [4] NCCL INFO Channel 22/0 : 4[4] -> 2[2] via P2P/CUMEM
h100-kickoff-llama-70b-chat-20230328-bcf75c4f6-gn6nd:2340:2384 [4] NCCL INFO Channel 23/0 : 4[4] -> 2[2] via P2P/CUMEM
h100-kickoff-llama-70b-chat-20230328-bcf75c4f6-gn6nd:2338:2379 [2] NCCL INFO Channel 00/0 : 2[2] -> 1[1] via P2P/CUMEM
h100-kickoff-llama-70b-chat-20230328-bcf75c4f6-gn6nd:2338:2379 [2] NCCL INFO Channel 01/0 : 2[2] -> 1[1] via P2P/CUMEM
h100-kickoff-llama-70b-chat-20230328-bcf75c4f6-gn6nd:2338:2379 [2] NCCL INFO Channel 02/0 : 2[2] -> 1[1] via P2P/CUMEM
h100-kickoff-llama-70b-chat-20230328-bcf75c4f6-gn6nd:2338:2379 [2] NCCL INFO Channel 03/0 : 2[2] -> 1[1] via P2P/CUMEM
h100-kickoff-llama-70b-chat-20230328-bcf75c4f6-gn6nd:2338:2379 [2] NCCL INFO Channel 04/0 : 2[2] -> 1[1] via P2P/CUMEM
h100-kickoff-llama-70b-chat-20230328-bcf75c4f6-gn6nd:2338:2379 [2] NCCL INFO Channel 05/0 : 2[2] -> 1[1] via P2P/CUMEM
h100-kickoff-llama-70b-chat-20230328-bcf75c4f6-gn6nd:2338:2379 [2] NCCL INFO Channel 06/0 : 2[2] -> 1[1] via P2P/CUMEM
h100-kickoff-llama-70b-chat-20230328-bcf75c4f6-gn6nd:2338:2379 [2] NCCL INFO Channel 07/0 : 2[2] -> 1[1] via P2P/CUMEM
h100-kickoff-llama-70b-chat-20230328-bcf75c4f6-gn6nd:2338:2379 [2] NCCL INFO Channel 08/0 : 2[2] -> 1[1] via P2P/CUMEM
h100-kickoff-llama-70b-chat-20230328-bcf75c4f6-gn6nd:2338:2379 [2] NCCL INFO Channel 09/0 : 2[2] -> 1[1] via P2P/CUMEM
h100-kickoff-llama-70b-chat-20230328-bcf75c4f6-gn6nd:2338:2379 [2] NCCL INFO Channel 10/0 : 2[2] -> 1[1] via P2P/CUMEM
h100-kickoff-llama-70b-chat-20230328-bcf75c4f6-gn6nd:2338:2379 [2] NCCL INFO Channel 11/0 : 2[2] -> 1[1] via P2P/CUMEM
h100-kickoff-llama-70b-chat-20230328-bcf75c4f6-gn6nd:2338:2379 [2] NCCL INFO Channel 12/0 : 2[2] -> 1[1] via P2P/CUMEM
h100-kickoff-llama-70b-chat-20230328-bcf75c4f6-gn6nd:2338:2379 [2] NCCL INFO Channel 13/0 : 2[2] -> 1[1] via P2P/CUMEM
h100-kickoff-llama-70b-chat-20230328-bcf75c4f6-gn6nd:2338:2379 [2] NCCL INFO Channel 14/0 : 2[2] -> 1[1] via P2P/CUMEM
h100-kickoff-llama-70b-chat-20230328-bcf75c4f6-gn6nd:2338:2379 [2] NCCL INFO Channel 15/0 : 2[2] -> 1[1] via P2P/CUMEM
h100-kickoff-llama-70b-chat-20230328-bcf75c4f6-gn6nd:2338:2379 [2] NCCL INFO Channel 16/0 : 2[2] -> 1[1] via P2P/CUMEM
h100-kickoff-llama-70b-chat-20230328-bcf75c4f6-gn6nd:2338:2379 [2] NCCL INFO Channel 17/0 : 2[2] -> 1[1] via P2P/CUMEM
h100-kickoff-llama-70b-chat-20230328-bcf75c4f6-gn6nd:2338:2379 [2] NCCL INFO Channel 18/0 : 2[2] -> 1[1] via P2P/CUMEM
h100-kickoff-llama-70b-chat-20230328-bcf75c4f6-gn6nd:2338:2379 [2] NCCL INFO Channel 19/0 : 2[2] -> 1[1] via P2P/CUMEM
h100-kickoff-llama-70b-chat-20230328-bcf75c4f6-gn6nd:2338:2379 [2] NCCL INFO Channel 20/0 : 2[2] -> 1[1] via P2P/CUMEM
h100-kickoff-llama-70b-chat-20230328-bcf75c4f6-gn6nd:2338:2379 [2] NCCL INFO Channel 21/0 : 2[2] -> 1[1] via P2P/CUMEM
h100-kickoff-llama-70b-chat-20230328-bcf75c4f6-gn6nd:2338:2379 [2] NCCL INFO Channel 22/0 : 2[2] -> 1[1] via P2P/CUMEM
h100-kickoff-llama-70b-chat-20230328-bcf75c4f6-gn6nd:2338:2379 [2] NCCL INFO Channel 23/0 : 2[2] -> 1[1] via P2P/CUMEM
h100-kickoff-llama-70b-chat-20230328-bcf75c4f6-gn6nd:2342:2382 [6] NCCL INFO Connected all trees
h100-kickoff-llama-70b-chat-20230328-bcf75c4f6-gn6nd:2341:2380 [5] NCCL INFO Connected all trees
h100-kickoff-llama-70b-chat-20230328-bcf75c4f6-gn6nd:2336:2377 [0] NCCL INFO Connected all trees
h100-kickoff-llama-70b-chat-20230328-bcf75c4f6-gn6nd:2339:2383 [3] NCCL INFO Connected all trees
h100-kickoff-llama-70b-chat-20230328-bcf75c4f6-gn6nd:2342:2382 [6] NCCL INFO NVLS comm 0x87952f0 headRank 7 nHeads 8 buffSize 4194304 memSize 2097152 nvlsPerRankSize 301989888 nvlsTotalSize 2415919104
h100-kickoff-llama-70b-chat-20230328-bcf75c4f6-gn6nd:2341:2380 [5] NCCL INFO NVLS comm 0x8df0a00 headRank 6 nHeads 8 buffSize 4194304 memSize 2097152 nvlsPerRankSize 301989888 nvlsTotalSize 2415919104
h100-kickoff-llama-70b-chat-20230328-bcf75c4f6-gn6nd:2343:2381 [7] NCCL INFO Connected all trees
h100-kickoff-llama-70b-chat-20230328-bcf75c4f6-gn6nd:2340:2384 [4] NCCL INFO Connected all trees
h100-kickoff-llama-70b-chat-20230328-bcf75c4f6-gn6nd:2338:2379 [2] NCCL INFO Connected all trees
h100-kickoff-llama-70b-chat-20230328-bcf75c4f6-gn6nd:2337:2378 [1] NCCL INFO Connected all trees
h100-kickoff-llama-70b-chat-20230328-bcf75c4f6-gn6nd:2339:2383 [3] NCCL INFO NVLS comm 0x7d01420 headRank 1 nHeads 8 buffSize 4194304 memSize 2097152 nvlsPerRankSize 301989888 nvlsTotalSize 2415919104
h100-kickoff-llama-70b-chat-20230328-bcf75c4f6-gn6nd:2336:2377 [0] NCCL INFO NVLS comm 0x7a20aa0 headRank 0 nHeads 8 buffSize 4194304 memSize 2097152 nvlsPerRankSize 301989888 nvlsTotalSize 2415919104
h100-kickoff-llama-70b-chat-20230328-bcf75c4f6-gn6nd:2340:2384 [4] NCCL INFO NVLS comm 0x8d3a240 headRank 4 nHeads 8 buffSize 4194304 memSize 2097152 nvlsPerRankSize 301989888 nvlsTotalSize 2415919104
h100-kickoff-llama-70b-chat-20230328-bcf75c4f6-gn6nd:2337:2378 [1] NCCL INFO NVLS comm 0x7e3e850 headRank 2 nHeads 8 buffSize 4194304 memSize 2097152 nvlsPerRankSize 301989888 nvlsTotalSize 2415919104
h100-kickoff-llama-70b-chat-20230328-bcf75c4f6-gn6nd:2338:2379 [2] NCCL INFO NVLS comm 0x7c429b0 headRank 3 nHeads 8 buffSize 4194304 memSize 2097152 nvlsPerRankSize 301989888 nvlsTotalSize 2415919104
h100-kickoff-llama-70b-chat-20230328-bcf75c4f6-gn6nd:2343:2381 [7] NCCL INFO NVLS comm 0x7687540 headRank 5 nHeads 8 buffSize 4194304 memSize 2097152 nvlsPerRankSize 301989888 nvlsTotalSize 2415919104
h100-kickoff-llama-70b-chat-20230328-bcf75c4f6-gn6nd:2340:2384 [4] transport/nvls.cc:157 NCCL WARN Cuda failure 1 'invalid argument'
h100-kickoff-llama-70b-chat-20230328-bcf75c4f6-gn6nd:2340:2384 [4] NCCL INFO transport/nvls.cc:328 -> 1
h100-kickoff-llama-70b-chat-20230328-bcf75c4f6-gn6nd:2338:2379 [2] transport/nvls.cc:157 NCCL WARN Cuda failure 1 'invalid argument'
h100-kickoff-llama-70b-chat-20230328-bcf75c4f6-gn6nd:2340:2384 [4] NCCL INFO init.cc:1236 -> 1
h100-kickoff-llama-70b-chat-20230328-bcf75c4f6-gn6nd:2341:2380 [5] transport/nvls.cc:157 NCCL WARN Cuda failure 1 'invalid argument'
h100-kickoff-llama-70b-chat-20230328-bcf75c4f6-gn6nd:2337:2378 [1] transport/nvls.cc:157 NCCL WARN Cuda failure 1 'invalid argument'
h100-kickoff-llama-70b-chat-20230328-bcf75c4f6-gn6nd:2339:2383 [3] transport/nvls.cc:157 NCCL WARN Cuda failure 1 'invalid argument'
h100-kickoff-llama-70b-chat-20230328-bcf75c4f6-gn6nd:2338:2379 [2] NCCL INFO transport/nvls.cc:328 -> 1
h100-kickoff-llama-70b-chat-20230328-bcf75c4f6-gn6nd:2336:2377 [0] transport/nvls.cc:157 NCCL WARN Cuda failure 1 'invalid argument'
h100-kickoff-llama-70b-chat-20230328-bcf75c4f6-gn6nd:2342:2382 [6] transport/nvls.cc:157 NCCL WARN Cuda failure 1 'invalid argument'
h100-kickoff-llama-70b-chat-20230328-bcf75c4f6-gn6nd:2340:2384 [4] NCCL INFO init.cc:1501 -> 1
h100-kickoff-llama-70b-chat-20230328-bcf75c4f6-gn6nd:2341:2380 [5] NCCL INFO transport/nvls.cc:328 -> 1
h100-kickoff-llama-70b-chat-20230328-bcf75c4f6-gn6nd:2343:2381 [7] transport/nvls.cc:157 NCCL WARN Cuda failure 1 'invalid argument'
h100-kickoff-llama-70b-chat-20230328-bcf75c4f6-gn6nd:2337:2378 [1] NCCL INFO transport/nvls.cc:328 -> 1
h100-kickoff-llama-70b-chat-20230328-bcf75c4f6-gn6nd:2339:2383 [3] NCCL INFO transport/nvls.cc:328 -> 1
h100-kickoff-llama-70b-chat-20230328-bcf75c4f6-gn6nd:2338:2379 [2] NCCL INFO init.cc:1236 -> 1
h100-kickoff-llama-70b-chat-20230328-bcf75c4f6-gn6nd:2336:2377 [0] NCCL INFO transport/nvls.cc:328 -> 1
h100-kickoff-llama-70b-chat-20230328-bcf75c4f6-gn6nd:2342:2382 [6] NCCL INFO transport/nvls.cc:328 -> 1
h100-kickoff-llama-70b-chat-20230328-bcf75c4f6-gn6nd:2340:2384 [4] NCCL INFO group.cc:64 -> 1 [Async thread]
h100-kickoff-llama-70b-chat-20230328-bcf75c4f6-gn6nd:2341:2380 [5] NCCL INFO init.cc:1236 -> 1
h100-kickoff-llama-70b-chat-20230328-bcf75c4f6-gn6nd:2338:2379 [2] NCCL INFO init.cc:1501 -> 1
h100-kickoff-llama-70b-chat-20230328-bcf75c4f6-gn6nd:2343:2381 [7] NCCL INFO transport/nvls.cc:328 -> 1
h100-kickoff-llama-70b-chat-20230328-bcf75c4f6-gn6nd:2337:2378 [1] NCCL INFO init.cc:1236 -> 1
h100-kickoff-llama-70b-chat-20230328-bcf75c4f6-gn6nd:2339:2383 [3] NCCL INFO init.cc:1236 -> 1
h100-kickoff-llama-70b-chat-20230328-bcf75c4f6-gn6nd:2336:2377 [0] NCCL INFO init.cc:1236 -> 1
h100-kickoff-llama-70b-chat-20230328-bcf75c4f6-gn6nd:2342:2382 [6] NCCL INFO init.cc:1236 -> 1
h100-kickoff-llama-70b-chat-20230328-bcf75c4f6-gn6nd:2338:2379 [2] NCCL INFO group.cc:64 -> 1 [Async thread]
h100-kickoff-llama-70b-chat-20230328-bcf75c4f6-gn6nd:2343:2381 [7] NCCL INFO init.cc:1236 -> 1
h100-kickoff-llama-70b-chat-20230328-bcf75c4f6-gn6nd:2341:2380 [5] NCCL INFO init.cc:1501 -> 1
h100-kickoff-llama-70b-chat-20230328-bcf75c4f6-gn6nd:2337:2378 [1] NCCL INFO init.cc:1501 -> 1
h100-kickoff-llama-70b-chat-20230328-bcf75c4f6-gn6nd:2339:2383 [3] NCCL INFO init.cc:1501 -> 1
h100-kickoff-llama-70b-chat-20230328-bcf75c4f6-gn6nd:2336:2377 [0] NCCL INFO init.cc:1501 -> 1
h100-kickoff-llama-70b-chat-20230328-bcf75c4f6-gn6nd:2342:2382 [6] NCCL INFO init.cc:1501 -> 1
h100-kickoff-llama-70b-chat-20230328-bcf75c4f6-gn6nd:2339:2383 [3] NCCL INFO group.cc:64 -> 1 [Async thread]
h100-kickoff-llama-70b-chat-20230328-bcf75c4f6-gn6nd:2343:2381 [7] NCCL INFO init.cc:1501 -> 1
h100-kickoff-llama-70b-chat-20230328-bcf75c4f6-gn6nd:2341:2380 [5] NCCL INFO group.cc:64 -> 1 [Async thread]
h100-kickoff-llama-70b-chat-20230328-bcf75c4f6-gn6nd:2337:2378 [1] NCCL INFO group.cc:64 -> 1 [Async thread]
h100-kickoff-llama-70b-chat-20230328-bcf75c4f6-gn6nd:2336:2377 [0] NCCL INFO group.cc:64 -> 1 [Async thread]
h100-kickoff-llama-70b-chat-20230328-bcf75c4f6-gn6nd:2342:2382 [6] NCCL INFO group.cc:64 -> 1 [Async thread]
h100-kickoff-llama-70b-chat-20230328-bcf75c4f6-gn6nd:2343:2381 [7] NCCL INFO group.cc:64 -> 1 [Async thread]
h100-kickoff-llama-70b-chat-20230328-bcf75c4f6-gn6nd:2340:2340 [4] NCCL INFO group.cc:418 -> 1
h100-kickoff-llama-70b-chat-20230328-bcf75c4f6-gn6nd:2340:2340 [4] NCCL INFO init.cc:1876 -> 1
h100-kickoff-llama-70b-chat-20230328-bcf75c4f6-gn6nd:2338:2338 [2] NCCL INFO group.cc:418 -> 1
h100-kickoff-llama-70b-chat-20230328-bcf75c4f6-gn6nd:2338:2338 [2] NCCL INFO init.cc:1876 -> 1
h100-kickoff-llama-70b-chat-20230328-bcf75c4f6-gn6nd:2342:2342 [6] NCCL INFO group.cc:418 -> 1
h100-kickoff-llama-70b-chat-20230328-bcf75c4f6-gn6nd:2341:2341 [5] NCCL INFO group.cc:418 -> 1
h100-kickoff-llama-70b-chat-20230328-bcf75c4f6-gn6nd:2342:2342 [6] NCCL INFO init.cc:1876 -> 1
h100-kickoff-llama-70b-chat-20230328-bcf75c4f6-gn6nd:2341:2341 [5] NCCL INFO init.cc:1876 -> 1
h100-kickoff-llama-70b-chat-20230328-bcf75c4f6-gn6nd:2343:2343 [7] NCCL INFO group.cc:418 -> 1
h100-kickoff-llama-70b-chat-20230328-bcf75c4f6-gn6nd:2336:2336 [0] NCCL INFO group.cc:418 -> 1
h100-kickoff-llama-70b-chat-20230328-bcf75c4f6-gn6nd:2337:2337 [1] NCCL INFO group.cc:418 -> 1
h100-kickoff-llama-70b-chat-20230328-bcf75c4f6-gn6nd:2343:2343 [7] NCCL INFO init.cc:1876 -> 1
h100-kickoff-llama-70b-chat-20230328-bcf75c4f6-gn6nd:2336:2336 [0] NCCL INFO init.cc:1876 -> 1
h100-kickoff-llama-70b-chat-20230328-bcf75c4f6-gn6nd:2337:2337 [1] NCCL INFO init.cc:1876 -> 1
h100-kickoff-llama-70b-chat-20230328-bcf75c4f6-gn6nd:2339:2339 [3] NCCL INFO group.cc:418 -> 1
h100-kickoff-llama-70b-chat-20230328-bcf75c4f6-gn6nd:2339:2339 [3] NCCL INFO init.cc:1876 -> 1
h100-kickoff-llama-70b-chat-20230328-bcf75c4f6-gn6nd:2340:2395 [4] NCCL INFO [Service thread] Connection closed by localRank 4
h100-kickoff-llama-70b-chat-20230328-bcf75c4f6-gn6nd:2342:2391 [6] NCCL INFO [Service thread] Connection closed by localRank 6
h100-kickoff-llama-70b-chat-20230328-bcf75c4f6-gn6nd:2337:2397 [1] NCCL INFO [Service thread] Connection closed by localRank 1
h100-kickoff-llama-70b-chat-20230328-bcf75c4f6-gn6nd:2338:2400 [2] NCCL INFO [Service thread] Connection closed by localRank 2
h100-kickoff-llama-70b-chat-20230328-bcf75c4f6-gn6nd:2339:2399 [3] NCCL INFO [Service thread] Connection closed by localRank 3
h100-kickoff-llama-70b-chat-20230328-bcf75c4f6-gn6nd:2336:2405 [0] NCCL INFO [Service thread] Connection closed by localRank 0
h100-kickoff-llama-70b-chat-20230328-bcf75c4f6-gn6nd:2343:2403 [7] NCCL INFO [Service thread] Connection closed by localRank 7
h100-kickoff-llama-70b-chat-20230328-bcf75c4f6-gn6nd:2341:2392 [5] NCCL INFO [Service thread] Connection closed by localRank 5
h100-kickoff-llama-70b-chat-20230328-bcf75c4f6-gn6nd:2342:2342 [6] NCCL INFO comm 0x87952f0 rank 6 nranks 8 cudaDev 6 busId ba000 - Abort COMPLETE
h100-kickoff-llama-70b-chat-20230328-bcf75c4f6-gn6nd:2336:2336 [0] NCCL INFO comm 0x7a20aa0 rank 0 nranks 8 cudaDev 0 busId 18000 - Abort COMPLETE
h100-kickoff-llama-70b-chat-20230328-bcf75c4f6-gn6nd:2343:2343 [7] NCCL INFO comm 0x7687540 rank 7 nranks 8 cudaDev 7 busId db000 - Abort COMPLETE
h100-kickoff-llama-70b-chat-20230328-bcf75c4f6-gn6nd:2341:2341 [5] NCCL INFO comm 0x8df0a00 rank 5 nranks 8 cudaDev 5 busId ab000 - Abort COMPLETE
h100-kickoff-llama-70b-chat-20230328-bcf75c4f6-gn6nd:2339:2339 [3] NCCL INFO comm 0x7d01420 rank 3 nranks 8 cudaDev 3 busId 5d000 - Abort COMPLETE
h100-kickoff-llama-70b-chat-20230328-bcf75c4f6-gn6nd:2337:2337 [1] NCCL INFO comm 0x7e3e850 rank 1 nranks 8 cudaDev 1 busId 2a000 - Abort COMPLETE
h100-kickoff-llama-70b-chat-20230328-bcf75c4f6-gn6nd:2338:2338 [2] NCCL INFO comm 0x7c429b0 rank 2 nranks 8 cudaDev 2 busId 3a000 - Abort COMPLETE
h100-kickoff-llama-70b-chat-20230328-bcf75c4f6-gn6nd:2340:2340 [4] NCCL INFO comm 0x8d3a240 rank 4 nranks 8 cudaDev 4 busId 9a000 - Abort COMPLETE
[rank6]: Traceback (most recent call last):
[rank6]: File "/tmp/test.py", line 8, in <module>
[rank6]: dist.all_reduce(data, op=dist.ReduceOp.SUM)
[rank6]: File "/opt/conda/lib/python3.10/site-packages/torch/distributed/c10d_logger.py", line 79, in wrapper
[rank6]: return func(*args, **kwargs)
[rank6]: File "/opt/conda/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py", line 2288, in all_reduce
[rank6]: work = group.allreduce([tensor], opts)
[rank6]: torch.distributed.DistBackendError: NCCL error in: ../torch/csrc/distributed/c10d/NCCLUtils.hpp:275, unhandled cuda error (run with NCCL_DEBUG=INFO for details), NCCL version 2.20.5
[rank6]: ncclUnhandledCudaError: Call to CUDA function failed.
[rank6]: Last error:
[rank6]: Cuda failure 1 'invalid argument'
[rank5]: Traceback (most recent call last):
[rank5]: File "/tmp/test.py", line 8, in <module>
[rank5]: dist.all_reduce(data, op=dist.ReduceOp.SUM)
[rank5]: File "/opt/conda/lib/python3.10/site-packages/torch/distributed/c10d_logger.py", line 79, in wrapper
[rank5]: return func(*args, **kwargs)
[rank5]: File "/opt/conda/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py", line 2288, in all_reduce
[rank5]: work = group.allreduce([tensor], opts)
[rank5]: torch.distributed.DistBackendError: NCCL error in: ../torch/csrc/distributed/c10d/NCCLUtils.hpp:275, unhandled cuda error (run with NCCL_DEBUG=INFO for details), NCCL version 2.20.5
[rank5]: ncclUnhandledCudaError: Call to CUDA function failed.
[rank5]: Last error:
[rank5]: Cuda failure 1 'invalid argument'
[rank1]: Traceback (most recent call last):
[rank1]: File "/tmp/test.py", line 8, in <module>
[rank1]: dist.all_reduce(data, op=dist.ReduceOp.SUM)
[rank1]: File "/opt/conda/lib/python3.10/site-packages/torch/distributed/c10d_logger.py", line 79, in wrapper
[rank1]: return func(*args, **kwargs)
[rank1]: File "/opt/conda/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py", line 2288, in all_reduce
[rank1]: work = group.allreduce([tensor], opts)
[rank1]: torch.distributed.DistBackendError: NCCL error in: ../torch/csrc/distributed/c10d/NCCLUtils.hpp:275, unhandled cuda error (run with NCCL_DEBUG=INFO for details), NCCL version 2.20.5
[rank1]: ncclUnhandledCudaError: Call to CUDA function failed.
[rank1]: Last error:
[rank1]: Cuda failure 1 'invalid argument'
[rank0]: Traceback (most recent call last):
[rank0]: File "/tmp/test.py", line 8, in <module>
[rank0]: dist.all_reduce(data, op=dist.ReduceOp.SUM)
[rank0]: File "/opt/conda/lib/python3.10/site-packages/torch/distributed/c10d_logger.py", line 79, in wrapper
[rank0]: return func(*args, **kwargs)
[rank0]: File "/opt/conda/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py", line 2288, in all_reduce
[rank0]: work = group.allreduce([tensor], opts)
[rank0]: torch.distributed.DistBackendError: NCCL error in: ../torch/csrc/distributed/c10d/NCCLUtils.hpp:275, unhandled cuda error (run with NCCL_DEBUG=INFO for details), NCCL version 2.20.5
[rank0]: ncclUnhandledCudaError: Call to CUDA function failed.
[rank0]: Last error:
[rank0]: Cuda failure 1 'invalid argument'
[rank4]: Traceback (most recent call last):
[rank4]: File "/tmp/test.py", line 8, in <module>
[rank4]: dist.all_reduce(data, op=dist.ReduceOp.SUM)
[rank4]: File "/opt/conda/lib/python3.10/site-packages/torch/distributed/c10d_logger.py", line 79, in wrapper
[rank4]: return func(*args, **kwargs)
[rank4]: File "/opt/conda/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py", line 2288, in all_reduce
[rank4]: work = group.allreduce([tensor], opts)
[rank4]: torch.distributed.DistBackendError: NCCL error in: ../torch/csrc/distributed/c10d/NCCLUtils.hpp:275, unhandled cuda error (run with NCCL_DEBUG=INFO for details), NCCL version 2.20.5
[rank4]: ncclUnhandledCudaError: Call to CUDA function failed.
[rank4]: Last error:
[rank4]: Cuda failure 1 'invalid argument'
[rank2]: Traceback (most recent call last):
[rank2]: File "/tmp/test.py", line 8, in <module>
[rank2]: dist.all_reduce(data, op=dist.ReduceOp.SUM)
[rank2]: File "/opt/conda/lib/python3.10/site-packages/torch/distributed/c10d_logger.py", line 79, in wrapper
[rank2]: return func(*args, **kwargs)
[rank2]: File "/opt/conda/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py", line 2288, in all_reduce
[rank2]: work = group.allreduce([tensor], opts)
[rank2]: torch.distributed.DistBackendError: NCCL error in: ../torch/csrc/distributed/c10d/NCCLUtils.hpp:275, unhandled cuda error (run with NCCL_DEBUG=INFO for details), NCCL version 2.20.5
[rank2]: ncclUnhandledCudaError: Call to CUDA function failed.
[rank2]: Last error:
[rank2]: Cuda failure 1 'invalid argument'
[rank7]: Traceback (most recent call last):
[rank7]: File "/tmp/test.py", line 8, in <module>
[rank7]: dist.all_reduce(data, op=dist.ReduceOp.SUM)
[rank7]: File "/opt/conda/lib/python3.10/site-packages/torch/distributed/c10d_logger.py", line 79, in wrapper
[rank7]: return func(*args, **kwargs)
[rank7]: File "/opt/conda/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py", line 2288, in all_reduce
[rank7]: work = group.allreduce([tensor], opts)
[rank7]: torch.distributed.DistBackendError: NCCL error in: ../torch/csrc/distributed/c10d/NCCLUtils.hpp:275, unhandled cuda error (run with NCCL_DEBUG=INFO for details), NCCL version 2.20.5
[rank7]: ncclUnhandledCudaError: Call to CUDA function failed.
[rank7]: Last error:
[rank7]: Cuda failure 1 'invalid argument'
[rank3]: Traceback (most recent call last):
[rank3]: File "/tmp/test.py", line 8, in <module>
[rank3]: dist.all_reduce(data, op=dist.ReduceOp.SUM)
[rank3]: File "/opt/conda/lib/python3.10/site-packages/torch/distributed/c10d_logger.py", line 79, in wrapper
[rank3]: return func(*args, **kwargs)
[rank3]: File "/opt/conda/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py", line 2288, in all_reduce
[rank3]: work = group.allreduce([tensor], opts)
[rank3]: torch.distributed.DistBackendError: NCCL error in: ../torch/csrc/distributed/c10d/NCCLUtils.hpp:275, unhandled cuda error (run with NCCL_DEBUG=INFO for details), NCCL version 2.20.5
[rank3]: ncclUnhandledCudaError: Call to CUDA function failed.
[rank3]: Last error:
[rank3]: Cuda failure 1 'invalid argument'
[rank0]:[W819 19:29:00.828498674 ProcessGroupNCCL.cpp:1168] Warning: WARNING: process group has NOT been destroyed before we destruct ProcessGroupNCCL. On normal program exit, the application should call destroy_process_group to ensure that any pending NCCL operations have finished in this process. In rare cases this process can exit before this point and block the progress of another member of the process group. This constraint has always been present, but this warning has only been added since PyTorch 2.4 (function operator())
W0819 19:29:00.785000 140406652626752 torch/distributed/elastic/multiprocessing/api.py:858] Sending process 2336 closing signal SIGTERM
W0819 19:29:00.786000 140406652626752 torch/distributed/elastic/multiprocessing/api.py:858] Sending process 2337 closing signal SIGTERM
W0819 19:29:00.786000 140406652626752 torch/distributed/elastic/multiprocessing/api.py:858] Sending process 2338 closing signal SIGTERM
W0819 19:29:00.786000 140406652626752 torch/distributed/elastic/multiprocessing/api.py:858] Sending process 2339 closing signal SIGTERM
W0819 19:29:00.786000 140406652626752 torch/distributed/elastic/multiprocessing/api.py:858] Sending process 2340 closing signal SIGTERM
W0819 19:29:00.786000 140406652626752 torch/distributed/elastic/multiprocessing/api.py:858] Sending process 2343 closing signal SIGTERM
E0819 19:29:01.095000 140406652626752 torch/distributed/elastic/multiprocessing/api.py:833] failed (exitcode: 1) local_rank: 5 (pid: 2341) of binary: /opt/conda/bin/python
Traceback (most recent call last):
File "/opt/conda/bin/torchrun", line 8, in <module>
sys.exit(main())
File "/opt/conda/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 348, in wrapper
return f(*args, **kwargs)
File "/opt/conda/lib/python3.10/site-packages/torch/distributed/run.py", line 901, in main
run(args)
File "/opt/conda/lib/python3.10/site-packages/torch/distributed/run.py", line 892, in run
elastic_launch(
File "/opt/conda/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 133, in __call__
return launch_agent(self._config, self._entrypoint, list(args))
File "/opt/conda/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 264, in launch_agent
raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
============================================================
test.py FAILED
------------------------------------------------------------
Failures:
[1]:
time : 2024-08-19_19:29:00
host : h100-kickoff-llama-70b-chat-20230328-bcf75c4f6-gn6nd
rank : 6 (local_rank: 6)
exitcode : 1 (pid: 2342)
error_file: <N/A>
traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
------------------------------------------------------------
Root Cause (first observed failure):
[0]:
time : 2024-08-19_19:29:00
host : h100-kickoff-llama-70b-chat-20230328-bcf75c4f6-gn6nd
rank : 5 (local_rank: 5)
exitcode : 1 (pid: 2341)
error_file: <N/A>
traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
============================================================
This error:
h100-kickoff-llama-70b-chat-20230328-bcf75c4f6-gn6nd:2340:2384 [4] transport/nvls.cc:157 NCCL WARN Cuda failure 1 'invalid argument'
h100-kickoff-llama-70b-chat-20230328-bcf75c4f6-gn6nd:2340:2384 [4] NCCL INFO transport/nvls.cc:328 -> 1
is usually caused by the GPU being reset without following the correct sequence of first stopping and then restarting the NVIDIA Fabric Manager service. See Section 2.2 of the Fabric Manager User Guide.
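For anyone trying to find the relevant lines in a long trace like the one above, here is a minimal sketch that pulls out only the `NCCL WARN` lines and groups them by source location and message. The regular expression is an assumption based on the line format seen in this log (`host:pid:tid [dev] file.cc:line NCCL WARN message`), and `collect_nccl_warnings` is a hypothetical helper name, not part of NCCL or PyTorch.

```python
import re
from collections import defaultdict

# Assumed line shape, taken from the log in this thread:
#   <host>:<pid>:<tid> [<dev>] <file>:<line> NCCL WARN <message>
WARN_RE = re.compile(
    r"^(?P<host>\S+):(?P<pid>\d+):(?P<tid>\d+) \[(?P<dev>\d+)\] "
    r"(?P<src>\S+:\d+) NCCL WARN (?P<msg>.*)$"
)


def collect_nccl_warnings(lines):
    """Map (source location, message) to the sorted list of devices that logged it."""
    hits = defaultdict(list)
    for line in lines:
        match = WARN_RE.match(line.strip())
        if match:
            hits[(match["src"], match["msg"])].append(int(match["dev"]))
    return {key: sorted(devs) for key, devs in hits.items()}


# Three lines copied from the trace, with the host name shortened to "host".
sample = [
    "host:2340:2384 [4] transport/nvls.cc:157 NCCL WARN Cuda failure 1 'invalid argument'",
    "host:2340:2384 [4] NCCL INFO transport/nvls.cc:328 -> 1",
    "host:2338:2379 [2] transport/nvls.cc:157 NCCL WARN Cuda failure 1 'invalid argument'",
]

for (src, msg), devs in collect_nccl_warnings(sample).items():
    print(src, msg, "devices:", devs)
```

If every device reports the same warning from the same source location, as in this log, the cause is more likely node-wide than a single bad GPU.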
@AddyLaddy Thanks a lot for the information! We will check the GPU status.
issue
The NCCL check failed, and the error:
environment
description
When I run with `NCCL_DEBUG=TRACE torchrun --nproc-per-node=2 test.py`, the check passes. When I run with `NCCL_DEBUG=TRACE torchrun --nproc-per-node=8 test.py`, the check fails. The only difference is the `nproc-per-node` number.
Here is the check script in vLLM: https://docs.vllm.ai/en/latest/getting_started/debugging.html#:~:text=is%20working%20correctly.-,%23%20Test%20PyTorch%20NCCL,-import%20torch%0Aimport
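Since the warning comes from `transport/nvls.cc`, one way to check whether the NVLS path is what breaks at 8 processes is to rerun the same check with NVLS disabled. Below is a rough sketch, assuming `torchrun` is on the PATH and `test.py` is the check script. `NCCL_NVLS_ENABLE` is a documented NCCL environment variable; the helper names `nvls_disabled_env`, `torchrun_cmd`, and `run_check` are hypothetical. This is only a diagnostic and does not replace checking the Fabric Manager state.

```python
import os
import subprocess


def nvls_disabled_env(base=None):
    """Return a copy of the environment with NVLS turned off for NCCL."""
    env = dict(os.environ if base is None else base)
    env["NCCL_DEBUG"] = "INFO"       # keep NCCL logging on for comparison
    env["NCCL_NVLS_ENABLE"] = "0"    # ask NCCL not to use NVLink SHARP (NVLS)
    return env


def torchrun_cmd(script="test.py", nproc=8):
    """Build the same torchrun command line used in this issue."""
    return ["torchrun", f"--nproc-per-node={nproc}", script]


def run_check(script="test.py", nproc=8):
    """Run the check with NVLS disabled and return the exit code."""
    result = subprocess.run(torchrun_cmd(script, nproc), env=nvls_disabled_env())
    return result.returncode
```

Calling `run_check()` on the affected node launches the 8-process check. If it passes with NVLS disabled but fails with it enabled, that points at the NVLS setup rather than the script.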