Can you check again with `NCCL_PROTO=^LL128` and see if the errors disappear? Another thing worth trying: `NCCL_NVLS_ENABLE=0`.

Also, can you run from 8B to 8GB (`-b 8 -e 8G -f 2`) so that we can see the whole picture?
@sjeaugey
Here are the logs after adding the args `NCCL_PROTO=^LL128`, `NCCL_NVLS_ENABLE=0`, and `-b 8 -e 8G -f 2`.
When setting `NCCL_NVLS_ENABLE=0`, the error disappeared. So the NVLink SHARP feature or the NVSwitch seems to be the cause of this problem.
> Enable the use of NVLink SHARP (NVLS). NVLink SHARP is available in third-generation NVSwitch systems (NVLink4) with Hopper and later GPU architectures, allowing collectives such as `ncclAllReduce` to be offloaded to the NVSwitch domain.

https://docs.nvidia.com/deeplearning/nccl/user-guide/docs/env.html#nccl-nvls-enable
We will check the status and configurations of related components.
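As a starting point, a rough sketch of what we might inspect on each node (the exact service name is an assumption and may differ per setup):

```shell
# Show the GPU interconnect matrix; on an HGX H100 box all GPU pairs should
# report NV18 (NVLink4 through NVSwitch), the domain that NVLS offloads to
nvidia-smi topo -m

# NVSwitch systems need the fabric manager running for NVLink to come up
systemctl status nvidia-fabricmanager
```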
`NCCL_PROTO=^LL128` + `-b 8 -e 8G -f 2`:

```shell
mpirun --hostfile /data/hostfile -x NCCL_PROTO=^LL128 -x NCCL_DEBUG=INFO -x NCCL_IB_GID_INDEX=3 -x NCCL_IB_ADAPTIVE_ROUTING=1 -x NCCL_NET_GDR_LEVEL=1 -x NCCL_IB_TC=98 -x LD_LIBRARY_PATH -x PATH -bind-to none -map-by slot -mca pml ob1 -mca btl ^openib /data/nccl-tests/build/all_reduce_perf -b 8 -e 8G -f 2 -g 1 --parallel_init=1 --op="all"
```
`NCCL_NVLS_ENABLE=0` + `-b 8 -e 8G -f 2`:

```shell
mpirun --hostfile /data/hostfile -x NCCL_NVLS_ENABLE=0 -x NCCL_DEBUG=INFO -x NCCL_IB_GID_INDEX=3 -x NCCL_IB_ADAPTIVE_ROUTING=1 -x NCCL_NET_GDR_LEVEL=1 -x NCCL_IB_TC=98 -x LD_LIBRARY_PATH -x PATH -bind-to none -map-by slot -mca pml ob1 -mca btl ^openib /data/nccl-tests/build/all_reduce_perf -b 8 -e 8G -f 2 -g 1 --parallel_init=1 --op="all"
```
`NCCL_PROTO=^LL128` + `NCCL_NVLS_ENABLE=0` + `-b 8 -e 8G -f 2`:

```shell
mpirun --hostfile /data/hostfile -x NCCL_NVLS_ENABLE=0 -x NCCL_PROTO=^LL128 -x NCCL_DEBUG=INFO -x NCCL_IB_GID_INDEX=3 -x NCCL_IB_ADAPTIVE_ROUTING=1 -x NCCL_NET_GDR_LEVEL=1 -x NCCL_IB_TC=98 -x LD_LIBRARY_PATH -x PATH -bind-to none -map-by slot -mca pml ob1 -mca btl ^openib /data/nccl-tests/build/all_reduce_perf -b 8 -e 8G -f 2 -g 1 --parallel_init=1 --op="all"
```
That is interesting, we've never seen that issue with NVLS before.
I notice that you're not using GDRDMA for the RoCE adapters. Are the NICs connected to the CPU root complex or to a PCI-E switch? Is the `nvidia_peermem` kernel module loaded?
Note, `NCCL_IB_ADAPTIVE_ROUTING=1` will only lower the performance on RoCE, as adaptive routing is an InfiniBand feature.

Also, is there a reason you're using the IB SHARP plugin? Again, IB SHARP is an InfiniBand-only feature.
Maybe we can also get to see the node topo info from `NCCL_TOPO_DUMP_FILE=topo.xml`, or a log file with `NCCL_DEBUG_SUBSYS=INIT,ENV,GRAPH` in addition to `NCCL_DEBUG=INFO`.
> I notice that you're not using GDRDMA for the RoCE adapters. Are the NICs connected to the CPU root complex or to a PCI-E switch? Is the `nvidia_peermem` kernel module loaded?
We think we should use GDRDMA for the RoCE adapters, and indeed the `nvidia_peermem` module does not appear to be loaded. We will check it and install it if needed.
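For anyone following along, a minimal sketch of checking and loading the module on a host (assumes a recent driver that ships `nvidia_peermem` rather than the legacy `nv_peer_mem`):

```shell
# Is the module loaded? (no output means it is not)
lsmod | grep nvidia_peermem

# Load it; it ships with recent NVIDIA datacenter drivers
sudo modprobe nvidia_peermem

# Load it automatically on boot
echo nvidia_peermem | sudo tee /etc/modules-load.d/nvidia-peermem.conf
```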
> Note, `NCCL_IB_ADAPTIVE_ROUTING=1` will only lower the performance on RoCE, as adaptive routing is an InfiniBand feature. Also, is there a reason you're using the IB SHARP plugin? Again, IB SHARP is an InfiniBand-only feature.
We will unset `NCCL_IB_ADAPTIVE_ROUTING`. Thanks for your comments!

Also, does the "IB SHARP plugin" you mentioned mean `nccl_rdma_sharp_plugin`? This plugin is contained in the base image (`nvcr.io/nvidia/pytorch:23.09-py3`). Is it not recommended for RoCE environments? We hope it takes no effect once we set `NCCL_NVLS_ENABLE=0`.
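For what it's worth, a rough way to check whether an external net plugin is present and being loaded (the search path and the `NCCL_NET_PLUGIN=none` opt-out are assumptions on our part, not verified against this image):

```shell
# Look for an external NCCL net plugin library inside the container
find / -name "libnccl-net*.so*" 2>/dev/null

# Assumption: with NCCL_NET_PLUGIN=none, NCCL skips external net plugins and
# falls back to its built-in IB/RoCE transport, isolating the plugin's effect
export NCCL_NET_PLUGIN=none
```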
> Maybe we can also get to see the node topo info from `NCCL_TOPO_DUMP_FILE=topo.xml`, or a log file with `NCCL_DEBUG_SUBSYS=INIT,ENV,GRAPH` in addition to `NCCL_DEBUG=INFO`.
Here are the logs w/ and w/o errors:
`NCCL_NVLS_ENABLE=0`:

```shell
mpirun --hostfile /data/hostfile -x NCCL_TOPO_DUMP_FILE=topo1.xml -x NCCL_DEBUG=INFO -x NCCL_DEBUG_SUBSYS=INIT,ENV,GRAPH -x NCCL_NVLS_ENABLE=0 -x NCCL_PROTO=^LL128 -x NCCL_IB_GID_INDEX=3 -x NCCL_NET_GDR_LEVEL=1 -x NCCL_IB_TC=98 -x LD_LIBRARY_PATH -x PATH -bind-to none -map-by slot -mca pml ob1 -mca btl ^openib /data/nccl-tests/build/all_reduce_perf -b 8 -e 8G -f 2 -g 1 --parallel_init=1 --op="all"
```
`NCCL_NVLS_ENABLE=1`:

```shell
mpirun --hostfile /data/hostfile -x NCCL_TOPO_DUMP_FILE=topo2.xml -x NCCL_DEBUG=INFO -x NCCL_DEBUG_SUBSYS=INIT,ENV,GRAPH -x NCCL_NVLS_ENABLE=1 -x NCCL_PROTO=^LL128 -x NCCL_IB_GID_INDEX=3 -x NCCL_NET_GDR_LEVEL=1 -x NCCL_IB_TC=98 -x LD_LIBRARY_PATH -x PATH -bind-to none -map-by slot -mca pml ob1 -mca btl ^openib /data/nccl-tests/build/all_reduce_perf -b 8 -e 8G -f 2 -g 1 --parallel_init=1 --op="all"
```
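Since the two runs differ only in `NCCL_NVLS_ENABLE`, the topology dumps should be directly comparable; a quick check (file names taken from the commands above):

```shell
# The detected node topology should not depend on NVLS being on or off;
# any difference between the two dumps would be a clue in itself
diff topo1.xml topo2.xml
```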
Weird indeed. I would check the following:

- Try NCCL 2.19.3 (`v2.19.3-1` tag). I think we fixed a couple of issues with the `NVLSTree` algorithm.
- Fix GPU Direct RDMA and see if the issue persists.
- Run single node with NVLS and check there is no data corruption.

> Try NCCL 2.19.3 (`v2.19.3-1` tag). I think we fixed a couple of issues with the `NVLSTree` algorithm. Fix GPU Direct RDMA and see if the issue persists.

We'll try a newer NCCL version after the next nvcr image is released, but that will be several months from now. We will also fix the GPU Direct RDMA issues in our environment then.
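If waiting for the next NGC image is the blocker, building the tagged release from source inside the current container is an alternative; a sketch assuming an sm_90 (H100) target:

```shell
# Build NCCL at the v2.19.3-1 tag mentioned above
git clone https://github.com/NVIDIA/nccl.git
cd nccl
git checkout v2.19.3-1
make -j src.build NVCC_GENCODE="-gencode=arch=compute_90,code=sm_90"

# Put the freshly built library ahead of the one shipped in the image
export LD_LIBRARY_PATH=$PWD/build/lib:$LD_LIBRARY_PATH
```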
> Run single node with NVLS and check there is no data corruption.
We already confirmed that running on a single node shows no errors even with NVLS enabled.
```shell
mpirun --hostfile /data/hostfile -x NCCL_TOPO_DUMP_FILE=topo3.xml -x NCCL_DEBUG=INFO -x NCCL_DEBUG_SUBSYS=INIT,ENV,GRAPH -x NCCL_NVLS_ENABLE=1 -x NCCL_PROTO=^LL128 -x NCCL_IB_GID_INDEX=3 -x NCCL_NET_GDR_LEVEL=PIX -x NCCL_IB_TC=98 -x LD_LIBRARY_PATH -x PATH -bind-to none -map-by slot -mca pml ob1 -mca btl ^openib /data/nccl-tests/build/all_reduce_perf -b 8 -e 8G -f 2 -g 1 --parallel_init=1 --op="all"
mpirun --hostfile /data/hostfile2 -x NCCL_TOPO_DUMP_FILE=topo4.xml -x NCCL_DEBUG=INFO -x NCCL_DEBUG_SUBSYS=INIT,ENV,GRAPH -x NCCL_NVLS_ENABLE=1 -x NCCL_PROTO=^LL128 -x NCCL_IB_GID_INDEX=3 -x NCCL_NET_GDR_LEVEL=PIX -x NCCL_IB_TC=98 -x LD_LIBRARY_PATH -x PATH -bind-to none -map-by slot -mca pml ob1 -mca btl ^openib /data/nccl-tests/build/all_reduce_perf -b 8 -e 8G -f 2 -g 1 --parallel_init=1 --op="all"
```
Finally, we got the expected `all_reduce_perf` results without errors after re-installing the OS, enabling `nvidia_peermem`, and updating the versions and settings of the BIOS/NIC. We are not sure what the exact cause of the previous errors was, but the tests now produce healthy results whether `NCCL_NVLS_ENABLE` is set to 1 or 0.
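As a final sanity check, the same test can be swept over both settings; a sketch reusing the flags from the runs above:

```shell
# Re-run the failing collective with NVLS disabled and enabled to confirm both are clean
for nvls in 0 1; do
  mpirun --hostfile /data/hostfile -x NCCL_NVLS_ENABLE=$nvls -x NCCL_DEBUG=WARN \
    -x NCCL_IB_GID_INDEX=3 -x NCCL_NET_GDR_LEVEL=PIX -x NCCL_IB_TC=98 \
    -x LD_LIBRARY_PATH -x PATH -bind-to none -map-by slot -mca pml ob1 -mca btl ^openib \
    /data/nccl-tests/build/all_reduce_perf -b 8 -e 8G -f 2 -g 1 --parallel_init=1 --op=sum
done
```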
The following documentation was (partially) helpful for us:
Let me close this issue. Thank you for providing NCCL debugging knowledge and advice; it was greatly helpful!
Hello, we found that `all_reduce_perf` (op=sum) always fails when the size exceeds a specific value in our environment. The situation is as follows:

- `all_reduce_perf` with `--op=sum` always fails with `out of bounds values` errors above a specific size.
- `all_reduce_perf` with the other operations (`prod`, `min`, `max`, `avg`, `mulsum`) shows no errors or wrong values.
- `allgather_perf`, `alltoall_perf`, and `reduce_perf` show no errors or wrong values.

Has anyone seen this kind of situation?
Environment
We used `nvcr.io/nvidia/pytorch:23.09-py3`-based custom containers on k8s.

- NCCL: 2.18.5-1+cuda12.2
- CUDA: V12.2.128
- Open MPI: 4.1.5rc2
- NVIDIA driver: 535.104.12
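For reference, these versions can be collected inside the container with commands along these lines:

```shell
# NCCL version as seen by PyTorch, e.g. (2, 18, 5)
python -c "import torch; print(torch.cuda.nccl.version())"

# CUDA toolkit version
nvcc --version

# MPI version
mpirun --version

# NVIDIA driver version
nvidia-smi --query-gpu=driver_version --format=csv,noheader
```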
Log
```
$ mpirun --hostfile /data/hostfile -x NCCL_IB_GID_INDEX=3 -x NCCL_IB_ADAPTIVE_ROUTING=1 -x NCCL_NET_GDR_LEVEL=1 -x NCCL_IB_TC=98 -x LD_LIBRARY_PATH -x PATH -bind-to none -map-by slot -mca pml ob1 -mca btl ^openib /data/nccl-tests/build/all_reduce_perf -b 192K -e 200K -f 1 -i 1024 -g 1 --parallel_init=1 --op="all"
...
# nThread 1 nGpus 1 minBytes 196608 maxBytes 204800 step: 1024(bytes) warmup iters: 5 iters: 20 agg iters: 1 validation: 1 graph: 0
# Parallel Init Enabled: threads call into NcclInitRank concurrently
#
# Using devices
#  Rank  0 Group  0 Pid  39383 on gpt-neox-worker-0 device  0 [0x19] NVIDIA H100 80GB HBM3
#  Rank  1 Group  0 Pid  39384 on gpt-neox-worker-0 device  1 [0x3b] NVIDIA H100 80GB HBM3
#  Rank  2 Group  0 Pid  39385 on gpt-neox-worker-0 device  2 [0x4c] NVIDIA H100 80GB HBM3
#  Rank  3 Group  0 Pid  39386 on gpt-neox-worker-0 device  3 [0x5d] NVIDIA H100 80GB HBM3
#  Rank  4 Group  0 Pid  39387 on gpt-neox-worker-0 device  4 [0x9b] NVIDIA H100 80GB HBM3
#  Rank  5 Group  0 Pid  39388 on gpt-neox-worker-0 device  5 [0xbb] NVIDIA H100 80GB HBM3
#  Rank  6 Group  0 Pid  39389 on gpt-neox-worker-0 device  6 [0xcb] NVIDIA H100 80GB HBM3
#  Rank  7 Group  0 Pid  39390 on gpt-neox-worker-0 device  7 [0xdb] NVIDIA H100 80GB HBM3
#  Rank  8 Group  0 Pid  17428 on gpt-neox-worker-1 device  0 [0x19] NVIDIA H100 80GB HBM3
#  Rank  9 Group  0 Pid  17429 on gpt-neox-worker-1 device  1 [0x3b] NVIDIA H100 80GB HBM3
#  Rank 10 Group  0 Pid  17430 on gpt-neox-worker-1 device  2 [0x4c] NVIDIA H100 80GB HBM3
#  Rank 11 Group  0 Pid  17431 on gpt-neox-worker-1 device  3 [0x5d] NVIDIA H100 80GB HBM3
#  Rank 12 Group  0 Pid  17432 on gpt-neox-worker-1 device  4 [0x9b] NVIDIA H100 80GB HBM3
#  Rank 13 Group  0 Pid  17433 on gpt-neox-worker-1 device  5 [0xbb] NVIDIA H100 80GB HBM3
#  Rank 14 Group  0 Pid  17434 on gpt-neox-worker-1 device  6 [0xcb] NVIDIA H100 80GB HBM3
#  Rank 15 Group  0 Pid  17435 on gpt-neox-worker-1 device  7 [0xdb] NVIDIA H100 80GB HBM3
...
#                                                              out-of-place                       in-place
#       size         count      type   redop    root     time   algbw   busbw #wrong     time   algbw   busbw #wrong
#        (B)    (elements)                               (us)  (GB/s)  (GB/s)            (us)  (GB/s)  (GB/s)
...
# Out of bounds values : 160 FAILED
# Avg bus bandwidth    : 3.18491
#
Primary job terminated normally, but 1 process returned a non-zero exit code.
Per user-direction, the job has been aborted.
mpirun.real detected that one or more processes exited with non-zero status,
thus causing the job to be terminated. The first process to do so was:

  Process name: [[21561,1],4]
  Exit code:    1
```