NVIDIA / nccl-tests

NCCL Tests

all_reduce_perf (--op=sum) gets wrong results when size is over a specific value #167

Closed: metaVariable closed this issue 9 months ago

metaVariable commented 11 months ago

Hello, we found that all_reduce_perf (op=sum) always fails when the size exceeds a specific value in our environment.

The situation is as follows:

Has anyone seen this kind of situation?

Environment

We used nvcr.io/nvidia/pytorch:23.09-py3 based custom containers on k8s.

Log

$ mpirun --hostfile /data/hostfile -x NCCL_IB_GID_INDEX=3 -x NCCL_IB_ADAPTIVE_ROUTING=1 -x NCCL_NET_GDR_LEVEL=1 -x NCCL_IB_TC=98 -x LD_LIBRARY_PATH -x PATH -bind-to none -map-by slot -mca pml ob1 -mca btl ^openib /data/nccl-tests/build/all_reduce_perf -b 192K -e 200K -f 1 -i 1024 -g 1 --parallel_init=1 --op="all" ...

nThread 1 nGpus 1 minBytes 196608 maxBytes 204800 step: 1024(bytes) warmup iters: 5 iters: 20 agg iters: 1 validation: 1 graph: 0

Parallel Init Enabled: threads call into NcclInitRank concurrently

#

Using devices

Rank 0 Group 0 Pid 39383 on gpt-neox-worker-0 device 0 [0x19] NVIDIA H100 80GB HBM3

Rank 1 Group 0 Pid 39384 on gpt-neox-worker-0 device 1 [0x3b] NVIDIA H100 80GB HBM3

Rank 2 Group 0 Pid 39385 on gpt-neox-worker-0 device 2 [0x4c] NVIDIA H100 80GB HBM3

Rank 3 Group 0 Pid 39386 on gpt-neox-worker-0 device 3 [0x5d] NVIDIA H100 80GB HBM3

Rank 4 Group 0 Pid 39387 on gpt-neox-worker-0 device 4 [0x9b] NVIDIA H100 80GB HBM3

Rank 5 Group 0 Pid 39388 on gpt-neox-worker-0 device 5 [0xbb] NVIDIA H100 80GB HBM3

Rank 6 Group 0 Pid 39389 on gpt-neox-worker-0 device 6 [0xcb] NVIDIA H100 80GB HBM3

Rank 7 Group 0 Pid 39390 on gpt-neox-worker-0 device 7 [0xdb] NVIDIA H100 80GB HBM3

Rank 8 Group 0 Pid 17428 on gpt-neox-worker-1 device 0 [0x19] NVIDIA H100 80GB HBM3

Rank 9 Group 0 Pid 17429 on gpt-neox-worker-1 device 1 [0x3b] NVIDIA H100 80GB HBM3

Rank 10 Group 0 Pid 17430 on gpt-neox-worker-1 device 2 [0x4c] NVIDIA H100 80GB HBM3

Rank 11 Group 0 Pid 17431 on gpt-neox-worker-1 device 3 [0x5d] NVIDIA H100 80GB HBM3

Rank 12 Group 0 Pid 17432 on gpt-neox-worker-1 device 4 [0x9b] NVIDIA H100 80GB HBM3

Rank 13 Group 0 Pid 17433 on gpt-neox-worker-1 device 5 [0xbb] NVIDIA H100 80GB HBM3

Rank 14 Group 0 Pid 17434 on gpt-neox-worker-1 device 6 [0xcb] NVIDIA H100 80GB HBM3

Rank 15 Group 0 Pid 17435 on gpt-neox-worker-1 device 7 [0xdb] NVIDIA H100 80GB HBM3

...

                                                          out-of-place                      in-place
    size         count      type   redop    root     time   algbw   busbw #wrong     time   algbw   busbw #wrong
     (B)    (elements)                                (us)  (GB/s)  (GB/s)            (us)  (GB/s)  (GB/s)

  196608         49152     float     sum      -1    130.1    1.51    2.83      0    123.4    1.59    2.99      0
  197632         49408     float     sum      -1    122.8    1.61    3.02      0    122.9    1.61    3.01      0
  198656         49664     float     sum      -1    123.0    1.62    3.03      0    122.6    1.62    3.04      0
  199680         49920     float     sum      -1    123.1    1.62    3.04      0    122.5    1.63    3.06      0
  200704         50176     float     sum      -1    80.23    2.50    4.69  523848    77.11    2.60    4.88  523640
  201728         50432     float     sum      -1    76.69    2.63    4.93  527616    76.85    2.63    4.92  527952
  202752         50688     float     sum      -1    76.30    2.66    4.98  531776    76.13    2.66    4.99  532264
  203776         50944     float     sum      -1    76.27    2.67    5.01  535680    76.92    2.65    4.97  535600
  204800         51200     float     sum      -1    76.48    2.68    5.02  539424    76.32    2.68    5.03  539376
  196608         49152     float    prod      -1    124.9    1.57    2.95      0    125.2    1.57    2.95      0
  197632         49408     float    prod      -1    125.6    1.57    2.95      0    125.2    1.58    2.96      0
  198656         49664     float    prod      -1    125.3    1.58    2.97      0    125.4    1.58    2.97      0
  199680         49920     float    prod      -1    125.2    1.59    2.99      0    124.6    1.60    3.00      0
  200704         50176     float    prod      -1    125.4    1.60    3.00      0    124.8    1.61    3.01      0
  201728         50432     float    prod      -1    124.6    1.62    3.04      0    125.0    1.61    3.02      0
  202752         50688     float    prod      -1    126.2    1.61    3.01      0    126.3    1.61    3.01      0
  203776         50944     float    prod      -1    125.1    1.63    3.05      0    126.2    1.62    3.03      0
  204800         51200     float    prod      -1    125.1    1.64    3.07      0    125.7    1.63    3.05      0
  196608         49152     float     max      -1    124.8    1.58    2.95      0    125.0    1.57    2.95      0
  197632         49408     float     max      -1    125.2    1.58    2.96      0    124.7    1.58    2.97      0
  198656         49664     float     max      -1    125.3    1.59    2.97      0    125.0    1.59    2.98      0
  199680         49920     float     max      -1    125.8    1.59    2.98      0    125.1    1.60    2.99      0
  200704         50176     float     max      -1    125.3    1.60    3.00      0    124.6    1.61    3.02      0
  201728         50432     float     max      -1    124.4    1.62    3.04      0    125.0    1.61    3.03      0
  202752         50688     float     max      -1    124.6    1.63    3.05      0    125.9    1.61    3.02      0
  203776         50944     float     max      -1    125.2    1.63    3.05      0    125.8    1.62    3.04      0
  204800         51200     float     max      -1    125.6    1.63    3.06      0    124.6    1.64    3.08      0
  196608         49152     float     min      -1    125.5    1.57    2.94      0    124.9    1.57    2.95      0
  197632         49408     float     min      -1    125.4    1.58    2.96      0    124.4    1.59    2.98      0
  198656         49664     float     min      -1    125.0    1.59    2.98      0    125.9    1.58    2.96      0
  199680         49920     float     min      -1    125.4    1.59    2.98      0    125.2    1.59    2.99      0
  200704         50176     float     min      -1    126.1    1.59    2.99      0    125.4    1.60    3.00      0
  201728         50432     float     min      -1    125.2    1.61    3.02      0    125.3    1.61    3.02      0
  202752         50688     float     min      -1    124.9    1.62    3.04      0    125.0    1.62    3.04      0
  203776         50944     float     min      -1    125.4    1.62    3.05      0    125.1    1.63    3.06      0
  204800         51200     float     min      -1    125.2    1.64    3.07      0    125.2    1.64    3.07      0
  196608         49152     float     avg      -1    125.4    1.57    2.94      0    125.2    1.57    2.94      0
  197632         49408     float     avg      -1    125.3    1.58    2.96      0    125.1    1.58    2.96      0
  198656         49664     float     avg      -1    124.2    1.60    3.00      0    125.0    1.59    2.98      0
  199680         49920     float     avg      -1    124.7    1.60    3.00      0    124.9    1.60    3.00      0
  200704         50176     float     avg      -1    126.2    1.59    2.98      0    127.1    1.58    2.96      0
  201728         50432     float     avg      -1    125.0    1.61    3.03      0    125.5    1.61    3.01      0
  202752         50688     float     avg      -1    124.6    1.63    3.05      0    124.2    1.63    3.06      0
  203776         50944     float     avg      -1    124.8    1.63    3.06      0    125.0    1.63    3.06      0
  204800         51200     float     avg      -1    125.0    1.64    3.07      0    125.1    1.64    3.07      0
  196608         49152     float  mulsum      -1    125.2    1.57    2.94      0    125.2    1.57    2.94      0
  197632         49408     float  mulsum      -1    125.5    1.57    2.95      0    124.9    1.58    2.97      0
  198656         49664     float  mulsum      -1    124.4    1.60    2.99      0    124.8    1.59    2.98      0
  199680         49920     float  mulsum      -1    125.0    1.60    3.00      0    125.1    1.60    2.99      0
  200704         50176     float  mulsum      -1    125.8    1.59    2.99      0    124.9    1.61    3.01      0
  201728         50432     float  mulsum      -1    125.5    1.61    3.01      0    125.8    1.60    3.01      0
  202752         50688     float  mulsum      -1    125.0    1.62    3.04      0    125.1    1.62    3.04      0
  203776         50944     float  mulsum      -1    125.2    1.63    3.05      0    125.0    1.63    3.06      0
  204800         51200     float  mulsum      -1    125.2    1.64    3.07      0    125.4    1.63    3.06      0

...

Out of bounds values : 160 FAILED

Avg bus bandwidth : 3.18491

#


Primary job terminated normally, but 1 process returned a non-zero exit code. Per user-direction, the job has been aborted.


mpirun.real detected that one or more processes exited with non-zero status, thus causing the job to be terminated. The first process to do so was:

Process name: [[21561,1],4] Exit code: 1

sjeaugey commented 11 months ago

Can you check again with NCCL_PROTO=^LL128 and see if the errors disappear?
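
For reference, NCCL_PROTO takes a comma-separated protocol list and a leading ^ excludes the listed entries, so ^LL128 keeps the LL and Simple protocols while disabling LL128. A sketch of how it could be exported through the reporter's mpirun invocation (hostfile and binary paths as above, unchanged flags elided):

$ mpirun --hostfile /data/hostfile -x NCCL_PROTO=^LL128 ... /data/nccl-tests/build/all_reduce_perf -b 192K -e 200K -f 1 -i 1024 -g 1 --op=sum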

sjeaugey commented 11 months ago

Another thing worth trying: NCCL_NVLS_ENABLE=0.

Also, can you run from 8B to 8GB (-b 8 -e 8G -f 2) so that we can see the whole picture?
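
A sketch combining both suggestions with the reporter's original invocation (same hostfile and binary paths, other flags elided):

$ mpirun --hostfile /data/hostfile -x NCCL_NVLS_ENABLE=0 -x NCCL_PROTO=^LL128 ... /data/nccl-tests/build/all_reduce_perf -b 8 -e 8G -f 2 -g 1 --op=all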

metaVariable commented 11 months ago

@sjeaugey Here are the logs when adding the args NCCL_PROTO=^LL128 and NCCL_NVLS_ENABLE=0, together with -b 8 -e 8G -f 2.

When setting NCCL_NVLS_ENABLE=0, the errors disappeared. So NVLink SHARP (NVLS) or the NVSwitch seems to be the cause of this problem.

Enable the use of NVLink SHARP (NVLS). NVLink SHARP is available in third-generation NVSwitch systems (NVLink4) with Hopper and later GPU architectures, allowing collectives such as ncclAllReduce to be offloaded to the NVSwitch domain. https://docs.nvidia.com/deeplearning/nccl/user-guide/docs/env.html#nccl-nvls-enable

We will check the status and configurations of related components.


Appendix: Commands & Logs

+ NCCL_PROTO=^LL128 + -b 8 -e 8G -f 2

+ NCCL_NVLS_ENABLE=0 + -b 8 -e 8G -f 2

+ NCCL_PROTO=^LL128 + NCCL_NVLS_ENABLE=0 + -b 8 -e 8G -f 2

AddyLaddy commented 11 months ago

That is interesting, we've never seen that issue with NVLS before.

I notice that you're not using GDRDMA for the RoCE adapters. Are the NICs connected to the CPU root complex or to a PCI-E switch? Is the nvidia_peermem kernel module loaded?

Note, NCCL_IB_ADAPTIVE_ROUTING=1 will only lower the performance on RoCE, as adaptive routing is an InfiniBand feature. Also, is there a reason you're using the IB SHARP plugin? Again, IB SHARP is an InfiniBand-only feature.

Maybe we can also get to see the node topo info from NCCL_TOPO_DUMP_FILE=topo.xml or a log file with NCCL_DEBUG_SUBSYS=INIT,ENV,GRAPH in addition to NCCL_DEBUG=INFO
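
A sketch of how those debug variables could be added to the same mpirun command (this matches the invocation the reporter posts further down; other flags elided):

$ mpirun --hostfile /data/hostfile -x NCCL_DEBUG=INFO -x NCCL_DEBUG_SUBSYS=INIT,ENV,GRAPH -x NCCL_TOPO_DUMP_FILE=topo.xml ... /data/nccl-tests/build/all_reduce_perf -b 8 -e 8G -f 2 -g 1 --op=all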

metaVariable commented 11 months ago

I notice that you're not using GDRDMA for the RoCE adapters. Are the NICs connected to the CPU root complex or to a PCI-E switch? Is the nvidia_peermem kernel module loaded?

I think we should use GDRDMA for the RoCE adapters, but it also seems that the nvidia_peermem module is not loaded. We will check it and install it if needed.
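
A minimal way to check and load the module on each node (standard Linux commands; assumes nvidia_peermem ships with the installed NVIDIA driver):

$ lsmod | grep nvidia_peermem    # no output means the module is not loaded
$ sudo modprobe nvidia_peermem   # load it now; add it to /etc/modules-load.d/ to persist across reboots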

Note, NCCL_IB_ADAPTIVE_ROUTING=1 will only lower the performance on RoCE as adaptive routing is an InfiniBand feature. Also, is there a reason you're using the IB SHARP plugin? Again, IB SHARP is an InfiniBand only feature.

We would like to unset NCCL_IB_ADAPTIVE_ROUTING. Thanks for your comments!

And, does "IB SHARP plugin" that you mentioned means nccl_rdma_sharp_plugin? This plugin is contained in base image (nvcr.io/nvidia/pytorch:23.09-py3). Is this not recommended for RoCE environments? I hope that this does not work if we set NCCL_NVLS_ENABLE=0.

Maybe we can also get to see the node topo info from NCCL_TOPO_DUMP_FILE=topo.xml or a log file with NCCL_DEBUG_SUBSYS=INIT,ENV,GRAPH in addition to NCCL_DEBUG=INFO

Here are the logs w/ and w/o errors:

sjeaugey commented 11 months ago

Weird indeed. I would check the following:

metaVariable commented 11 months ago

Try NCCL 2.19.3 (v2.19.3-1 tag). I think we fixed a couple of issues with the NVLSTree algorithm.
Fix GPU Direct RDMA and see if the issue persists.

We'll try a newer NCCL version after a new nvcr image is released, but that may be several months later. We will also fix the GPU Direct RDMA issues in our environment.
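
If waiting for the next NGC image is a blocker, one alternative is to build the suggested tag from source inside the container and point the tests at it (a sketch; paths are illustrative, and MPI_HOME may also need to be set when building the tests):

$ git clone -b v2.19.3-1 https://github.com/NVIDIA/nccl.git
$ cd nccl && make -j src.build && cd ..
$ cd nccl-tests && make MPI=1 NCCL_HOME=../nccl/build && cd ..
$ export LD_LIBRARY_PATH=$PWD/nccl/build/lib:$LD_LIBRARY_PATH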

Run single node with NVLS and check there is no data corruption.

We have already confirmed that running on a single node produces no errors even with NVLS enabled.

Case: allreduce on 1 node with NCCL_NVLS_ENABLE=1

$ mpirun --hostfile /data/hostfile -x NCCL_TOPO_DUMP_FILE=topo3.xml -x NCCL_DEBUG=INFO -x NCCL_DEBUG_SUBSYS=INIT,ENV,GRAPH -x NCCL_NVLS_ENABLE=1 -x NCCL_PROTO=^LL128 -x NCCL_IB_GID_INDEX=3 -x NCCL_NET_GDR_LEVEL=PIX -x NCCL_IB_TC=98 -x LD_LIBRARY_PATH -x PATH -bind-to none -map-by slot -mca pml ob1 -mca btl ^openib /data/nccl-tests/build/all_reduce_perf -b 8 -e 8G -f 2 -g 1 --parallel_init=1 --op="all"

metaVariable commented 9 months ago

Finally, we got the expected all_reduce_perf results without errors after re-installing the OS, enabling nvidia_peermem, and updating the versions and settings of the BIOS and NICs.

We are not sure what the exact cause of the previous errors was, but the test now produces healthy results without errors whether NCCL_NVLS_ENABLE is set to 1 or 0.

Case: 2 nodes, NCCL_NVLS_ENABLE=1

Case: 2 nodes, NCCL_NVLS_ENABLE=0

Note

The following documentation was (partially) helpful for us:

metaVariable commented 9 months ago

Let me close this issue. Thank you for sharing your NCCL debugging knowledge and advice; it was a great help!