NVIDIA / nccl

Optimized primitives for collective multi-GPU communication

nccl-tests allreduce reports a lot of data errors with the NCCL_P2P_DISABLE=1 env or NCCL_PXN_DISABLE=1 env #1199

Open qdbkppkbdq opened 8 months ago

qdbkppkbdq commented 8 months ago

I ran an allreduce test with this cmd on 2 nodes of 8 x H800, each with 8 x 400Gb NICs, but I got a lot of errors:

my cmd: mpirun --allow-run-as-root --debug-daemons -np 16 -H 10.240.32.1:8,10.240.40.1:8 --mca btl_tcp_if_include eth1x --mca oob_tcp_if_include eth1x -x NCCL_IB_GID_INDEX=3 -x NCCL_DEBUG=INFO -x NCCL_DEBUG_SYS=INIT,GRAPH,NET -x LD_LIBRARY_PATH=/usr/local/lib:$LD_LIBRARY_PATH -x NCCL_P2P_DISABLE=1 -x NCCL_NET_PLUGIN=none /root/nccl-tests-master/build/all_reduce_perf -b 1000M -e 1200M -g 1 -i 1048576

[screenshot: all_reduce_perf output showing the data errors]

Why does this happen? I have found that the same thing also happens when I set NCCL_PXN_DISABLE=1. When I remove those envs, the errors disappear. Is there some bug or some mechanism in NCCL that makes this happen?

I was wondering if someone could help me with it. I'm really looking forward to getting your advice on this. Thank you in advance for your time and assistance.

here is the log: nccl.log

AddyLaddy commented 8 months ago

NCCL 2.15.5 is pretty old now as it was released in October 2022. I'd suggest moving to a newer NCCL release like 2.19.x or 2.20.x.

Also, can you check that you compiled the nccl-tests binaries using MPI=1 as I saw lots of init messages in that log.
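(As a reference, a sketch of an MPI-enabled nccl-tests build, assuming MPI, CUDA and NCCL installed under the placeholder paths shown, would look roughly like:

make MPI=1 MPI_HOME=/path/to/mpi CUDA_HOME=/usr/local/cuda NCCL_HOME=/path/to/nccl

The exact paths need to match the installation on both nodes.)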

qdbkppkbdq commented 8 months ago

Thanks for your reply. I have checked my version; it seems that the NCCL version I used is 2.19.4 (screenshot). Thank you for your help and for taking the time to address my question. I appreciate your effort. Could there be any other factors or reasons that might be contributing to this problem?


AddyLaddy commented 8 months ago

Ah OK sorry, wrong log file.

So, you only see the corruption when you set NCCL_P2P_DISABLE=1? Does the data corruption occur only when you use the network adapters? Does setting NCCL_NET_GDR_LEVEL=LOC change the results? Does setting NCCL_NET_FORCE_FLUSH=1 change the results?
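(For reference, these can be passed through mpirun the same way as the other NCCL variables in the command above, e.g. by adding -x NCCL_NET_GDR_LEVEL=LOC or -x NCCL_NET_FORCE_FLUSH=1, one at a time, to isolate their effect.)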

qdbkppkbdq commented 8 months ago


I have tried these envs. The results are as follows:

  1. With NCCL_NET_GDR_LEVEL=LOC, the errors disappeared, but the busBW became very low, about 37 GB/s.

cmd = mpirun --allow-run-as-root --debug-daemons -np 16 -H 10.240.32.1:8,10.240.40.1:8 --mca btl_tcp_if_include eth1x --mca oob_tcp_if_include eth1x -x NCCL_IB_GID_INDEX=3 -x NCCL_DEBUG=INFO -x NCCL_DEBUG_SYS=INIT,GRAPH -x LD_LIBRARY_PATH=/usr/local/lib:$LD_LIBRARY_PATH -x NCCL_P2P_DISABLE=1 -x NCCL_NET_GDR_LEVEL=LOC -x NCCL_NET_PLUGIN=none /home/weilai/nccl-tests-master/build/all_reduce_perf -b 1000M -e 1200M -g 1 -i 104857600

[screenshot: all_reduce_perf output with NCCL_NET_GDR_LEVEL=LOC]

here is the log file: NCCL_NET_GDR_LEVEL_LOC.log

  2. With NCCL_NET_FORCE_FLUSH=1, the errors still exist.

cmd = mpirun --allow-run-as-root --debug-daemons -np 16 -H 10.240.32.1:8,10.240.40.1:8 --mca btl_tcp_if_include eth1x --mca oob_tcp_if_include eth1x -x NCCL_IB_GID_INDEX=3 -x NCCL_DEBUG=INFO -x NCCL_DEBUG_SYS=INIT,GRAPH -x LD_LIBRARY_PATH=/usr/local/lib:$LD_LIBRARY_PATH -x NCCL_P2P_DISABLE=1 -x NCCL_NET_FORCE_FLUSH=1 -x NCCL_NET_PLUGIN=none /home/weilai/nccl-tests-master/build/all_reduce_perf -b 1000M -e 1200M -g 1 -i 104857600

[screenshot: all_reduce_perf output with NCCL_NET_FORCE_FLUSH=1]

here is the log file: NCCL_NET_FORCE_FLUSH1.log

sjeaugey commented 8 months ago

Can you try with NCCL_PROTO=^LL128?

spotluri commented 8 months ago

@qdbkppkbdq your runs were using the wrong env variable for DEBUG logging and hence missing some info. It's NCCL_DEBUG_SUBSYS, not NCCL_DEBUG_SYS.
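(With the corrected name, the debug portion of the mpirun command above would read, for example: -x NCCL_DEBUG=INFO -x NCCL_DEBUG_SUBSYS=INIT,GRAPH,NET.)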

qdbkppkbdq commented 8 months ago

@sjeaugey @spotluri Thanks for the responses. I tried the NCCL_PROTO=^LL128 env and the corrected NCCL_DEBUG_SUBSYS env, but the errors still exist. Here are the cmd and log. [screenshot: all_reduce_perf output with NCCL_PROTO=^LL128]

cmd = mpirun --allow-run-as-root --debug-daemons -np 16 -H 10.240.32.1:8,10.240.40.1:8 --mca btl_tcp_if_include eth1x --mca oob_tcp_if_include eth1x -x NCCL_IB_GID_INDEX=3 -x NCCL_DEBUG=INFO -x NCCL_DEBUG_SUBSYS=INIT,GRAPH -x LD_LIBRARY_PATH=/usr/local/lib:$LD_LIBRARY_PATH -x NCCL_P2P_DISABLE=1 -x NCCL_PROTO=^LL128 -x NCCL_NET_PLUGIN=none /home/weilai/nccl-tests-master/build/all_reduce_perf -b 1000M -e 1200M -g 1 -i 104857600

log file: noneLL128.log

sjeaugey commented 8 months ago

It looks like the topology detection is not working well. We seem to end up with 4 groups: GPUs 0,1,2,3 on the first node, GPUs 4,5,6,7 on the first node, GPUs 0,1,2,3,7 on the second node, GPUs 4,5,6 on the second node.

I don't know why we end up in that situation. Could it be that some ranks get a different value when they read /proc/sys/kernel/random/boot_id in getHostHash (src/misc/utils.cc)?

qdbkppkbdq commented 8 months ago

@sjeaugey Thank you for your insightful response, I really appreciate the time and effort you put into answering my question. Building on what you've shared, I'm curious about another aspect: what is the difference between LL and LL128, and when should I choose them? Could this solution (NCCL_PROTO=^LL128) work in certain scenarios to address the error?

sjeaugey commented 8 months ago

You should not have to set NCCL_PROTO. NCCL should use the right protocols when it can.

Having corrupted data denotes a bug in NCCL in general, except in cases where the hardware corrupts the data (PCI, NVLink, etc.). That's why I'd really like to understand why we land in this situation and then get corrupted data.

sjeaugey commented 8 months ago

To be clearer, there is something really weird about your setup, and the performance you'll get will be really bad anyway. So we need to figure out what's going on, get you back with 8 GPUs per node, and likely, the data corruption will go away.

Could you check the values of /proc/sys/kernel/random/boot_id on each rank? You can run mpirun --tag-output [mpirun args] cat /proc/sys/kernel/random/boot_id to confirm that.
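(For example, reusing the host list and interface options from the earlier runs, an illustrative invocation would be:

mpirun --allow-run-as-root --tag-output -np 16 -H 10.240.32.1:8,10.240.40.1:8 --mca btl_tcp_if_include eth1x --mca oob_tcp_if_include eth1x cat /proc/sys/kernel/random/boot_id

Each node should report a single, distinct boot_id across all of its ranks.)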

qdbkppkbdq commented 7 months ago


@sjeaugey here is the result: [screenshot: per-rank /proc/sys/kernel/random/boot_id values]

sjeaugey commented 7 months ago

Thanks, the boot IDs seem correct to me. For some reason NCCL decides to split the nodes in 2 parts, and in a weird manner (5+3 and 4+4).

We need to dig further to understand why that happens.