Open qdbkppkbdq opened 8 months ago
NCCL 2.15.5 is pretty old now as it was released in October 2022. I'd suggest moving to a newer NCCL release like 2.19.x or 2.20.x.
Also, can you check that you compiled the nccl-tests binaries using MPI=1 as I saw lots of init messages in that log.
Thanks for your reply. I have checked my version; it seems the NCCL version I'm using is 2.19.4. Thank you for your help and for taking the time to address my question, I appreciate your effort. Could there be any other factors or reasons that might be contributing to this problem?
Ah OK sorry, wrong log file.
So, it only sees the corruption when you set NCCL_P2P_DISABLE=1?
Does the data corruption occur only when you use the network adapters?
Does setting NCCL_NET_GDR_LEVEL=LOC change the results?
Does setting NCCL_NET_FORCE_FLUSH=1 change the results?
I have tried these env variables. The results are as follows:
cmd = mpirun --allow-run-as-root --debug-daemons -np 16 -H 10.240.32.1:8,10.240.40.1:8 --mca btl_tcp_if_include eth1x --mca oob_tcp_if_include eth1x -x NCCL_IB_GID_INDEX=3 -x NCCL_DEBUG=INFO -x NCCL_DEBUG_SYS=INIT,GRAPH -x LD_LIBRARY_PATH=/usr/local/lib:$LD_LIBRARY_PATH -x NCCL_P2P_DISABLE=1 -x NCCL_NET_GDR_LEVEL=LOC -x NCCL_NET_PLUGIN=none /home/weilai/nccl-tests-master/build/all_reduce_perf -b 1000M -e 1200M -g 1 -i 104857600
here is the log file: NCCL_NET_GDR_LEVEL_LOC.log
cmd = mpirun --allow-run-as-root --debug-daemons -np 16 -H 10.240.32.1:8,10.240.40.1:8 --mca btl_tcp_if_include eth1x --mca oob_tcp_if_include eth1x -x NCCL_IB_GID_INDEX=3 -x NCCL_DEBUG=INFO -x NCCL_DEBUG_SYS=INIT,GRAPH -x LD_LIBRARY_PATH=/usr/local/lib:$LD_LIBRARY_PATH -x NCCL_P2P_DISABLE=1 -x NCCL_NET_FORCE_FLUSH=1 -x NCCL_NET_PLUGIN=none /home/weilai/nccl-tests-master/build/all_reduce_perf -b 1000M -e 1200M -g 1 -i 104857600
here is the log file: NCCL_NET_FORCE_FLUSH1.log
Can you try with NCCL_PROTO=^LL128?
@qdbkppkbdq your runs were using the wrong env variable for DEBUG logging and hence are missing some info. It's NCCL_DEBUG_SUBSYS, not NCCL_DEBUG_SYS.
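For reference, the corrected pair of variables looks like this (the subsystem values here just mirror the ones used in the commands above):

```shell
# NCCL_DEBUG sets the verbosity level; NCCL_DEBUG_SUBSYS (note: SUBSYS,
# not SYS) filters which subsystems are logged at that level.
export NCCL_DEBUG=INFO
export NCCL_DEBUG_SUBSYS=INIT,GRAPH
```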
@sjeaugey @spotluri thanks for the response. I tried the NCCL_PROTO=^LL128 env and corrected the NCCL_DEBUG_SUBSYS env, but the error still exists. Here are the cmd and log.
cmd = mpirun --allow-run-as-root --debug-daemons -np 16 -H 10.240.32.1:8,10.240.40.1:8 --mca btl_tcp_if_include eth1x --mca oob_tcp_if_include eth1x -x NCCL_IB_GID_INDEX=3 -x NCCL_DEBUG=INFO -x NCCL_DEBUG_SUBSYS=INIT,GRAPH -x LD_LIBRARY_PATH=/usr/local/lib:$LD_LIBRARY_PATH -x NCCL_P2P_DISABLE=1 -x NCCL_PROTO=^LL128 -x NCCL_NET_PLUGIN=none /home/weilai/nccl-tests-master/build/all_reduce_perf -b 1000M -e 1200M -g 1 -i 104857600
log file: noneLL128.log
It looks like the topology detection is not working well. We seem to end up with 4 groups: GPUs 0,1,2,3 on the first node, GPUs 4,5,6,7 on the first node, GPUs 0,1,2,3,7 on the second node, GPUs 4,5,6 on the second node.
I don't know why we end up in that situation. Could it be that some ranks get a different value when they read /proc/sys/kernel/random/boot_id in getHostHash (src/misc/utils.cc)?
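To illustrate the failure mode being suggested: getHostHash derives a per-host hash from the boot_id string, so if two ranks on the same physical node read different boot_ids (e.g. from separate containers), their hashes differ and NCCL treats them as separate nodes. A minimal sketch, using md5sum as a stand-in for NCCL's internal hash and hypothetical boot_id values:

```shell
# Two hypothetical boot_ids read by two ranks on the same physical node.
id_a="6f4c0a2e-1111-4a2b-9c3d-000000000001"
id_b="6f4c0a2e-1111-4a2b-9c3d-000000000002"
# Hash each string (md5sum stands in for NCCL's actual hash function).
hash_a=$(printf '%s' "$id_a" | md5sum | cut -d' ' -f1)
hash_b=$(printf '%s' "$id_b" | md5sum | cut -d' ' -f1)
# Differing hashes mean NCCL would place these ranks on "different" hosts.
if [ "$hash_a" = "$hash_b" ]; then echo "same host"; else echo "different hosts"; fi
```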
@sjeaugey Thank you for your insightful response, I really appreciate the time and effort you put into answering my question. Building on what you've shared, I'm curious about another aspect. what is the difference between LL and LL128,when should I choose them?Could this solution (NCCL_PROTO=^LL128) work in certain scenarios to address the error?
You should not have to set NCCL_PROTO. NCCL should use the right protocols when it can.
Having corrupted data denotes a bug in NCCL in general, except in cases where the hardware corrupts the data (PCI, NVLink, etc.). That's why I'd really like to understand why we land in this situation and then get corrupted data.
To be clearer, there is something really weird about your setup, and the performance you'll get will be really bad anyway. So we need to figure out what's going on, get you back with 8 GPUs per node, and likely, the data corruption will go away.
Could you check the values of /proc/sys/kernel/random/boot_id on each rank? You can run mpirun --tag-output [mpirun args] cat /proc/sys/kernel/random/boot_id to confirm that.
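Comparing 16 tagged lines by eye is error-prone, so a pipeline like the following can count the distinct boot_ids. The ids below are hypothetical placeholders standing in for the per-rank output of the mpirun command above; more than one distinct id among the ranks of a single node would explain NCCL splitting that node (e.g. the 5+3 grouping seen in the log):

```shell
# Hypothetical boot_ids collected from 8 ranks on one node; in practice,
# pipe the last field of the mpirun --tag-output lines through the same
# sort | uniq -c pipeline.
printf '%s\n' \
  "aaaa-1111" "aaaa-1111" "aaaa-1111" "aaaa-1111" "aaaa-1111" \
  "bbbb-2222" "bbbb-2222" "bbbb-2222" \
  | sort | uniq -c | sort -rn
```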
@sjeaugey here is the result:
Thanks, the boot IDs seem correct to me. For some reason NCCL decides to split the nodes in 2 parts, and in a weird manner (5+3 and 4+4).
We need to dig further to understand why that happens.
I ran an allreduce test with this cmd on 2 nodes of 8 x H800, each with 8 x 400Gb NICs, but I got a lot of errors.
my cmd:
mpirun --allow-run-as-root --debug-daemons -np 16 -H 10.240.32.1:8,10.240.40.1:8 --mca btl_tcp_if_include eth1x --mca oob_tcp_if_include eth1x -x NCCL_IB_GID_INDEX=3 -x NCCL_DEBUG=INFO -x NCCL_DEBUG_SYS=INIT,GRAPH,NET -x LD_LIBRARY_PATH=/usr/local/lib:$LD_LIBRARY_PATH -x NCCL_P2P_DISABLE=1 -x NCCL_NET_PLUGIN=none /root/nccl-tests-master/build/all_reduce_perf -b 1000M -e 1200M -g 1 -i 1048576
Why does this happen? I have also found that this situation occurs when I set NCCL_PXN_DISABLE=1. When I remove those envs, the errors disappear. Is there some bug or mechanism in NCCL that makes this happen?
I was wondering if someone could help me with it. I'm really looking forward to your advice. Thank you in advance for your time and assistance.
here is the log: nccl.log