NVIDIA / nccl

Optimized primitives for collective multi-GPU communication

nccl-tests hangs up with specific message sizes #1110

Open yanminjia opened 9 months ago

yanminjia commented 9 months ago

nccl-tests hangs at specific message sizes when I test ReduceScatter (reduce_scatter_perf). For example, in the screenshot below it hangs at the 4G message size; other times it hangs at 128M or 512M. It never produces a result, and I have to quit with Ctrl+C. It looks like the kernel functions were launched.

[screenshot: reduce_scatter_perf output hanging at the 4G message size]

But when I specify certain combinations of NCCL_ALGO/NCCL_PROTO, such as RING/SIMPLE, it works; when I specify NVLSTree/Simple it does not. I'm not sure what went wrong. Could you please give me any clue? Many thanks. @sjeaugey
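
For reference, here is roughly how I launch the test; a minimal sketch of my setup where the binary path, message-size range and GPU count are illustrative rather than my exact command:

# nccl-tests ReduceScatter benchmark on 8 local GPUs, forcing ring/simple (works)
$ NCCL_ALGO=RING NCCL_PROTO=SIMPLE ./build/reduce_scatter_perf -b 8 -e 4G -f 2 -g 8

# same run with NVLSTree/Simple, the combination that hangs for me
$ NCCL_ALGO=NVLSTree NCCL_PROTO=Simple ./build/reduce_scatter_perf -b 8 -e 4G -f 2 -g 8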

sjeaugey commented 9 months ago

There might be a bug with NVLSTree, but it's a bit weird given ReduceScatter does not have an NVLSTree implementation. So when you select that algorithm, I'm not sure what you end up using.

Your bug report is way too sparse though. You should at least mention your platform, GPU model, NCCL version, the command line launch arguments and the results.

yanminjia commented 9 months ago

Many thanks, Sylvain. This issue happens in a very heterogeneous environment including H100 and H800 servers with different PCIe topologies. As you said, NVLSTree cannot be applied to ReduceScatter, so NCCL_ALGO and NCCL_PROTO are left to the tuner. Unfortunately, the tuner picks different NCCL_ALGO and NCCL_PROTO values on different servers, which is why nccl-tests hangs. To be frank, I don't know much about NCCL_PROTO (Simple, LL and LL128), and I don't quite understand how the message size or intra-node topology affects the tuner's choice of NCCL_ALGO and NCCL_PROTO. Could you please shed some light on what NCCL_PROTO means exactly?

sjeaugey commented 9 months ago

There could be a bug with heterogeneous environments. Is it failing when you don't set anything? Setting those environment variables should be avoided as much as possible, especially if you don't understand fully what you're doing.

yanminjia commented 9 months ago

Yes, if I don't specify NCCL_ALGO & NCCL_PROTO, nccl-tests gets stuck at some specific message size. It would be very helpful if you could point me to a link or share some information about NCCL_PROTO. Anyway, many thanks for your information.

sjeaugey commented 9 months ago

Can you try again with NCCL_NVLS_ENABLE=0? Perhaps it's a bug when using NVLS.

yanminjia commented 9 months ago

OK, I will try NCCL_NVLS_ENABLE=0. Currently I set NCCL_ALGO & NCCL_PROTO to ring & simple respectively, and that does work. I'm not sure whether it has any side effects with respect to AI/ML training. Thanks.

sjeaugey commented 9 months ago

I'm not sure whether it has any side effects with respect to AI/ML training.

Certainly less than forcing NCCL_ALGO=RING NCCL_PROTO=SIMPLE. Setting NCCL_ALGO=^NVLSTREE might be even better (if it works) as it would still allow for NVLS on pure intra-node communication.
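
For example, excluding only NVLSTree while leaving everything else to the tuner (a minimal sketch; the test invocation is a placeholder, and quoting the value guards against shells that treat ^ specially):

# exclude NVLSTree from algorithm selection; NVLS itself stays available for intra-node communication
$ NCCL_ALGO="^NVLSTREE" ./build/reduce_scatter_perf -b 8 -e 4G -f 2 -g 8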

dmitrygx commented 4 months ago

@sjeaugey I tried your recommendations, but neither NCCL_NVLS_ENABLE=0 nor NCCL_ALGO=^NVLSTREE helped. I also tried setting them both, but unfortunately no luck. So, as far as I can see, setting NCCL_ALGO=ring and NCCL_PROTO=simple is what helps on my setup. Thanks!

dmitrygx commented 3 months ago

@sjeaugey I reproduced the same issue on 2 nodes with 8 GPUs each with xHPL from https://catalog.ngc.nvidia.com/orgs/nvidia/containers/hpc-benchmarks, using the following command:

mpirun --prefix /opt/hpcx/ompi-ipv6 -x LD_LIBRARY_PATH --allow-run-as-root --bind-to none \
    -mca orte_keep_fqdn_hostnames true -mca oob_tcp_if_include veth \
    -x NCCL_SOCKET_IFNAME=veth -x NCCL_DEBUG=info \
    -mca plm_rsh_num_concurrent 300 -mca routed_radix 600 -mca plm_rsh_no_tree_spawn 1 \
    -mca pmix_base_async_modex 1 --mca btl ^openib -mca pml ucx \
    -x HPC_WORKSPACE -hostfile /slot/sandbox/mpi_hosts.txt -np 16 \
    ./hpl.sh --config hwcfg/auto.sh --config ./hpl_tune.sh --config ./xhpl_custom.sh --dat ~/HPL.dat

The configurations used to reproduce the issue are:

$ cat ./xhpl_custom.sh
#HPC spec config
XHPL=/opt/xhpl_cuda/xhpl_runner
$ cat ./hpl_tune.sh

#Global settings
export GPU_CLOCK_WARNING=1275
export GPU_POWER_WARNING=520
export GPU_PCIE_GEN_WARNING=4
export GPU_PCIE_WIDTH_WARNING=16

# Custom hpl settings
# export UCX_MEMTYPE_CACHE=n
# export UCX_RNDV_SCHEME=get_zcopy

export CUDA_COPY_SPLIT_THRESHOLD_MB=1
export SORT_RANKS=0

export CPU_CORES_PER_RANK=12
export GRID_STRIPE=8
export RANKS_PER_NODE=8
export RANKS_PER_SOCKET=4
export NUM_PI_BUF=6
export NUM_L2_BUF=6
export NUM_L1_BUF=6
export NUM_WORK_BUF=6
export TEST_SYSTEM_PARAMS=1
export ICHUNK_SIZE=768
export SCHUNK_SIZE=768
export CHUNK_SIZE=3456
export NUM_WORK_BUF=6
$ cat ./hwcfg/auto.sh
GPU_AFFINITY="0:1:2:3:4:5:6:7"
MEM_AFFINITY="0:1:1:0:2:3:3:2"
CPU_AFFINITY="0-23:24-47:120-143:96-119:48-71:72-95:168-191:144-167"
CPU_CORES_PER_RANK=20
if [[ -z "${NET_AFFINITY}" ]]; then
    NET_AFFINITY="mlx5_0:mlx5_1:mlx5_1:mlx5_0:mlx5_2:mlx5_3:mlx5_3:mlx5_2"
fi
$ cat ~/HPL.dat
HPLinpack benchmark input file
Innovative Computing Laboratory, University of Tennessee
HPL.out      output file name (if any)
6            device out (6=stdout,7=stderr,file)
1    # of problems sizes (N)
386496         Ns
1             # of NBs
576        NBs
0            PMAP process mapping (0=Row-,1=Column-major)
1            # of process grids (P x Q)
4           Ps
4           Qs
16.0         threshold
1            # of panel fact
2        PFACTs (0=left, 1=Crout, 2=Right)
1            # of recursive stopping criterium
2          NBMINs (>= 1)
1            # of panels in recursion
2            NDIVs
1            # of recursive panel fact.
0          RFACTs (0=left, 1=Crout, 2=Right)
1            # of broadcast
3          BCASTs (0=1rg,1=1rM,2=2rg,3=2rM,4=Lng,5=LnM)
1            # of lookahead depth
0            DEPTHs (>=0)
1            SWAP (0=bin-exch,1=long,2=mix)
192          swapping threshold
1            L1 in (0=transposed,1=no-transposed) form
0            U  in (0=transposed,1=no-transposed) form
1            Equilibration (0=no,1=yes)
8            memory alignment in double (> 0)

Other scripts can be found in the hpc-benchmarks docker image. If something is still missing, please let me know and I'll add it shortly. Thanks!
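
In case it helps, the ring/simple workaround can be injected into the same launch line through Open MPI's -x flags (a sketch; "<arguments as above>" stands for the unchanged options from the command earlier in this comment):

$ mpirun <arguments as above> \
    -x NCCL_ALGO=Ring -x NCCL_PROTO=Simple \
    ./hpl.sh --config hwcfg/auto.sh --config ./hpl_tune.sh --config ./xhpl_custom.sh --dat ~/HPL.dat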