NVIDIA / nccl-tests

NCCL Tests
BSD 3-Clause "New" or "Revised" License

Multi-node runs work on 3 nodes, but an exception occurs when running on more than 3 nodes. #161

Open songqimao opened 1 year ago

songqimao commented 1 year ago

Running on more than 3 nodes reports an error. With -np 24, any 3 nodes run normally, but adding one more node to them makes the run fail. Here are my run parameters:

mpirun -np 32 -H node1:8,node2:8,node5:8,node3:8 --allow-run-as-root -bind-to none -map-by slot -x NCCL_DEBUG=INFO -x NCCL_IB_GID_INDEX=3 -x NCCL_IB_DISABLE=0 -x NCCL_SOCKET_IFNAME=ib -x NCCL_IB_HCA=mlx5_0,mlx5_1,mlx5_2,mlx5_3,mlx5_4,mlx5_5,mlx5_6, -x NCCL_NET_GDR_LEVEL=2 -x NCCL_IB_QPS_PER_CONNECTION=4 -x NCCL_IB_TC=160 -x NCCL_IB_TIMEOUT=22 -x NCCL_PXN_DISABLE=0 -x NCCL_MIN_CTAS=4 -x LD_LIBRARY_PATH -x PATH -mca coll_hcoll_enable 0 -mca pml ob1 -mca btl_tcp_if_include 12.200.0.0/16 -mca btl ^openib /root/cuda_package/nccl-tests/build/all_reduce_perf -b 1G -e 1G -n 1000 -g 1

[image]

If I change -np to 24 and specify any 3 nodes (from node1-node8) in -H, the run completes normally. Please help me identify the problem. Thanks.

sjeaugey commented 1 year ago

This looks like an MPI issue.

I've never seen that, but maybe the error message shown by MPI is the reason for the error. Did you make sure node5 has the necessary library for compression to work? (e.g. has libz installed)
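One way to check this across all nodes, as a rough sketch: ask the dynamic linker on each host whether it can find a libz shared object. This assumes passwordless ssh to every host and that `ldconfig` is on the remote PATH (true for root on most Linux distributions). `has_libz` and `check_nodes` are hypothetical helper names, not part of NCCL or Open MPI.

```shell
#!/usr/bin/env bash
# Sketch: report whether the dynamic linker on each node can find libz.
# Assumes passwordless ssh and a Linux `ldconfig` on the remote side.

# Reads `ldconfig -p` output on stdin; succeeds if a libz shared object is listed.
has_libz() {
    grep -q 'libz\.so'
}

# Hypothetical helper: check every host passed as an argument.
check_nodes() {
    local node
    for node in "$@"; do
        if ssh "$node" ldconfig -p < /dev/null | has_libz; then
            echo "$node: libz found"
        else
            echo "$node: libz MISSING"
        fi
    done
}
```

For the run in this issue, the call would be `check_nodes node1 node2 node5 node3`. A node reported as missing is a candidate for the MPI error, though the linker cache only shows what is installed, not what the MPI build was linked against.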

songqimao commented 1 year ago

[image]

@sjeaugey Thank you for your reply. I checked the loaded libraries and got the result shown in the screenshot. libz appears to be present. My method of checking may be wrong; please correct me if so.

lcw2 commented 10 months ago

I encountered this issue and resolved it by running yum install zlib and then rebuilding Open MPI.
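The fix described above might look roughly like the sketch below. It only prints the commands (a dry run) so they can be reviewed before being executed on each node. The source directory, the install prefix, and the `zlib-devel` package are assumptions: the comment mentions only `zlib`, and the original configure options are not given in the thread. `rebuild_plan` is a hypothetical helper name.

```shell
#!/usr/bin/env bash
# Sketch: print the commands to install zlib and rebuild Open MPI from source.
# Nothing is executed here; pipe the output to `sh` after reviewing it.
# Assumptions: a yum-based distro, an already unpacked Open MPI source tree,
# and zlib-devel added so that configure can find the zlib headers.

rebuild_plan() {
    local src_dir=$1 prefix=$2
    cat <<EOF
yum install -y zlib zlib-devel
cd $src_dir
./configure --prefix=$prefix
make -j"\$(nproc)"
make install
EOF
}
```

For example, `rebuild_plan /path/to/openmpi-src /path/to/install` prints five commands. Any configure options used in the original build would need to be added to the `./configure` line, and the rebuild would presumably have to be repeated (or the install shared) on every node in the `-H` list.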