Open songqimao opened 1 year ago
This looks like an MPI issue.
I've never seen that, but maybe the error message shown by MPI is the reason for the error. Did you make sure node5 has the library needed for compression to work (e.g. that libz is installed)?
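One way to run that check on every node is sketched below. It is only an illustration: `lists_lib` is a hypothetical helper (not part of NCCL or Open MPI), and the usage comment assumes passwordless ssh and that `ldconfig -p` is available on each node.

```shell
#!/bin/sh
# lists_lib NAME: reads `ldconfig -p` style output on stdin and succeeds
# if a shared library whose file name starts with NAME.so is listed.
# Hypothetical helper for illustration only.
lists_lib() {
    grep -q -E "^[[:space:]]*$1\.so"
}

# Example use (assumes passwordless ssh to each node):
#   for n in node1 node2 node3 node5; do
#       ssh "$n" ldconfig -p | lists_lib libz || echo "$n: libz not found"
#   done
```

A node that prints "libz not found" would be a candidate for installing zlib. This only checks the dynamic linker cache, so a library installed in a non-standard path may still be reported as missing.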
@sjeaugey Thank you for your reply. I checked the loaded libraries and the result is shown in the figure. libz appears to be present. There may be a problem with my method; please correct me if so.
I encountered this issue and resolved it by running yum install zlib and then rebuilding Open MPI.
The error is reported whenever I run on more than 3 nodes. If I change -np to 24 and pass any 3 of the nodes (node1-node8) to -H, the test runs normally. Here are my run parameters:

mpirun -np 32 -H node1:8,node2:8,node5:8,node3:8 --allow-run-as-root -bind-to none -map-by slot -x NCCL_DEBUG=INFO -x NCCL_IB_GID_INDEX=3 -x NCCL_IB_DISABLE=0 -x NCCL_SOCKET_IFNAME=ib -x NCCL_IB_HCA=mlx5_0,mlx5_1,mlx5_2,mlx5_3,mlx5_4,mlx5_5,mlx5_6, -x NCCL_NET_GDR_LEVEL=2 -x NCCL_IB_QPS_PER_CONNECTION=4 -x NCCL_IB_TC=160 -x NCCL_IB_TIMEOUT=22 -x NCCL_PXN_DISABLE=0 -x NCCL_MIN_CTAS=4 -x LD_LIBRARY_PATH -x PATH -mca coll_hcoll_enable 0 -mca pml ob1 -mca btl_tcp_if_include 12.200.0.0/16 -mca btl ^openib /root/cuda_package/nccl-tests/build/all_reduce_perf -b 1G -e 1G -n 1000 -g 1

Please help me point out the problem. Thanks.
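When trying different node combinations like this, the -np value and the -H list have to stay consistent (8 ranks per node in the command above). A small sketch of how the two could be derived from one node list follows; `host_list` and `total_ranks` are hypothetical helper names, not Open MPI features.

```shell
#!/bin/sh
# host_list SLOTS NODE...: prints NODE:SLOTS entries joined by commas,
# in the form expected by mpirun -H.
host_list() {
    slots=$1; shift
    out=""
    for n in "$@"; do
        out="${out:+$out,}$n:$slots"
    done
    printf '%s\n' "$out"
}

# total_ranks SLOTS NODE...: prints SLOTS multiplied by the number of nodes,
# i.e. the matching value for mpirun -np.
total_ranks() {
    slots=$1; shift
    echo $(( slots * $# ))
}

# Example use (remaining mpirun options omitted):
#   nodes="node1 node2 node5"
#   mpirun -np "$(total_ranks 8 $nodes)" -H "$(host_list 8 $nodes)" ...
```

With 8 slots and node1, node2, node5, `host_list` prints node1:8,node2:8,node5:8 and `total_ranks` prints 24, which matches the 3-node run described above.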