Closed: scvance closed this issue 1 year ago
Regarding:
NCCL_IB_HCA='^=mlx5_8,^=mlx5_14'
You should use ^ and = only once each: ^ marks the whole list as an exclude list, and = makes the names exact matches. So I'd set NCCL_IB_HCA=^=mlx5_8,mlx5_14.
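A minimal sketch of the syntax described above, using the device names from this thread (the comments summarize the prefix rules as stated in the reply):

```shell
# NCCL_IB_HCA list syntax sketch.
# The optional prefixes apply to the whole list, so each appears at most once:
#   ^  -> the list is an exclude list
#   =  -> device names are exact matches (so mlx5_1 will not also match mlx5_10)

# Exclude exactly mlx5_8 and mlx5_14; NCCL may use every other HCA:
export NCCL_IB_HCA='^=mlx5_8,mlx5_14'
echo "$NCCL_IB_HCA"
```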
Thanks, this actually helped a lot!!
I am trying to run the all_reduce_perf test on two nodes, both of which have 8 A100 80GB GPUs. I am supposed to be using an InfiniBand HDR network.
I am using the Slurm job manager. When I run the command

NCCL_DEBUG=INFO UCX_NET_DEVICES=mlx5_1:1,mlx5_11:1,mlx5_13:1,mlx5_3:1,mlx5_5:1,mlx5_7:1,mlx5_10:1,mlx5_12:1,mlx5_16:1,mlx5_18:1,mlx5_4:1,mlx5_6:1,mlx5_0:1,mlx5_2:1 UCX_TLS=ud NCCL_IB_HCA='^=mlx5_8,^=mlx5_14' srun --mpi=pmix all_reduce_perf -b 8 -e 128M -f 2 -g 1 -t 1

only two GPUs are used, one per node, and I get an output as follows
I am able to increase the number of GPUs per node until I try to use 5 or more, at which point I get this error:
The program then hangs at this point without terminating. If I open nvidia-smi, I see that all GPUs claim 100% utilization.
Has anyone run into this problem before?
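For reference, here is a sketch of the same launch command with the single ^= prefix suggested in the reply above. Everything besides the NCCL_IB_HCA value is copied unchanged from the question; any site-specific srun options (partition, node list) are omitted here, as they were in the original:

```shell
# Original invocation, with the exclude list corrected to use ^ and = once:
NCCL_DEBUG=INFO \
UCX_NET_DEVICES=mlx5_1:1,mlx5_11:1,mlx5_13:1,mlx5_3:1,mlx5_5:1,mlx5_7:1,mlx5_10:1,mlx5_12:1,mlx5_16:1,mlx5_18:1,mlx5_4:1,mlx5_6:1,mlx5_0:1,mlx5_2:1 \
UCX_TLS=ud \
NCCL_IB_HCA='^=mlx5_8,mlx5_14' \
srun --mpi=pmix all_reduce_perf -b 8 -e 128M -f 2 -g 1 -t 1
```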