Open fj1425fj opened 3 months ago
It seems on server2 rank 3 is not using device 2 but device 0 instead. I'm not actually sure how that's possible given rank 2 is also using the same device, but maybe there is an error in the launch script so you end up with 2 ranks using the same NICs?
Sorry, I made a mistake while editing. Rank 3 is using device 2.
The bad performance might just be a misalignment issue. If you look at the number of elements, the sizes alternate between being aligned to 2 elements and being aligned to 1 element. Given those are floats, we're aligned to 8 bytes or 4 bytes, but never to 16 bytes, which is what gives good performance.
That's because we divide the total size by the number of ranks, so when you run on a number of ranks which is not a power of two, you should use a start size that's a multiple of the number of ranks, e.g. -b 3M instead of -b 2M.
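To make the arithmetic concrete, here is a minimal sketch of the reasoning above. It is not the nccl-tests code: the helper names are made up for illustration, and it assumes 4-byte floats, 3 ranks, a per-rank count obtained by integer division, and sizes that double at each step (as with -f 2).

```python
FLOAT_BYTES = 4  # assumption: the test runs on 4-byte floats


def per_rank_elements(total_bytes, nranks, elt_bytes=FLOAT_BYTES):
    """Elements each rank handles when the total size is split across ranks."""
    return (total_bytes // elt_bytes) // nranks


def alignment_bytes(count, elt_bytes=FLOAT_BYTES, cap=16):
    """Largest power of two (up to cap) that divides the per-rank chunk size in bytes."""
    chunk_bytes = count * elt_bytes
    align = 1
    while align < cap and chunk_bytes % (align * 2) == 0:
        align *= 2
    return align


def report(start_bytes, nranks, steps=4):
    """Return (total_bytes, per_rank_count, alignment) for sizes doubling from start_bytes."""
    rows = []
    size = start_bytes
    for _ in range(steps):
        count = per_rank_elements(size, nranks)
        rows.append((size, count, alignment_bytes(count)))
        size *= 2  # assumption: size doubles each step
    return rows


MIB = 1024 * 1024
for label, start in (("-b 2M", 2 * MIB), ("-b 3M", 3 * MIB)):
    print(label)
    for size, count, align in report(start, nranks=3):
        print(f"  total={size:>9} B  per-rank count={count:>8}  alignment={align} B")
```

With a 2M start on 3 ranks the per-rank chunk alternates between 8-byte and 4-byte alignment, while with a 3M start every chunk is a multiple of 16 bytes.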
Thank you for your answer. In my tests, this phenomenon does not occur when independent IPs are used instead of a bond. Do you know why?
With the RoCE bond, network bandwidth can reach 180+ GB/s per NIC (mlx5_bond_x) when measured with the ib_write_bw tool. With four devices, the alltoall test results were as expected, but with three devices, the bandwidth was only half of what was expected.
Have you ever encountered this phenomenon? What are the possible reasons for this phenomenon? Looking forward to your reply.
The nccl-tests results are as follows:
Test results of four devices:
Test results of three devices: