I am testing AllReduce performance on 3 nodes with 24 GPUs. When I opened the profiling report in Nsight Systems, I was surprised to see one ncclAllReduce call that took about 5 seconds, as the picture below shows. That seems far too long.
From the nccl-tests code, it looks like this ncclAllReduce happens after the 5 iterations of large-size warm-up and the 5 iterations of small-size warm-up. Why did this call take so much longer than the other ncclAllReduce calls, even though it did not affect the reported algorithm/bus bandwidth?
Thanks in advance.