Nsight Profiling: one ncclAllReduce takes too long

I am testing AllReduce performance on 3 nodes with 24 GPUs. When I opened the profiling report in Nsight Systems, I was surprised to see one ncclAllReduce call that took about 5 seconds, as the picture below shows. That seems far too long.

[Image: nsys_profiling]

Looking at the nccl-tests code, this ncclAllReduce appears to happen after the 5 iterations of large-size warm-up and the 5 iterations of small-size warm-up. Why did this one ncclAllReduce take so much longer than the other ncclAllReduce calls, even though it did not affect the reported algorithm/bus bandwidth?
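For context on the algorithm/bus bandwidth figures mentioned above: as far as I understand the nccl-tests performance notes, algorithm bandwidth is the message size divided by the measured time, and for AllReduce the bus bandwidth is that value scaled by 2(n-1)/n, where n is the number of ranks. The sketch below is only an illustration of that arithmetic; the function name and the sample numbers are hypothetical and are not taken from the profiling report.

```python
def allreduce_bandwidth(size_bytes: int, time_s: float, n_ranks: int) -> tuple[float, float]:
    """Return (algbw, busbw) in GB/s for one AllReduce measurement.

    Assumes the AllReduce correction factor 2*(n-1)/n described in the
    nccl-tests performance notes.
    """
    # Algorithm bandwidth: bytes moved per second, expressed in GB/s.
    algbw = size_bytes / time_s / 1e9
    # Bus bandwidth: algbw scaled by the AllReduce factor 2*(n-1)/n.
    busbw = algbw * 2 * (n_ranks - 1) / n_ranks
    return algbw, busbw


# Hypothetical example: a 1 GiB message averaging 50 ms per iteration on 24 ranks.
algbw, busbw = allreduce_bandwidth(size_bytes=1 << 30, time_s=0.05, n_ranks=24)
print(f"algbw={algbw:.2f} GB/s busbw={busbw:.2f} GB/s")
```

Both figures depend only on the time fed into the calculation, so a call whose duration is not part of that measured time would leave them unchanged. Whether that is the case for the 5-second call here is the open question.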

Thanks in advance.

NVIDIA / nccl-tests

Nsight Profiling: one ncclAllReduce takes too long #184