NVIDIA / nccl-tests

NCCL Tests
BSD 3-Clause "New" or "Revised" License

Nsight Profiling: one ncclAllReduce takes too long #184

Open yanminjia opened 8 months ago

yanminjia commented 8 months ago

I am testing AllReduce performance on 3 nodes with 24 GPUs. When I opened the profiling report in Nsight Systems, I was surprised to see one ncclAllReduce call that took about 5 seconds, as the picture below shows. That seems far too long.

[Image: nsys_profiling]

From the nccl-tests code, it looks like this ncclAllReduce happens after the 5 large-size warm-up iterations and the 5 small-size warm-up iterations. Why did this one ncclAllReduce take so much longer than the other ncclAllReduce calls, even though it did not affect the reported algorithm/bus bandwidth?
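For reference, here is a minimal sketch of how the two bandwidth figures relate for AllReduce, using the formula documented in the nccl-tests performance notes (busbw = algbw * 2(n-1)/n). The function names, message size, and timings are hypothetical and are not taken from the run above. The second part only shows the arithmetic of what a 5-second call would do to the average if it were counted. It does not establish where the slow call actually sits relative to the timed loop.

```python
def allreduce_bandwidth(size_bytes, avg_time_s, n_ranks):
    """Return (algbw, busbw) in GB/s for an AllReduce.

    algbw = size / time
    busbw = algbw * 2 * (n - 1) / n   (AllReduce correction factor)
    """
    algbw = size_bytes / avg_time_s / 1e9
    busbw = algbw * 2 * (n_ranks - 1) / n_ranks
    return algbw, busbw


def mean(values):
    """Plain arithmetic mean of a non-empty list of durations."""
    return sum(values) / len(values)


if __name__ == "__main__":
    n_ranks = 24            # 3 nodes x 8 GPUs, as in the setup described above
    size_bytes = 10**9      # hypothetical message size
    timed = [0.01] * 20     # hypothetical per-iteration times, in seconds

    # Bandwidth when only the normal iterations are averaged.
    print(allreduce_bandwidth(size_bytes, mean(timed), n_ranks))

    # Bandwidth if a single 5-second call were part of the same average.
    print(allreduce_bandwidth(size_bytes, mean(timed + [5.0]), n_ranks))
```

With these made-up numbers the second result is far lower than the first, so a 5-second call that leaves the reported bandwidth unchanged was presumably not part of the averaged time.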

Thanks in advance.