NVIDIA / nccl

Optimized primitives for collective multi-GPU communication

How to figure out where the big jitter for ncclAllReduce comes from? #361

Open kehuanfeng opened 4 years ago

kehuanfeng commented 4 years ago

Hi NCCL team,

Here is my env,

Ubuntu 18.04
GPU driver 418.67
CUDA 10.0.130
NCCL 2.6.4-1
OFED 5.0
OMPI 4.0

The test performs ncclAllReduce in an infinite loop between two nodes with 8 GPUs each, and we want to check whether the bandwidth is stable. The code looks like this:

while (true) {
   ...
   ncclAllReduce(...)
   cudaStreamSynchronize()
   ....
}
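
For reference, here is a minimal sketch of such a timing loop (a hypothetical benchmarkAllReduce helper; it assumes a ncclComm_t comm, a cudaStream_t stream and device buffers sendbuf/recvbuf of count floats have already been set up elsewhere, e.g. via MPI bootstrapping and ncclCommInitRank):

#include <chrono>
#include <cstdio>
#include <cuda_runtime.h>
#include <nccl.h>

// Sketch of the measurement loop; comm, stream, sendbuf, recvbuf and count
// are assumed to be initialized elsewhere (ncclCommInitRank, cudaMalloc, ...).
void benchmarkAllReduce(ncclComm_t comm, cudaStream_t stream,
                        const float* sendbuf, float* recvbuf, size_t count) {
  const double bytes = count * sizeof(float);
  while (true) {
    auto t0 = std::chrono::steady_clock::now();
    ncclAllReduce(sendbuf, recvbuf, count, ncclFloat, ncclSum, comm, stream);
    cudaStreamSynchronize(stream);  // block until the collective has completed
    auto t1 = std::chrono::steady_clock::now();
    double us = std::chrono::duration<double, std::micro>(t1 - t0).count();
    printf("size: %.0fKB, use_time: %.0fus, speed: %fGB/s\n",
           bytes / 1024, us, bytes * 1e6 / us / (1024.0 * 1024.0 * 1024.0));
  }
}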

And we were surprised to see big jitter during the test, as shown below: in general the elapsed time is less than 1 ms, but sometimes a single call takes more than 200 ms.

size: 512KB, use_time: 757us, speed: 0.645021GB/s
size: 512KB, use_time: 840us, speed: 0.581287GB/s
size: 512KB, use_time: 42313us, speed: 0.011540GB/s
size: 512KB, use_time: 775us, speed: 0.630040GB/s
size: 512KB, use_time: 860us, speed: 0.567769GB/s
size: 512KB, use_time: 852us, speed: 0.573100GB/s
size: 512KB, use_time: 686us, speed: 0.711780GB/s
size: 512KB, use_time: 746us, speed: 0.654533GB/s
size: 512KB, use_time: 869us, speed: 0.561889GB/s
size: 512KB, use_time: 780us, speed: 0.626002GB/s
size: 512KB, use_time: 701us, speed: 0.696550GB/s
size: 512KB, use_time: 688us, speed: 0.709711GB/s
size: 512KB, use_time: 659us, speed: 0.740943GB/s
size: 512KB, use_time: 209966us, speed: 0.002326GB/s
size: 512KB, use_time: 705us, speed: 0.692598GB/s
size: 512KB, use_time: 761us, speed: 0.641631GB/s

So we tried to understand where it comes from: the transport proxy or the kernel on the GPU?
When we dug into the logic of NCCL, we found it's not that straightforward, so we couldn't break down the big jitter.

Could you please suggest what we can do?

BTW, we see the big jitter no matter whether we use RDMA or TCP sockets.

sjeaugey commented 4 years ago

It could be thread scheduling indeed. Pinning different threads to different cores (using sched_setaffinity) can help with that. To pin the NCCL network thread to a specific (set of) core(s), you can call sched_setaffinity before ncclCommInitRank is called, and restore the affinity after the call.
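
As an illustration, a hedged sketch of that approach (save the current mask, restrict it to a chosen core set so the network thread spawned by ncclCommInitRank inherits it, then restore it; the core number is only a placeholder) might look like this:

#define _GNU_SOURCE
#include <sched.h>
#include <nccl.h>

// Sketch only: pin the calling thread before creating the communicator so the
// NCCL network/proxy threads it spawns inherit the affinity, then restore it.
// Core 10 is just an example (e.g. a core on the NIC's NUMA node).
ncclResult_t initCommPinned(ncclComm_t* comm, int nranks, ncclUniqueId id, int rank) {
  cpu_set_t saved, pinned;
  sched_getaffinity(0, sizeof(saved), &saved);    // remember the current mask

  CPU_ZERO(&pinned);
  CPU_SET(10, &pinned);
  sched_setaffinity(0, sizeof(pinned), &pinned);  // threads created now inherit this

  ncclResult_t res = ncclCommInitRank(comm, nranks, id, rank);

  sched_setaffinity(0, sizeof(saved), &saved);    // restore the original affinity
  return res;
}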

kehuanfeng commented 4 years ago

@sjeaugey Thanks for the reply.

I tried pinning the network thread to core 10, which is on the same NUMA node as the NIC, but it didn't help with the issue.

[screenshot: affinity settings]

kehuanfeng commented 4 years ago

Another piece of information: we are upgrading our host system from Linux kernel 3.10 to 4.19, and everything has been working normally on kernel 3.10.

We have suspected that something is wrong with the NIC or its driver, but we can't prove it, since ncclAllReduce also involves a lot of GPU activity. It's also not easy for us to bisect the kernel changes from 3.10 to 4.19.

So the question becomes how to figure out whether it's a network-related or a GPU-related issue, so that we can seek help from the corresponding vendor.

kehuanfeng commented 4 years ago

Below is the nvprof result (running ncclAllReduce across two nodes with a single GPU enabled on each).

mpirun --allow-run-as-root -np 2 -H $node_list -map-by slot -bind-to none \
-x NCCL_DEBUG=INFO \
-x NCCL_ALGO=RING \
-x NCCL_IB_DISABLE=1 \
-x NCCL_SOCKET_IFNAME=bond1 \
-x PATH \
-x LD_LIBRARY_PATH \
-mca btl_tcp_if_include bond1 \
-mca pml ob1 \
-mca btl ^openib \
nvprof ./ring_512KB 2>&1 | tee xxxx.log

We can see that on one node the kernel 'ncclAllReduceRingLLKernel_sum_f32' took 6.62078 s at max! What do you think might cause such large latency in the GPU kernel?

[screenshot: nvprof output]

kehuanfeng commented 4 years ago

I am trying to understand whether this kernel is waiting for remote data which may be blocked somewhere...

And I looked into the kernel source and tried to understand how persistentThread (defined in transport.cc) interacts with the all-reduce kernel, like ncclAllReduceRingLLKernel, which is defined in all_reduce.h.

I am confused by the kernel source: for send, how can storeLL() achieve data communication between nodes or intra-node through one assembly instruction...?

  __device__ void storeLL(union ncclLLFifoLine* dst, uint64_t val, uint32_t flag) {
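    // One 16-byte volatile vector store: {low 32 bits of val, flag, high 32 bits of val, flag} written to the LL FIFO line as a single transaction.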
    asm volatile("st.volatile.global.v4.u32 [%0], {%1,%2,%3,%4};" :: "l"(&dst->i4), "r"((uint32_t)val), "r"(flag), "r"((uint32_t)(val >> 32)), "r"(flag));
  }

sjeaugey commented 4 years ago

It's not surprising to see a maximum of 6 seconds, as it could just be the first call, when the other GPUs are not started yet (so it's pretty much the CUDA/NCCL initialization time).

The problem here is the bumps which repeat over time. Are those regular or random?

There could be different reasons for those: