kehuanfeng opened 4 years ago
It could be thread scheduling indeed. Pinning different threads to different cores (using sched_setaffinity) can help with that. To pin the NCCL network thread to a specific (set of) core(s), you can call sched_setaffinity before ncclCommInitRank is called, and restore the affinity after the call.
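A minimal sketch of that pattern (not NCCL code; the helper name and the choice of core are illustrative only):

#define _GNU_SOURCE
#include <sched.h>
#include <nccl.h>

// Pin the calling thread to one core while ncclCommInitRank creates its
// helper threads (they inherit this affinity), then restore the old mask.
static void initCommPinned(ncclComm_t* comm, int nranks, ncclUniqueId id, int rank, int core) {
  cpu_set_t oldMask, newMask;
  sched_getaffinity(0, sizeof(oldMask), &oldMask);   // save the current affinity
  CPU_ZERO(&newMask);
  CPU_SET(core, &newMask);                           // e.g. a core on the NIC's NUMA node
  sched_setaffinity(0, sizeof(newMask), &newMask);
  ncclCommInitRank(comm, nranks, id, rank);
  sched_setaffinity(0, sizeof(oldMask), &oldMask);   // restore the original affinity
}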
@sjeaugey Thanks for the reply.
I tried pinning the network thread to core 10, which is on the same NUMA node as the NIC, but it didn't help with the issue.
Another piece of information: we are upgrading our host systems from Linux kernel 3.10 to 4.19, and everything worked normally on kernel 3.10.
We have suspected something is wrong with the NIC or its driver, but we can't prove that, since ncclAllReduce also involves a lot of GPU activity, and it's not easy for us to bisect the kernel changes from 3.10 to 4.19.
So the question becomes how to figure out whether it's a network-related or a GPU-related issue, so that we can seek help from our vendor accordingly.
Below is the nvprof result (from running ncclAllReduce between two nodes with a single GPU enabled on each).
mpirun --allow-run-as-root -np 2 -H $node_list -map-by slot -bind-to none \
-x NCCL_DEBUG=INFO \
-x NCCL_ALGO=RING \
-x NCCL_IB_DISABLE=1 \
-x NCCL_SOCKET_IFNAME=bond1 \
-x PATH \
-x LD_LIBRARY_PATH \
-mca btl_tcp_if_include bond1 \
-mca pml ob1 \
-mca btl ^openib \
nvprof ./ring_512KB 2>&1 | tee xxxx.log
We can see that on one node, the kernel 'ncclAllReduceRingLLKernel_sum_f32' took 6.62078 s at maximum! What do you think might cause such a big latency in a GPU kernel?
I am trying to understand whether this kernel is waiting for remote data that may be blocked somewhere...
I also looked into the kernel source and tried to understand how persistentThread (defined in transport.cc) interacts with the all-reduce kernels, like ncclAllReduceRingLLKernel, which is defined in all_reduce.h.
I am confused by the kernel source; for example, on the send side, how can storeLL() achieve data communication between nodes, or intra-node, through one assembly instruction...?
// storeLL() writes one 16-byte LL fifo line: two 4-byte data words, each
// paired with a 4-byte flag. The volatile vector store goes straight to the
// destination buffer (a peer GPU's buffer for intra-node, or a buffer that the
// network proxy thread polls and forwards for inter-node); the receiver polls
// the flags to know when the data has arrived.
__device__ void storeLL(union ncclLLFifoLine* dst, uint64_t val, uint32_t flag) {
  asm volatile("st.volatile.global.v4.u32 [%0], {%1,%2,%3,%4};"
               :: "l"(&dst->i4), "r"((uint32_t)val), "r"(flag),
                  "r"((uint32_t)(val >> 32)), "r"(flag));
}
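For reference, the matching load side (a simplified sketch based on readLL in NCCL's LL primitives; the exact source differs slightly) spins on the two flag words until they carry the expected value, which is how a single volatile store on the sender becomes a synchronized hand-off without any separate signaling:

__device__ uint64_t readLL(union ncclLLFifoLine* src, uint32_t flag) {
  uint32_t data1, flag1, data2, flag2;
  do {
    // Load the 16-byte line; keep polling until both flags match the expected
    // flag, i.e. until the sender's st.volatile has landed in this buffer.
    asm volatile("ld.volatile.global.v4.u32 {%0,%1,%2,%3}, [%4];"
                 : "=r"(data1), "=r"(flag1), "=r"(data2), "=r"(flag2)
                 : "l"(&src->i4));
  } while ((flag1 != flag) || (flag2 != flag));
  return data1 + (((uint64_t)data2) << 32);
}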
It's not surprising to see a maximum of 6 seconds, as it could be just the first call, when the other GPUs have not started yet (so it's pretty much the CUDA/NCCL initialization time).
The problem here is the bumps which repeat over time. Are those regular or random?
There could be different reasons for those:
Hi NCCL team,
Here is my env:
Ubuntu 18.04, GPU driver 418.67, CUDA 10.0.130, NCCL 2.6.4-1, OFED 5.0, OMPI 4.0
The test performs ncclAllReduce in an infinite loop between two nodes with 8 GPUs each, and we want to check whether the bandwidth is stable or not. The code is like below,
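A sketch of that kind of loop, with hypothetical buffer and variable names rather than the original test code:

#include <stdio.h>
#include <sys/time.h>
#include <cuda_runtime.h>
#include <nccl.h>

static double nowMs(void) {
  struct timeval tv;
  gettimeofday(&tv, NULL);
  return tv.tv_sec * 1e3 + tv.tv_usec * 1e-3;
}

// Time each allreduce (including stream completion) on the host and print outliers.
static void runLoop(ncclComm_t comm, const float* sendbuf, float* recvbuf,
                    size_t count, cudaStream_t stream) {
  for (long iter = 0; ; iter++) {                 // run "infinitely"
    double t0 = nowMs();
    ncclAllReduce(sendbuf, recvbuf, count, ncclFloat, ncclSum, comm, stream);
    cudaStreamSynchronize(stream);                // wait for the collective to finish
    double elapsed = nowMs() - t0;
    if (elapsed > 10.0)                           // normally < 1 ms; flag the jitter
      printf("iter %ld: %.3f ms\n", iter, elapsed);
  }
}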
And we are surprised that there is big jitter during the test, as shown below: in general, the elapsed time is less than 1 ms, but sometimes it shows more than 200 ms of latency.
So we tried to understand where it comes from: the transport proxy, or the kernel on the GPU?
When we dug into the logic of NCCL, we found it's not that straightforward, so we couldn't break down where the big jitter comes from.
Could you please suggest what we can do?
BTW, we can see the big jitter regardless of whether we use RDMA or TCP sockets.