NVIDIA / nccl

Optimized primitives for collective multi-GPU communication

all2all time consumption fluctuates when the number of nodes becomes large #780

Open · de1star opened this issue 1 year ago

de1star commented 1 year ago

Hi NCCL team, I am training a model on 98 nodes (784 GPUs in total) with 392 MoEs. What I found is that when I train the model on 40 nodes with 160 MoEs, all2all always performs well. However, on 98 nodes the all2all time in the MoE layers fluctuates between 0.0xx s and 3.xxx s, which increases the batch time. My all2all is implemented like this:

    NCCLCHECK(ncclGroupStart());
    for (int i = 0; i < world_size; ++i) {
        NCCLCHECK(ncclSend(
                local_expert_count + n_expert * i,
                n_expert,
                ncclInt32,
                i,
                NCCL_COMM_SYNC(group, device),
                NCCL_STREAM_SYNC(device)));
        NCCLCHECK(ncclRecv(
                global_expert_count + n_expert * i,
                n_expert,
                ncclInt32,
                i,
                NCCL_COMM_SYNC(group, device),
                NCCL_STREAM_SYNC(device)));
    }
    NCCLCHECK(ncclGroupEnd());

I wondered whether I should set NCCL_NTHREADS lower (default is 512) to decrease the GPU workload created by all2all.
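
For reference, the same canonical pattern written against the plain NCCL API is sketched below. It assumes that NCCL_COMM_SYNC(group, device) and NCCL_STREAM_SYNC(device) resolve to the rank's ncclComm_t and cudaStream_t, and that NCCLCHECK is the usual error-checking macro; these names come from my training code, not from NCCL itself.

    #include <cstdio>
    #include <cstdlib>
    #include <cuda_runtime.h>
    #include <nccl.h>

    // Assumed error-checking macro, equivalent to the NCCLCHECK used in the snippet above.
    #define NCCLCHECK(cmd) do {                                   \
        ncclResult_t r = (cmd);                                   \
        if (r != ncclSuccess) {                                   \
            fprintf(stderr, "NCCL error %s:%d: %s\n",             \
                    __FILE__, __LINE__, ncclGetErrorString(r));   \
            exit(EXIT_FAILURE);                                   \
        }                                                         \
    } while (0)

    // Canonical count-exchange alltoall: every rank sends n_expert int32 counts to
    // every other rank and receives the same amount back, all inside one NCCL group
    // so the sends/receives are issued as a single fused operation.
    void exchange_expert_counts(const int *local_expert_count,   // device buffer, world_size * n_expert ints
                                int *global_expert_count,        // device buffer, world_size * n_expert ints
                                int n_expert, int world_size,
                                ncclComm_t comm, cudaStream_t stream) {
        NCCLCHECK(ncclGroupStart());
        for (int i = 0; i < world_size; ++i) {
            // Counts this rank routes to rank i ...
            NCCLCHECK(ncclSend(local_expert_count + (size_t)n_expert * i,
                               n_expert, ncclInt32, i, comm, stream));
            // ... and the counts rank i routes to this rank.
            NCCLCHECK(ncclRecv(global_expert_count + (size_t)n_expert * i,
                               n_expert, ncclInt32, i, comm, stream));
        }
        NCCLCHECK(ncclGroupEnd());
    }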

de1star commented 1 year ago

Our GPUs are A100s, but there is only one IB NIC on each node (with 8 GPUs). Our switch fabric has two layers, spine and leaf. The NCCL version is 2.15.1+cuda11.2, and PXN is in use. The log on 2 nodes:

....
SH-IDC1-10-142-6-117:6149:6149 [3] NCCL INFO === System : maxBw 12.0 totalBw 264.0 ===
SH-IDC1-10-142-6-117:6149:6149 [3] NCCL INFO CPU/0 (1/1/2)
SH-IDC1-10-142-6-117:6149:6149 [3] NCCL INFO + PCI[12.0] - PCI/3B000 (1000c01010000000)
SH-IDC1-10-142-6-117:6149:6149 [3] NCCL INFO               + PCI[12.0] - PCI/44000 (1000c01010de13b8)
SH-IDC1-10-142-6-117:6149:6149 [3] NCCL INFO                             + PCI[12.0] - GPU/46000 (4)
SH-IDC1-10-142-6-117:6149:6149 [3] NCCL INFO                                           + NVL[264.0] - NVS/0
SH-IDC1-10-142-6-117:6149:6149 [3] NCCL INFO               + PCI[12.0] - NIC/48000
SH-IDC1-10-142-6-117:6149:6149 [3] NCCL INFO                             + NET[25.0] - NET/0 (88b88e0003fd7010/1/25.000000)
SH-IDC1-10-142-6-117:6149:6149 [3] NCCL INFO + PCI[12.0] - PCI/5D000 (1000c01010000000)
SH-IDC1-10-142-6-117:6149:6149 [3] NCCL INFO               + PCI[12.0] - PCI/67000 (1000c01010de13b8)
SH-IDC1-10-142-6-117:6149:6149 [3] NCCL INFO                             + PCI[12.0] - GPU/69000 (5)
SH-IDC1-10-142-6-117:6149:6149 [3] NCCL INFO                                           + NVL[264.0] - NVS/0
SH-IDC1-10-142-6-117:6149:6149 [3] NCCL INFO + SYS[9.0] - CPU/1
SH-IDC1-10-142-6-117:6149:6149 [3] NCCL INFO CPU/1 (1/1/2)
SH-IDC1-10-142-6-117:6149:6149 [3] NCCL INFO + PCI[12.0] - PCI/A6000 (1000c01010000000)
SH-IDC1-10-142-6-117:6149:6149 [3] NCCL INFO               + PCI[12.0] - PCI/AE000 (1000c01010de13b8)
SH-IDC1-10-142-6-117:6149:6149 [3] NCCL INFO                             + PCI[12.0] - GPU/B0000 (6)
SH-IDC1-10-142-6-117:6149:6149 [3] NCCL INFO                                           + NVL[264.0] - NVS/0
SH-IDC1-10-142-6-117:6149:6149 [3] NCCL INFO + PCI[12.0] - PCI/CF000 (1000c01010000000)
SH-IDC1-10-142-6-117:6149:6149 [3] NCCL INFO               + PCI[12.0] - PCI/D6000 (1000c01010de13b8)
SH-IDC1-10-142-6-117:6149:6149 [3] NCCL INFO                             + PCI[12.0] - GPU/D8000 (7)
SH-IDC1-10-142-6-117:6149:6149 [3] NCCL INFO                                           + NVL[264.0] - NVS/0
SH-IDC1-10-142-6-117:6149:6149 [3] NCCL INFO + PCI[12.0] - NIC/99000
SH-IDC1-10-142-6-117:6149:6149 [3] NCCL INFO               + NET[3.1] - NET/1 (1ef10003ebc008/1/3.125000)
SH-IDC1-10-142-6-117:6149:6149 [3] NCCL INFO + SYS[9.0] - CPU/0
SH-IDC1-10-142-6-117:6149:6149 [3] NCCL INFO ==========================================
SH-IDC1-10-142-6-117:6149:6149 [3] NCCL INFO GPU/46000 :GPU/46000 (0/5000.000000/LOC) GPU/69000 (2/264.000000/NVL) GPU/B0000 (2/264.000000/NVL) GPU/D8000 (2/264.000000/NVL) CPU/0 (3/12.000000/PHB) CPU/1 (4/9.000000/SYS) NET/0 (4/12.000000/PXB) NET/1 (6/3.125000/SYS) 
SH-IDC1-10-142-6-117:6149:6149 [3] NCCL INFO GPU/69000 :GPU/46000 (2/264.000000/NVL) GPU/69000 (0/5000.000000/LOC) GPU/B0000 (2/264.000000/NVL) GPU/D8000 (2/264.000000/NVL) CPU/0 (3/12.000000/PHB) CPU/1 (4/9.000000/SYS) NET/0 (6/12.000000/PXN) NET/1 (6/3.125000/SYS) 
SH-IDC1-10-142-6-117:6149:6149 [3] NCCL INFO GPU/B0000 :GPU/46000 (2/264.000000/NVL) GPU/69000 (2/264.000000/NVL) GPU/B0000 (0/5000.000000/LOC) GPU/D8000 (2/264.000000/NVL) CPU/0 (4/9.000000/SYS) CPU/1 (3/12.000000/PHB) NET/0 (6/12.000000/PXN) NET/1 (5/3.125000/PHB) 
SH-IDC1-10-142-6-117:6149:6149 [3] NCCL INFO GPU/D8000 :GPU/46000 (2/264.000000/NVL) GPU/69000 (2/264.000000/NVL) GPU/B0000 (2/264.000000/NVL) GPU/D8000 (0/5000.000000/LOC) CPU/0 (4/9.000000/SYS) CPU/1 (3/12.000000/PHB) NET/0 (6/12.000000/PXN) NET/1 (5/3.125000/PHB) 
SH-IDC1-10-142-6-117:6149:6149 [3] NCCL INFO NET/0 :GPU/46000 (4/12.000000/PXB) GPU/69000 (6/12.000000/PHB) GPU/B0000 (7/9.000000/SYS) GPU/D8000 (7/9.000000/SYS) CPU/0 (3/12.000000/PHB) CPU/1 (4/9.000000/SYS) NET/0 (0/5000.000000/LOC) NET/1 (6/3.125000/SYS) 
SH-IDC1-10-142-6-117:6149:6149 [3] NCCL INFO NET/1 :GPU/46000 (6/3.125000/SYS) GPU/69000 (6/3.125000/SYS) GPU/B0000 (5/3.125000/PHB) GPU/D8000 (5/3.125000/PHB) CPU/0 (3/3.125000/SYS) CPU/1 (2/3.125000/PHB) NET/0 (6/3.125000/SYS) NET/1 (0/5000.000000/LOC) 
SH-IDC1-10-142-6-117:6149:6149 [3] NCCL INFO Setting affinity for GPU 3 to ffffffff,00000000,ffffffff
SH-IDC1-10-142-6-117:6149:6149 [3] NCCL INFO Pattern 4, crossNic 0, nChannels 1, bw 12.000000/12.000000, type NVL/PXN, sameChannels 1
SH-IDC1-10-142-6-117:6149:6149 [3] NCCL INFO  0 : NET/0 GPU/4 GPU/5 GPU/6 GPU/7 NET/0
SH-IDC1-10-142-6-117:6149:6149 [3] NCCL INFO Pattern 1, crossNic 0, nChannels 1, bw 24.000000/12.000000, type NVL/PXN, sameChannels 1
SH-IDC1-10-142-6-117:6149:6149 [3] NCCL INFO  0 : NET/0 GPU/4 GPU/5 GPU/6 GPU/7 NET/0
SH-IDC1-10-142-6-117:6149:6149 [3] NCCL INFO Pattern 3, crossNic 0, nChannels 0, bw 0.000000/0.000000, type NVL/PIX, sameChannels 1
.....
sjeaugey commented 1 year ago

Thanks for the report.

Your alltoall implementation is the canonical implementation and as such should get the best performance -- no problem on that front.

Since you have a single NIC for 8 GPUs, you should get approximately 3 GB/s per GPU. With PXN enabled, all traffic will go through NVLink to the GPU closest to the NIC (GPU 4), which also means only a single CPU thread handles all the traffic. While this is more efficient in general, it could cause instability if that thread is competing for CPU time. Disabling PXN with NCCL_P2P_PXN_LEVEL=0 would be worth trying to confirm whether this is related to CPU scheduling or not.

Otherwise, the instability could come from routing in the fabric. Enabling adaptive routing on IB can make a big difference for alltoall performance stability, so you should ensure it is enabled in the IB fabric configuration. You may need to select the right IB Service Level with NCCL_IB_SL if the system administrator enabled adaptive routing only on specific SLs and not on the default SL 0.
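
If it is easier to experiment from the training code than from the job launcher, a minimal sketch of setting these variables is shown below (my assumption: they must be set before the NCCL communicator is created, since NCCL reads them during initialization; exporting them in the launch environment is equivalent).

    #include <cstdlib>

    // Sketch only: set the variables before ncclCommInitRank so NCCL picks them up
    // during communicator initialization.
    static void configure_nccl_env() {
        setenv("NCCL_P2P_PXN_LEVEL", "0", 1);  // disable PXN to test the CPU-scheduling theory above
        setenv("NCCL_IB_SL", "1", 1);          // example SL only; pick one on which adaptive routing is enabled
    }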

A few other questions:

- What does MoE stand for? How does that affect the communication pattern?
- Can you reproduce the issue with the alltoall_perf test from the NCCL tests? If so, when you run with -b 8 -e 4G -f 2, do you see temporal instability (within the same run) or run-to-run instability?

de1star commented 1 year ago

Hi @sjeaugey, thanks for your advice, I will check these on my cluster. Here are some more details.

> What does MoE stand for? How does that affect the communication pattern?

MoE stands for mixture of experts, which is used to increase the number of model parameters without adding more computation. Each process has a unique expert module, and the results of the gate functions (which always run before the expert function) tell each rank where to send its data (ncclSend); each rank then also receives from the others (ncclRecv). When I train the model with more GPUs, the number of experts also becomes larger. (A sketch of this dispatch step is included right after this comment.)

> Can you reproduce the issue with the alltoall_perf test from the NCCL tests? If so, when you run with -b 8 -e 4G -f 2, do you see temporal instability (within the same run) or run-to-run instability?

I have not tried that, because the instability did not occur when I used 320 GPUs; it only occurred with 784 GPUs. I will run the alltoall_perf test next time to see whether there is busbw instability.
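
As referenced in the comment above, here is a minimal sketch of the token-dispatch alltoall that typically follows the count exchange in an MoE layer. Every identifier here (send_buf, recv_buf, send_counts, recv_counts, hidden) is an illustrative assumption rather than a name from the actual training code, and NCCLCHECK is the same error-checking macro as in the earlier sketch.

    // Variable-count dispatch: each rank sends send_counts[i] rows of width `hidden`
    // to rank i and receives recv_counts[i] rows back, as decided by the gate.
    void dispatch_tokens(const float *send_buf, float *recv_buf,         // device buffers
                         const int *send_counts, const int *recv_counts, // host-side counts, length world_size
                         int hidden, int world_size,
                         ncclComm_t comm, cudaStream_t stream) {
        size_t send_off = 0, recv_off = 0;
        NCCLCHECK(ncclGroupStart());
        for (int i = 0; i < world_size; ++i) {
            if (send_counts[i] > 0)
                NCCLCHECK(ncclSend(send_buf + send_off * hidden,
                                   (size_t)send_counts[i] * hidden, ncclFloat, i, comm, stream));
            if (recv_counts[i] > 0)
                NCCLCHECK(ncclRecv(recv_buf + recv_off * hidden,
                                   (size_t)recv_counts[i] * hidden, ncclFloat, i, comm, stream));
            send_off += send_counts[i];
            recv_off += recv_counts[i];
        }
        NCCLCHECK(ncclGroupEnd());
    }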

Jack47 commented 1 year ago

Hi @sjeaugey, thanks for the quick reply. I'm wondering what NCCL_IB_SL means? We are using bare-metal A100 machines, not VMs. The link in the NCCL documentation is broken.

sjeaugey commented 1 year ago

Indeed, it seems the IB spec is no longer available for download. That's annoying.

In any case, the InfiniBand Service Level is a way to separate traffic on the IB fabric. Each SL maps to a given Virtual Lane (VL) in the NICs and switches and uses a different hardware path, so that traffic on one VL never blocks traffic on another VL. That's how, for example, you can avoid interference between compute traffic and filesystem traffic if both use IB.

When configuring the IB Subnet Manager, you can define which SLs have adaptive routing enabled. See section 4.3 of this article:

ar_sl_mask      Bitmask of service levels (SLs) on which AR is enabled.
                Least Significant Bit indicates SL0, etc.
                Default: 0xFFFF
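
In other words, bit k of the mask corresponds to SLk. A tiny illustration of how to read a mask value (the values here are only an example):

    unsigned ar_sl_mask = 0xFFFF;              // value from the subnet manager configuration
    int sl = 1;                                // e.g. the SL you would select with NCCL_IB_SL
    bool ar_enabled = (ar_sl_mask >> sl) & 1;  // LSB is SL0, so bit `sl` tells whether AR is on for that SL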
Jack47 commented 1 year ago

@de1star could we post here the environment variables we want to tune? We can ask sjeaugey for help prioritizing them.

sjeaugey commented 1 year ago

Please note the default value of ar_sl_mask is 0xFFFF, which means adaptive routing is enabled on all SLs, so there is no need to set NCCL_IB_SL. What you want to check is how your IB subnet manager is configured: whether ar_sl_mask has been modified, and what routing algorithm it uses (one that supports AR or not; see section 4.1, e.g. the ar_updn or ar_ftree routing algorithms).

Jack47 commented 1 year ago

Thanks for the kind reminder, we will check opensm to see whether it is using `ar*`.