NVIDIA / nccl-tests

NCCL Tests
BSD 3-Clause "New" or "Revised" License

Performance lack of NCCL Test #201

Open shengode503 opened 4 months ago

shengode503 commented 4 months ago

Hi,

First, thank you for publishing this open-source tool and for the great support! We are seeing poor performance when running the NCCL tests across two nodes in a KVM environment. The results are significantly lower than expected. Please advise us on how to improve them. Thanks.

[System]
System: 2x Supermicro SYS-420GP-TNAR+
CPU: node1: Intel(R) Xeon(R) Platinum 8360Y CPU @ 2.40GHz; node2: Intel(R) Xeon(R) Platinum 8358 CPU @ 2.60GHz
GPU: 8x NVIDIA A100-SXM4-80GB (per node)
IB cards: 1x MCX623106AC-CDAT, 1x MCX653106A-HDAT
Ubuntu: 22.04

[KVM]
QEMU/Hypervisor: 6.2.0
KVM configuration file: (attachment, kvm-cfg.xml)
IB card: 2 cards, 2 ports per card, 1 VF per port, total 4 NICs

[Software in KVM]
NVIDIA CUDA: 12.3
NVIDIA Driver: 545.23.08
NVIDIA MLNX Driver: 5.8-4.1.5.0
NVIDIA Fabric Manager: 545.23.08
NVIDIA NCCL: v2.20.3
UCX: v1.14.0
OpenMPI: v4.1.6
GPUDirect has been enabled through: sudo modprobe nvidia-peermem

[NVIDIA Perftest] We ran Perftest to evaluate performance inside the KVM guest. However, the performance is lower than expected. A single IB card (dual-port, 2 VFs) was passed through to the VM. We ran ib_write_bw across all the GPUs and IB devices. Without any tuning, we get around 85 Gb/s. (attachment, perftest.zip)

[NVIDIA NCCL Test] The NCCL Test has been done on both bare metal and KVM.

With 1 IB card (2 ports per card, 1 VF per port, 2 NICs total), performance is around 17 GB/s (theoretical: 24 GB/s; we got ~21 GB/s on bare metal).
With 2 IB cards (2 ports per card, 1 VF per port, 4 NICs total), performance is around 27 GB/s (theoretical: 48 GB/s).
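
To make the gap concrete, the figures quoted above can be expressed as a fraction of the stated theoretical peak. This is only a small sketch using awk. The helper name `pct_of_peak` is made up for illustration, and the inputs are just the numbers reported in this post.

```shell
#!/bin/sh
# pct_of_peak MEASURED PEAK: print MEASURED as a percentage of PEAK (two decimals).
# Hypothetical helper, not part of nccl-tests.
pct_of_peak() {
    awk -v m="$1" -v p="$2" 'BEGIN { printf "%.2f\n", 100 * m / p }'
}

pct_of_peak 17 24   # 1 card / 2 NICs, as reported
pct_of_peak 21 24   # same setup on bare metal
pct_of_peak 27 48   # 2 cards / 4 NICs, as reported
```

With the reported numbers this prints 70.83, 87.50 and 56.25, so the 4-NIC case falls furthest short of its theoretical figure.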

[BIOS configuration] bios-cfg

[KVM lspci topo] lspci_tvv

[nvidia-smi topo] nvidia-smi_topo

[IB VFs] vfs perftest_logs.zip

Best regards, Kevin

sjeaugey commented 4 months ago

Enabling ACS on the PCI switch is going to hurt performance since all traffic will have to go back to the root complex. You should first disable ACS on the PCI switch, then run perftest and the NCCL tests on bare metal and check that you get the expected performance.

Then, you can re-enable ACS, enable ATS in the NIC, and see if you can get the full performance inside the VM.
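
One way to see which devices currently have ACS active is to look at the `ACSCtl` lines in `lspci -vvv` output, where `SrcValid+` usually indicates that ACS is enabled. The sketch below only parses that text and changes nothing. The helper name `acs_enabled_devices` is hypothetical, and it assumes the usual `lspci` layout where each device starts on an unindented line.

```shell
#!/bin/sh
# acs_enabled_devices: read `lspci -vvv` text on stdin and print the address of
# every device whose ACSCtl line shows SrcValid+ (a common sign that ACS is on).
# Hypothetical helper name; it only parses text and modifies nothing.
acs_enabled_devices() {
    awk '
        /^[0-9a-fA-F]/ { dev = $1 }                          # unindented line starts a new device
        /ACSCtl:/ && index($0, "SrcValid+") { print dev }    # ACS control bits set on that device
    '
}

# Typical use on the host (root is needed to read extended capabilities):
#   sudo lspci -vvv | acs_enabled_devices
```

An empty result after disabling ACS (in the BIOS or otherwise) suggests the setting took effect; any addresses still listed are worth checking against the PCI switch ports between the GPUs and NICs.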

shengode503 commented 4 months ago

Hi Sylvain,

Thanks for the support! We have rerun perftest on bare metal as recommended (with ACS disabled). The attachment is the log, and the snapshot is the nvidia-smi topo. We are now preparing the nccl-test results (bare metal and VM) and the perftest results (VM), and will post them as soon as possible. Thanks!

[image: nvidia-smi topo snapshot]

perftest_logs_0313.zip

Best regards, Kevin

shengode503 commented 4 months ago

Hi Sylvain,

Here are the additional results we collected. We ran the experiments with three different settings. All the logs are in the attachment (test-logs_0315.zip). Please help us check them. Thanks!

We think the bare-metal results are normal, but the others are lower than expected. Could you please help us check whether the KVM configuration and the PCI topology we used are correct? Also, what is the recommended command for running the NCCL test? Below is the command we currently use. Thanks.

mpirun \
 -x NCCL_DEBUG=INFO \
 -x NCCL_IB_HCA=mlx5 \
 -x NCCL_SOCKET_IFNAME=<ifname> \
 -x NCCL_IB_MERGE_VFS=0 \
 --bind-to none \
 -np 16 -host "<node1-ip>:8,<node2-ip>:8" ./all_reduce_perf -b 8  -f 2 -e 8G -g 1
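
When comparing runs of a command like this across the different settings, it can help to pull just the summary figure out of each log. A minimal sketch, assuming the log contains the "Avg bus bandwidth" summary line that nccl-tests prints at the end of a run. The helper name `avg_busbw` and the log file names are placeholders, and the exact spacing of that line is an assumption, so the match is kept loose.

```shell
#!/bin/sh
# avg_busbw: read nccl-tests output on stdin and print the number from the
# "Avg bus bandwidth" summary line. Hypothetical helper; matches loosely
# because the exact formatting of that line is an assumption.
avg_busbw() {
    awk -F: '/Avg bus bandwidth/ { split($2, a, " "); print a[1] }'
}

# Example (log file names are placeholders):
#   avg_busbw < baremetal.log
#   avg_busbw < kvm.log
```

Note that this average covers every message size in the sweep, including the small ones, so it will be lower than the peak bus bandwidth seen at the largest sizes.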

test-logs_0315.zip

Best regards, Kevin

sjeaugey commented 4 months ago

Everything I know is in my comment above. Unfortunately, I'm not an expert at debugging PCIe configuration and VM hypervisor setup.

liayan commented 2 months ago

Hi @shengode503, I was wondering how your tests look today. Also, have you tested enabling ACS and ATS cases as @sjeaugey suggested above?

We have the exact same issue here. We got the best result on bare metal without ACS, but we hit an issue when we tried to enable both ACS and ATS.