```
nvidia-smi topo -m

        GPU0  GPU1  GPU2  GPU3  GPU4  GPU5  GPU6  GPU7  mlx5_0 mlx5_1 mlx5_2 mlx5_3 mlx5_4 mlx5_5 mlx5_6 mlx5_7 CPU Affinity   NUMA Affinity
GPU0    X     NV12  NV12  NV12  NV12  NV12  NV12  NV12  PXB    PXB    NODE   NODE   SYS    SYS    SYS    SYS    0-47,96-143    0
GPU1    NV12  X     NV12  NV12  NV12  NV12  NV12  NV12  PXB    PXB    NODE   NODE   SYS    SYS    SYS    SYS    0-47,96-143    0
GPU2    NV12  NV12  X     NV12  NV12  NV12  NV12  NV12  NODE   NODE   PXB    PXB    SYS    SYS    SYS    SYS    0-47,96-143    0
GPU3    NV12  NV12  NV12  X     NV12  NV12  NV12  NV12  NODE   NODE   PXB    PXB    SYS    SYS    SYS    SYS    0-47,96-143    0
GPU4    NV12  NV12  NV12  NV12  X     NV12  NV12  NV12  SYS    SYS    SYS    SYS    PXB    PXB    NODE   NODE   48-95,144-191  1
GPU5    NV12  NV12  NV12  NV12  NV12  X     NV12  NV12  SYS    SYS    SYS    SYS    PXB    PXB    NODE   NODE   48-95,144-191  1
GPU6    NV12  NV12  NV12  NV12  NV12  NV12  X     NV12  SYS    SYS    SYS    SYS    NODE   NODE   PXB    PXB    48-95,144-191  1
GPU7    NV12  NV12  NV12  NV12  NV12  NV12  NV12  X     SYS    SYS    SYS    SYS    NODE   NODE   PXB    PXB    48-95,144-191  1
mlx5_0  PXB   PXB   NODE  NODE  SYS   SYS   SYS   SYS   X      PIX    NODE   NODE   SYS    SYS    SYS    SYS
mlx5_1  PXB   PXB   NODE  NODE  SYS   SYS   SYS   SYS   PIX    X      NODE   NODE   SYS    SYS    SYS    SYS
mlx5_2  NODE  NODE  PXB   PXB   SYS   SYS   SYS   SYS   NODE   NODE   X      PIX    SYS    SYS    SYS    SYS
mlx5_3  NODE  NODE  PXB   PXB   SYS   SYS   SYS   SYS   NODE   NODE   PIX    X      SYS    SYS    SYS    SYS
mlx5_4  SYS   SYS   SYS   SYS   PXB   PXB   NODE  NODE  SYS    SYS    SYS    SYS    X      PIX    NODE   NODE
mlx5_5  SYS   SYS   SYS   SYS   PXB   PXB   NODE  NODE  SYS    SYS    SYS    SYS    PIX    X      NODE   NODE
mlx5_6  SYS   SYS   SYS   SYS   NODE  NODE  PXB   PXB   SYS    SYS    SYS    SYS    NODE   NODE   X      PIX
mlx5_7  SYS   SYS   SYS   SYS   NODE  NODE  PXB   PXB   SYS    SYS    SYS    SYS    NODE   NODE   PIX    X

Legend:

  X    = Self
  SYS  = Connection traversing PCIe as well as the SMP interconnect between NUMA nodes (e.g., QPI/UPI)
  NODE = Connection traversing PCIe as well as the interconnect between PCIe Host Bridges within a NUMA node
  PHB  = Connection traversing PCIe as well as a PCIe Host Bridge (typically the CPU)
  PXB  = Connection traversing multiple PCIe bridges (without traversing the PCIe Host Bridge)
  PIX  = Connection traversing at most a single PCIe bridge
  NV#  = Connection traversing a bonded set of # NVLinks
```
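For reference, the same pairwise locality levels can be queried programmatically through NVML. A minimal sketch, assuming the `nvidia-ml-py` (`pynvml`) package is installed; note that the common-ancestor query reports PCIe locality only, so the NVLink pairs (the `NV12` entries above) would have to be checked separately, e.g. via `nvmlDeviceGetNvLinkState`:

```python
# Sketch: print the pairwise GPU locality matrix via NVML, using the same
# labels as `nvidia-smi topo -m`. PCIe locality only; NVLink connections
# are not reflected in the common-ancestor level.
import pynvml

pynvml.nvmlInit()
n = pynvml.nvmlDeviceGetCount()
handles = [pynvml.nvmlDeviceGetHandleByIndex(i) for i in range(n)]

# NVML common-ancestor levels -> nvidia-smi topology labels.
LEVELS = {
    pynvml.NVML_TOPOLOGY_SINGLE: "PIX",
    pynvml.NVML_TOPOLOGY_MULTIPLE: "PXB",
    pynvml.NVML_TOPOLOGY_HOSTBRIDGE: "PHB",
    pynvml.NVML_TOPOLOGY_NODE: "NODE",
    pynvml.NVML_TOPOLOGY_SYSTEM: "SYS",
}

for i in range(n):
    row = []
    for j in range(n):
        if i == j:
            row.append("X")
        else:
            level = pynvml.nvmlDeviceGetTopologyCommonAncestor(handles[i], handles[j])
            row.append(LEVELS.get(level, "?"))
    print(f"GPU{i}: " + " ".join(row))

pynvml.nvmlShutdown()
```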
I discussed this with @lzhangzz, and this behavior is known and expected: the communication overhead is relatively high at TP 8, so TP 4 is recommended for Qwen 2 72B Instruct.
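To make the "communication overhead is relatively high at TP 8" point concrete, the all-reduce cost at both group sizes can be measured directly. A minimal sketch, assuming PyTorch with the NCCL backend; the tensor shape approximates an fp16 activation buffer for a 72B model (hidden size 8192), and the script name in the launch commands is illustrative:

```python
import os
import time

import torch
import torch.distributed as dist

# Launch once per TP degree (script name illustrative):
#   torchrun --nproc_per_node=4 allreduce_bench.py
#   torchrun --nproc_per_node=8 allreduce_bench.py
dist.init_process_group("nccl")
local_rank = int(os.environ["LOCAL_RANK"])
torch.cuda.set_device(local_rank)

# 8192 x 8192 fp16 = 128 MiB, roughly one hidden-state-sized buffer.
x = torch.randn(8192, 8192, dtype=torch.float16, device="cuda")

for _ in range(10):  # warmup
    dist.all_reduce(x)
torch.cuda.synchronize()

iters = 100
start = time.perf_counter()
for _ in range(iters):
    dist.all_reduce(x)
torch.cuda.synchronize()
avg_ms = (time.perf_counter() - start) / iters * 1e3

if dist.get_rank() == 0:
    print(f"world_size={dist.get_world_size()} all_reduce avg: {avg_ms:.3f} ms")
dist.destroy_process_group()
```

Running once with `--nproc_per_node=4` and once with `--nproc_per_node=8` gives a rough picture of how much per-step communication time grows when all eight GPUs participate.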
Checklist
Describe the bug
In a recent ranking released by Hugging Face, Qwen 2 72B Instruct achieved a very good rank and has good support for Chinese. I used 8 A100 GPUs to benchmark the performance of Qwen 2 72B Instruct under LMDeploy, and I found that the performance at `tp 4` is much better than at `tp 8`. For benchmark details and results, please refer to https://github.com/zhyncs/sota-bench. I don't know whether I missed some key information during the run, whether there is an issue with the TP implementation in LMDeploy, or whether the communication overhead at TP 8 cancels out the expected performance gains. Could you take a look? Thanks. @lzhangzz @irexyc @lvhan028 (The launch sketch below shows how the two configurations differ.)
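For anyone reproducing the comparison, the TP degree in LMDeploy is set through the engine config. A minimal sketch, assuming LMDeploy's Python `pipeline` API; the Hugging Face model id is illustrative, and a local checkpoint path works as well:

```python
# Sketch: run the same model with TP 4 vs TP 8 by changing only `tp`.
# Model id is illustrative; adjust to your local checkpoint if needed.
from lmdeploy import pipeline, TurbomindEngineConfig

pipe = pipeline(
    "Qwen/Qwen2-72B-Instruct",
    backend_config=TurbomindEngineConfig(tp=4),  # rerun with tp=8 to compare
)
print(pipe(["Hello, how are you?"]))
```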
By the way, profiling with Nsight Systems is less convenient when a single machine has multiple GPUs than when it has just one.
Reproduction
As mentioned above.
Environment
Error traceback
No response