```
nvidia-smi topo -m

        GPU0  GPU1  GPU2  GPU3  GPU4  GPU5  GPU6  GPU7  mlx5_0 mlx5_1 mlx5_2 mlx5_3 mlx5_4 mlx5_5 mlx5_6 mlx5_7 CPU Affinity   NUMA Affinity
GPU0    X     NV12  NV12  NV12  NV12  NV12  NV12  NV12  PXB    PXB    NODE   NODE   SYS    SYS    SYS    SYS    0-47,96-143    0
GPU1    NV12  X     NV12  NV12  NV12  NV12  NV12  NV12  PXB    PXB    NODE   NODE   SYS    SYS    SYS    SYS    0-47,96-143    0
GPU2    NV12  NV12  X     NV12  NV12  NV12  NV12  NV12  NODE   NODE   PXB    PXB    SYS    SYS    SYS    SYS    0-47,96-143    0
GPU3    NV12  NV12  NV12  X     NV12  NV12  NV12  NV12  NODE   NODE   PXB    PXB    SYS    SYS    SYS    SYS    0-47,96-143    0
GPU4    NV12  NV12  NV12  NV12  X     NV12  NV12  NV12  SYS    SYS    SYS    SYS    PXB    PXB    NODE   NODE   48-95,144-191  1
GPU5    NV12  NV12  NV12  NV12  NV12  X     NV12  NV12  SYS    SYS    SYS    SYS    PXB    PXB    NODE   NODE   48-95,144-191  1
GPU6    NV12  NV12  NV12  NV12  NV12  NV12  X     NV12  SYS    SYS    SYS    SYS    NODE   NODE   PXB    PXB    48-95,144-191  1
GPU7    NV12  NV12  NV12  NV12  NV12  NV12  NV12  X     SYS    SYS    SYS    SYS    NODE   NODE   PXB    PXB    48-95,144-191  1
mlx5_0  PXB   PXB   NODE  NODE  SYS   SYS   SYS   SYS   X      PIX    NODE   NODE   SYS    SYS    SYS    SYS
mlx5_1  PXB   PXB   NODE  NODE  SYS   SYS   SYS   SYS   PIX    X      NODE   NODE   SYS    SYS    SYS    SYS
mlx5_2  NODE  NODE  PXB   PXB   SYS   SYS   SYS   SYS   NODE   NODE   X      PIX    SYS    SYS    SYS    SYS
mlx5_3  NODE  NODE  PXB   PXB   SYS   SYS   SYS   SYS   NODE   NODE   PIX    X      SYS    SYS    SYS    SYS
mlx5_4  SYS   SYS   SYS   SYS   PXB   PXB   NODE  NODE  SYS    SYS    SYS    SYS    X      PIX    NODE   NODE
mlx5_5  SYS   SYS   SYS   SYS   PXB   PXB   NODE  NODE  SYS    SYS    SYS    SYS    PIX    X      NODE   NODE
mlx5_6  SYS   SYS   SYS   SYS   NODE  NODE  PXB   PXB   SYS    SYS    SYS    SYS    NODE   NODE   X      PIX
mlx5_7  SYS   SYS   SYS   SYS   NODE  NODE  PXB   PXB   SYS    SYS    SYS    SYS    NODE   NODE   PIX    X

Legend:

  X    = Self
  SYS  = Connection traversing PCIe as well as the SMP interconnect between NUMA nodes (e.g., QPI/UPI)
  NODE = Connection traversing PCIe as well as the interconnect between PCIe Host Bridges within a NUMA node
  PHB  = Connection traversing PCIe as well as a PCIe Host Bridge (typically the CPU)
  PXB  = Connection traversing multiple PCIe bridges (without traversing the PCIe Host Bridge)
  PIX  = Connection traversing at most a single PCIe bridge
  NV#  = Connection traversing a bonded set of # NVLinks
```
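For reference, the same pairwise locality levels can be queried programmatically through NVML. A minimal sketch, assuming the `nvidia-ml-py` (`pynvml`) package is installed; note that the common-ancestor query reports PCIe locality only, so the NVLink pairs (the `NV12` entries above) would have to be checked separately, e.g. via `nvmlDeviceGetNvLinkState`:

```python
# Sketch: print the pairwise GPU locality matrix via NVML, using the same
# labels as `nvidia-smi topo -m`. PCIe locality only; NVLink connections
# are not reflected in the common-ancestor level.
import pynvml

pynvml.nvmlInit()
n = pynvml.nvmlDeviceGetCount()
handles = [pynvml.nvmlDeviceGetHandleByIndex(i) for i in range(n)]

# NVML common-ancestor levels -> nvidia-smi topology labels.
LEVELS = {
    pynvml.NVML_TOPOLOGY_SINGLE: "PIX",
    pynvml.NVML_TOPOLOGY_MULTIPLE: "PXB",
    pynvml.NVML_TOPOLOGY_HOSTBRIDGE: "PHB",
    pynvml.NVML_TOPOLOGY_NODE: "NODE",
    pynvml.NVML_TOPOLOGY_SYSTEM: "SYS",
}

for i in range(n):
    row = []
    for j in range(n):
        if i == j:
            row.append("X")
        else:
            level = pynvml.nvmlDeviceGetTopologyCommonAncestor(handles[i], handles[j])
            row.append(LEVELS.get(level, "?"))
    print(f"GPU{i}: " + " ".join(row))

pynvml.nvmlShutdown()
```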
I discussed this with @lzhangzz, and this behavior is known and expected: the communication overhead is relatively high at TP 8, so TP 4 is recommended for Qwen 2 72B Instruct.
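To make the "communication overhead is relatively high at TP 8" point concrete, the all-reduce cost at both group sizes can be measured directly. A minimal sketch, assuming PyTorch with the NCCL backend; the tensor shape approximates an fp16 activation buffer for a 72B model (hidden size 8192), and the script name in the launch commands is illustrative:

```python
import os
import time

import torch
import torch.distributed as dist

# Launch once per TP degree (script name illustrative):
#   torchrun --nproc_per_node=4 allreduce_bench.py
#   torchrun --nproc_per_node=8 allreduce_bench.py
dist.init_process_group("nccl")
local_rank = int(os.environ["LOCAL_RANK"])
torch.cuda.set_device(local_rank)

# 8192 x 8192 fp16 = 128 MiB, roughly one hidden-state-sized buffer.
x = torch.randn(8192, 8192, dtype=torch.float16, device="cuda")

for _ in range(10):  # warmup
    dist.all_reduce(x)
torch.cuda.synchronize()

iters = 100
start = time.perf_counter()
for _ in range(iters):
    dist.all_reduce(x)
torch.cuda.synchronize()
avg_ms = (time.perf_counter() - start) / iters * 1e3

if dist.get_rank() == 0:
    print(f"world_size={dist.get_world_size()} all_reduce avg: {avg_ms:.3f} ms")
dist.destroy_process_group()
```

Running once with `--nproc_per_node=4` and once with `--nproc_per_node=8` gives a rough picture of how much per-step communication time grows when all eight GPUs participate.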
Checklist
Describe the bug
In a recent ranking released by Hugging Face, Qwen 2 72B Instruct achieved a very good rank and has good support for Chinese. I used 8 A100 GPUs to benchmark the performance of Qwen 2 72B Instruct under LMDeploy, and I found that the performance at `tp 4` is much better than at `tp 8`. For benchmark details and results, please refer to https://github.com/zhyncs/sota-bench. I don't know whether I missed some key information during the run, whether there is an issue with the TP implementation in LMDeploy, or whether the communication overhead at TP 8 cancels out the expected performance gains. Could you take a look? Thanks. @lzhangzz @irexyc @lvhan028 (The launch sketch below shows how the two configurations differ.)
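For anyone reproducing the comparison, the TP degree in LMDeploy is set through the engine config. A minimal sketch, assuming LMDeploy's Python `pipeline` API; the Hugging Face model id is illustrative, and a local checkpoint path works as well:

```python
# Sketch: run the same model with TP 4 vs TP 8 by changing only `tp`.
# Model id is illustrative; adjust to your local checkpoint if needed.
from lmdeploy import pipeline, TurbomindEngineConfig

pipe = pipeline(
    "Qwen/Qwen2-72B-Instruct",
    backend_config=TurbomindEngineConfig(tp=4),  # rerun with tp=8 to compare
)
print(pipe(["Hello, how are you?"]))
```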
By the way, profiling with Nsight Systems is less convenient when a single machine has multiple GPUs than when it has just one.
Reproduction
As mentioned above.
Environment
Error traceback
No response