InternLM / lmdeploy

LMDeploy is a toolkit for compressing, deploying, and serving LLMs.
https://lmdeploy.readthedocs.io/en/latest/
Apache License 2.0

[Bug] Inference slows down after quantizing the Internvl2-8B model #2616

Open guozhiyao opened 1 month ago

guozhiyao commented 1 month ago


Describe the bug

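I used lmdeploy to quantize the Internvl2-8B model to INT4 with AWQ, then ran the same query through the model before and after quantization. Before quantization: generate_token_len=155, inference time 4 s. After quantization: generate_token_len=134, inference time 29 s, about 7 times slower. Is this normal?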
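The two runs generated different numbers of tokens, so the wall-clock times are not directly comparable. A small worked check that normalizes by token count, using only the figures reported above (and, like the original measurement, counting image preprocessing and prefill as part of the total):

```python
def per_token_stats(tokens: int, seconds: float) -> tuple[float, float]:
    """Return (tokens per second, milliseconds per token) for one run."""
    return tokens / seconds, seconds * 1000 / tokens


# Figures reported in this issue (end-to-end time, prefill included)
fp16_tps, fp16_ms = per_token_stats(155, 4.0)   # original model
awq_tps, awq_ms = per_token_stats(134, 29.0)    # AWQ INT4 model

print(f"original: {fp16_tps:.1f} tok/s, {fp16_ms:.1f} ms/token")
print(f"AWQ INT4: {awq_tps:.1f} tok/s, {awq_ms:.1f} ms/token")
print(f"per-token slowdown: {awq_ms / fp16_ms:.1f}x")
```

With these figures the per-token latency goes from about 25.8 ms to about 216.4 ms, roughly an 8.4x slowdown per token, somewhat worse than the 7x wall-clock ratio.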

Reproduction

from lmdeploy import pipeline, TurbomindEngineConfig, GenerationConfig
import time
import os

gen_config = GenerationConfig(
    max_new_tokens=1024,
    top_p=1.0,
    temperature=0.2,
    top_k=5,
    repetition_penalty=1.0
)

context_len = 4096

# AWQ INT4 model (the run that took ~29 s):
# model_path = "internvl2-8b/v0-20241010-111326/checkpoint-5600-merged-awq-int4/"
# backend_config = TurbomindEngineConfig(model_format='awq', session_len=context_len)

# Original, unquantized model (the run that took ~4 s):
model_path = "internvl2-8b/v0-20241010-111326/checkpoint-5600-merged/"
backend_config = TurbomindEngineConfig(session_len=context_len)

pipe = pipeline(model_path, backend_config=backend_config)

# Placeholder: the image URL used in the original report was not given
url = "https://example.com/image.jpg"

prompt = [
    {
        'role': 'user',
        'content': [
            {'type': 'text', 'text': '描述图片内容'},  # "Describe the image content"
            {'type': 'image_url', 'image_url': {'url': url}}
        ]
    }
]

start_time = time.time()
answer = ""
for item in pipe.stream_infer(prompt, gen_config=gen_config):
    answer += item.text
    print(answer)                   # prints the whole answer so far on every step
    print(item.generate_token_len)  # token count reported with this chunk

print(time.time() - start_time)     # end-to-end latency in seconds
os.system("nvidia-smi")             # check GPU memory usage
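The script above folds several costs into one number: image fetching and preprocessing, the vision encoder and prefill, token generation, and printing the growing answer at every step. A hedged sketch for splitting that number into time-to-first-chunk and the remaining streaming time; it assumes only that the stream is an ordinary Python iterator and does not rely on any lmdeploy-specific API (`time_stream` is a hypothetical helper name):

```python
import time
from typing import Any, Iterable


def time_stream(stream: Iterable[Any]) -> dict:
    """Consume a streaming iterator and split its wall-clock time into
    time-to-first-chunk and the time spent on the remaining chunks."""
    start = time.perf_counter()
    first = None   # seconds until the first chunk arrived
    chunks = 0
    last = None    # last chunk seen (e.g. the final streamed response)
    for item in stream:
        if first is None:
            first = time.perf_counter() - start
        chunks += 1
        last = item
    total = time.perf_counter() - start
    return {
        "first_chunk_s": first,
        "total_s": total,
        "after_first_s": None if first is None else total - first,
        "chunks": chunks,
        "last": last,
    }
```

Usage, with the `pipe`, `prompt` and `gen_config` objects from the reproduction script: `stats = time_stream(pipe.stream_infer(prompt, gen_config=gen_config))`. `stats["last"].generate_token_len` is the same field the script prints, and leaving out the per-step `print` calls keeps console I/O out of the measurement.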

Environment

sys.platform: linux
Python: 3.10.13 (main, Sep 11 2023, 13:44:35) [GCC 11.2.0]
CUDA available: True
MUSA available: False
numpy_random_seed: 2147483648
GPU 0: NVIDIA H20
CUDA_HOME: /usr/local/cuda
NVCC: Cuda compilation tools, release 12.4, V12.4.99
GCC: gcc (GCC) 10.2.1 20200825 (Alibaba 10.2.1-3 2.17)
PyTorch: 2.4.0
PyTorch compiling details: PyTorch built with:
  - GCC 10.2
  - C++ Version: 201703
  - Intel(R) MKL-DNN v3.4.2 (Git Hash 1137e04ec0b5251ca2b4400a4fd3c667ce843d67)
  - OpenMP 201511 (a.k.a. OpenMP 4.5)
  - NNPACK is enabled
  - CPU capability usage: AVX512
  - CUDA Runtime 12.4
  - NVCC architecture flags: -gencode;arch=compute_50,code=sm_50;-gencode;arch=compute_70,code=sm_70;-gencode;arch=compute_80,code=sm_80;-gencode;arch=compute_86,code=sm_86;-gencode;arch=compute_89,code=sm_89;-gencode;arch=compute_90,code=sm_90;-gencode;arch=compute_90,code=compute_90
  - CuDNN 90.0  (built against CUDA 12.3)
  - Build settings: BUILD_TYPE=Release, CUDA_VERSION=12.4, CUDNN_VERSION=9.0.0, CXX_COMPILER=/bin/c++, CXX_FLAGS=-D_GLIBCXX_USE_CXX11_ABI=0 -std=c++17 -fno-tree-vectorize -D_GLIBCXX_USE_CXX11_ABI=0 -fabi-version=11 -fvisibility-inlines-hidden -DUSE_PTHREADPOOL -DNDEBUG -DUSE_KINETO -DLIBKINETO_NOROCTRACER -DUSE_FBGEMM -DUSE_PYTORCH_QNNPACK -DUSE_XNNPACK -DSYMBOLICATE_MOBILE_DEBUG_HANDLE -O2 -fPIC -Wall -Wextra -Werror=return-type -Werror=non-virtual-dtor -Werror=bool-operation -Wnarrowing -Wno-missing-field-initializers -Wno-type-limits -Wno-array-bounds -Wno-unknown-pragmas -Wno-unused-parameter -Wno-unused-function -Wno-unused-result -Wno-strict-overflow -Wno-strict-aliasing -Wno-stringop-overflow -Wsuggest-override -Wno-psabi -Wno-error=pedantic -Wno-error=old-style-cast -Wno-missing-braces -fdiagnostics-color=always -faligned-new -Wno-unused-but-set-variable -Wno-maybe-uninitialized -fno-math-errno -fno-trapping-math -Werror=format -Wno-stringop-overflow, PERF_WITH_AVX=1, PERF_WITH_AVX2=1, PERF_WITH_AVX512=1, TORCH_VERSION=2.4.0, USE_CUDA=ON, USE_CUDNN=1, USE_CUSPARSELT=OFF, USE_EIGEN_FOR_BLAS=ON, USE_EXCEPTION_PTR=1, USE_GFLAGS=OFF, USE_GLOG=OFF, USE_GLOO=ON, USE_MKL=OFF, USE_MKLDNN=ON, USE_MPI=OFF, USE_NCCL=ON, USE_NNPACK=ON, USE_OPENMP=ON, USE_ROCM=OFF, USE_ROCM_KERNEL_ASSERT=OFF, 

TorchVision: 0.19.0a0+48b1edf
LMDeploy: 0.6.1+
transformers: 4.45.2
gradio: 5.0.1
fastapi: 0.115.0
pydantic: 2.9.2
triton: 3.0.0
NVIDIA Topology: 
        GPU0    NIC0    NIC1    NIC2    NIC3    NIC4    NIC5    NIC6    NIC7    NIC8    NIC9    NIC10   NIC11   NIC12   NIC13   NIC14   NIC15   NIC16   NIC17   NIC18   NIC19   NIC20   NIC21   NIC22   NIC23   NIC24   NIC25   NIC26      NIC27   NIC28   NIC29   NIC30   NIC31   NIC32   NIC33   NIC34   NIC35   CPU Affinity    NUMA Affinity   GPU NUMA ID
GPU0     X      SYS     SYS     SYS     SYS     SYS     SYS     SYS     SYS     PIX     PIX     PIX     PIX     PIX     PIX     PIX     PIX     SYS     SYS     SYS     SYS     SYS     SYS     SYS     SYS     SYS     SYS     SYSSYS     SYS     SYS     SYS     SYS     SYS     SYS     SYS     PIX                             N/A
NIC0    SYS      X      PIX     PIX     PIX     PIX     PIX     PIX     PIX     SYS     SYS     SYS     SYS     SYS     SYS     SYS     SYS     SYS     SYS     SYS     SYS     SYS     SYS     SYS     SYS     SYS     SYS     SYSSYS     SYS     SYS     SYS     SYS     SYS     SYS     PIX     SYS
NIC1    SYS     PIX      X      PIX     PIX     PIX     PIX     PIX     PIX     SYS     SYS     SYS     SYS     SYS     SYS     SYS     SYS     SYS     SYS     SYS     SYS     SYS     SYS     SYS     SYS     SYS     SYS     SYSSYS     SYS     SYS     SYS     SYS     SYS     SYS     PIX     SYS
NIC2    SYS     PIX     PIX      X      PIX     PIX     PIX     PIX     PIX     SYS     SYS     SYS     SYS     SYS     SYS     SYS     SYS     SYS     SYS     SYS     SYS     SYS     SYS     SYS     SYS     SYS     SYS     SYSSYS     SYS     SYS     SYS     SYS     SYS     SYS     PIX     SYS
NIC3    SYS     PIX     PIX     PIX      X      PIX     PIX     PIX     PIX     SYS     SYS     SYS     SYS     SYS     SYS     SYS     SYS     SYS     SYS     SYS     SYS     SYS     SYS     SYS     SYS     SYS     SYS     SYSSYS     SYS     SYS     SYS     SYS     SYS     SYS     PIX     SYS
NIC4    SYS     PIX     PIX     PIX     PIX      X      PIX     PIX     PIX     SYS     SYS     SYS     SYS     SYS     SYS     SYS     SYS     SYS     SYS     SYS     SYS     SYS     SYS     SYS     SYS     SYS     SYS     SYSSYS     SYS     SYS     SYS     SYS     SYS     SYS     PIX     SYS
NIC5    SYS     PIX     PIX     PIX     PIX     PIX      X      PIX     PIX     SYS     SYS     SYS     SYS     SYS     SYS     SYS     SYS     SYS     SYS     SYS     SYS     SYS     SYS     SYS     SYS     SYS     SYS     SYSSYS     SYS     SYS     SYS     SYS     SYS     SYS     PIX     SYS
NIC6    SYS     PIX     PIX     PIX     PIX     PIX     PIX      X      PIX     SYS     SYS     SYS     SYS     SYS     SYS     SYS     SYS     SYS     SYS     SYS     SYS     SYS     SYS     SYS     SYS     SYS     SYS     SYSSYS     SYS     SYS     SYS     SYS     SYS     SYS     PIX     SYS
NIC7    SYS     PIX     PIX     PIX     PIX     PIX     PIX     PIX      X      SYS     SYS     SYS     SYS     SYS     SYS     SYS     SYS     SYS     SYS     SYS     SYS     SYS     SYS     SYS     SYS     SYS     SYS     SYSSYS     SYS     SYS     SYS     SYS     SYS     SYS     PIX     SYS
NIC8    PIX     SYS     SYS     SYS     SYS     SYS     SYS     SYS     SYS      X      PIX     PIX     PIX     PIX     PIX     PIX     PIX     SYS     SYS     SYS     SYS     SYS     SYS     SYS     SYS     SYS     SYS     SYSSYS     SYS     SYS     SYS     SYS     SYS     SYS     SYS     PIX
NIC9    PIX     SYS     SYS     SYS     SYS     SYS     SYS     SYS     SYS     PIX      X      PIX     PIX     PIX     PIX     PIX     PIX     SYS     SYS     SYS     SYS     SYS     SYS     SYS     SYS     SYS     SYS     SYSSYS     SYS     SYS     SYS     SYS     SYS     SYS     SYS     PIX
NIC10   PIX     SYS     SYS     SYS     SYS     SYS     SYS     SYS     SYS     PIX     PIX      X      PIX     PIX     PIX     PIX     PIX     SYS     SYS     SYS     SYS     SYS     SYS     SYS     SYS     SYS     SYS     SYSSYS     SYS     SYS     SYS     SYS     SYS     SYS     SYS     PIX
NIC11   PIX     SYS     SYS     SYS     SYS     SYS     SYS     SYS     SYS     PIX     PIX     PIX      X      PIX     PIX     PIX     PIX     SYS     SYS     SYS     SYS     SYS     SYS     SYS     SYS     SYS     SYS     SYSSYS     SYS     SYS     SYS     SYS     SYS     SYS     SYS     PIX
NIC12   PIX     SYS     SYS     SYS     SYS     SYS     SYS     SYS     SYS     PIX     PIX     PIX     PIX      X      PIX     PIX     PIX     SYS     SYS     SYS     SYS     SYS     SYS     SYS     SYS     SYS     SYS     SYSSYS     SYS     SYS     SYS     SYS     SYS     SYS     SYS     PIX
NIC13   PIX     SYS     SYS     SYS     SYS     SYS     SYS     SYS     SYS     PIX     PIX     PIX     PIX     PIX      X      PIX     PIX     SYS     SYS     SYS     SYS     SYS     SYS     SYS     SYS     SYS     SYS     SYSSYS     SYS     SYS     SYS     SYS     SYS     SYS     SYS     PIX
NIC14   PIX     SYS     SYS     SYS     SYS     SYS     SYS     SYS     SYS     PIX     PIX     PIX     PIX     PIX     PIX      X      PIX     SYS     SYS     SYS     SYS     SYS     SYS     SYS     SYS     SYS     SYS     SYSSYS     SYS     SYS     SYS     SYS     SYS     SYS     SYS     PIX
NIC15   PIX     SYS     SYS     SYS     SYS     SYS     SYS     SYS     SYS     PIX     PIX     PIX     PIX     PIX     PIX     PIX      X      SYS     SYS     SYS     SYS     SYS     SYS     SYS     SYS     SYS     SYS     SYSSYS     SYS     SYS     SYS     SYS     SYS     SYS     SYS     PIX
NIC16   SYS     SYS     SYS     SYS     SYS     SYS     SYS     SYS     SYS     SYS     SYS     SYS     SYS     SYS     SYS     SYS     SYS      X      PIX     PIX     PIX     PIX     PIX     PIX     PIX     SYS     SYS     SYSSYS     SYS     SYS     SYS     SYS     PIX     SYS     SYS     SYS
NIC17   SYS     SYS     SYS     SYS     SYS     SYS     SYS     SYS     SYS     SYS     SYS     SYS     SYS     SYS     SYS     SYS     SYS     PIX      X      PIX     PIX     PIX     PIX     PIX     PIX     SYS     SYS     SYSSYS     SYS     SYS     SYS     SYS     PIX     SYS     SYS     SYS
NIC18   SYS     SYS     SYS     SYS     SYS     SYS     SYS     SYS     SYS     SYS     SYS     SYS     SYS     SYS     SYS     SYS     SYS     PIX     PIX      X      PIX     PIX     PIX     PIX     PIX     SYS     SYS     SYSSYS     SYS     SYS     SYS     SYS     PIX     SYS     SYS     SYS
NIC19   SYS     SYS     SYS     SYS     SYS     SYS     SYS     SYS     SYS     SYS     SYS     SYS     SYS     SYS     SYS     SYS     SYS     PIX     PIX     PIX      X      PIX     PIX     PIX     PIX     SYS     SYS     SYSSYS     SYS     SYS     SYS     SYS     PIX     SYS     SYS     SYS
NIC20   SYS     SYS     SYS     SYS     SYS     SYS     SYS     SYS     SYS     SYS     SYS     SYS     SYS     SYS     SYS     SYS     SYS     PIX     PIX     PIX     PIX      X      PIX     PIX     PIX     SYS     SYS     SYSSYS     SYS     SYS     SYS     SYS     PIX     SYS     SYS     SYS
NIC21   SYS     SYS     SYS     SYS     SYS     SYS     SYS     SYS     SYS     SYS     SYS     SYS     SYS     SYS     SYS     SYS     SYS     PIX     PIX     PIX     PIX     PIX      X      PIX     PIX     SYS     SYS     SYSSYS     SYS     SYS     SYS     SYS     PIX     SYS     SYS     SYS
NIC22   SYS     SYS     SYS     SYS     SYS     SYS     SYS     SYS     SYS     SYS     SYS     SYS     SYS     SYS     SYS     SYS     SYS     PIX     PIX     PIX     PIX     PIX     PIX      X      PIX     SYS     SYS     SYSSYS     SYS     SYS     SYS     SYS     PIX     SYS     SYS     SYS
NIC23   SYS     SYS     SYS     SYS     SYS     SYS     SYS     SYS     SYS     SYS     SYS     SYS     SYS     SYS     SYS     SYS     SYS     PIX     PIX     PIX     PIX     PIX     PIX     PIX      X      SYS     SYS     SYSSYS     SYS     SYS     SYS     SYS     PIX     SYS     SYS     SYS
NIC24   SYS     SYS     SYS     SYS     SYS     SYS     SYS     SYS     SYS     SYS     SYS     SYS     SYS     SYS     SYS     SYS     SYS     SYS     SYS     SYS     SYS     SYS     SYS     SYS     SYS      X      PIX     PIXPIX     PIX     PIX     PIX     PIX     SYS     PIX     SYS     SYS
NIC25   SYS     SYS     SYS     SYS     SYS     SYS     SYS     SYS     SYS     SYS     SYS     SYS     SYS     SYS     SYS     SYS     SYS     SYS     SYS     SYS     SYS     SYS     SYS     SYS     SYS     PIX      X      PIXPIX     PIX     PIX     PIX     PIX     SYS     PIX     SYS     SYS
NIC26   SYS     SYS     SYS     SYS     SYS     SYS     SYS     SYS     SYS     SYS     SYS     SYS     SYS     SYS     SYS     SYS     SYS     SYS     SYS     SYS     SYS     SYS     SYS     SYS     SYS     PIX     PIX      X PIX     PIX     PIX     PIX     PIX     SYS     PIX     SYS     SYS
NIC27   SYS     SYS     SYS     SYS     SYS     SYS     SYS     SYS     SYS     SYS     SYS     SYS     SYS     SYS     SYS     SYS     SYS     SYS     SYS     SYS     SYS     SYS     SYS     SYS     SYS     PIX     PIX     PIX X      PIX     PIX     PIX     PIX     SYS     PIX     SYS     SYS
NIC28   SYS     SYS     SYS     SYS     SYS     SYS     SYS     SYS     SYS     SYS     SYS     SYS     SYS     SYS     SYS     SYS     SYS     SYS     SYS     SYS     SYS     SYS     SYS     SYS     SYS     PIX     PIX     PIXPIX      X      PIX     PIX     PIX     SYS     PIX     SYS     SYS
NIC29   SYS     SYS     SYS     SYS     SYS     SYS     SYS     SYS     SYS     SYS     SYS     SYS     SYS     SYS     SYS     SYS     SYS     SYS     SYS     SYS     SYS     SYS     SYS     SYS     SYS     PIX     PIX     PIXPIX     PIX      X      PIX     PIX     SYS     PIX     SYS     SYS
NIC30   SYS     SYS     SYS     SYS     SYS     SYS     SYS     SYS     SYS     SYS     SYS     SYS     SYS     SYS     SYS     SYS     SYS     SYS     SYS     SYS     SYS     SYS     SYS     SYS     SYS     PIX     PIX     PIXPIX     PIX     PIX      X      PIX     SYS     PIX     SYS     SYS
NIC31   SYS     SYS     SYS     SYS     SYS     SYS     SYS     SYS     SYS     SYS     SYS     SYS     SYS     SYS     SYS     SYS     SYS     SYS     SYS     SYS     SYS     SYS     SYS     SYS     SYS     PIX     PIX     PIXPIX     PIX     PIX     PIX      X      SYS     PIX     SYS     SYS
NIC32   SYS     SYS     SYS     SYS     SYS     SYS     SYS     SYS     SYS     SYS     SYS     SYS     SYS     SYS     SYS     SYS     SYS     PIX     PIX     PIX     PIX     PIX     PIX     PIX     PIX     SYS     SYS     SYSSYS     SYS     SYS     SYS     SYS      X      SYS     SYS     SYS
NIC33   SYS     SYS     SYS     SYS     SYS     SYS     SYS     SYS     SYS     SYS     SYS     SYS     SYS     SYS     SYS     SYS     SYS     SYS     SYS     SYS     SYS     SYS     SYS     SYS     SYS     PIX     PIX     PIXPIX     PIX     PIX     PIX     PIX     SYS      X      SYS     SYS
NIC34   SYS     PIX     PIX     PIX     PIX     PIX     PIX     PIX     PIX     SYS     SYS     SYS     SYS     SYS     SYS     SYS     SYS     SYS     SYS     SYS     SYS     SYS     SYS     SYS     SYS     SYS     SYS     SYSSYS     SYS     SYS     SYS     SYS     SYS     SYS      X      SYS
NIC35   PIX     SYS     SYS     SYS     SYS     SYS     SYS     SYS     SYS     PIX     PIX     PIX     PIX     PIX     PIX     PIX     PIX     SYS     SYS     SYS     SYS     SYS     SYS     SYS     SYS     SYS     SYS     SYSSYS     SYS     SYS     SYS     SYS     SYS     SYS     SYS      X 

Legend:

  X    = Self
  SYS  = Connection traversing PCIe as well as the SMP interconnect between NUMA nodes (e.g., QPI/UPI)
  NODE = Connection traversing PCIe as well as the interconnect between PCIe Host Bridges within a NUMA node
  PHB  = Connection traversing PCIe as well as a PCIe Host Bridge (typically the CPU)
  PXB  = Connection traversing multiple PCIe bridges (without traversing the PCIe Host Bridge)
  PIX  = Connection traversing at most a single PCIe bridge
  NV#  = Connection traversing a bonded set of # NVLinks

NIC Legend:

  NIC0: mlx5_0
  NIC1: mlx5_1
  NIC2: mlx5_2
  NIC3: mlx5_3
  NIC4: mlx5_4
  NIC5: mlx5_5
  NIC6: mlx5_6
  NIC7: mlx5_7
  NIC8: mlx5_8
  NIC9: mlx5_9
  NIC10: mlx5_10
  NIC11: mlx5_11
  NIC12: mlx5_12
  NIC13: mlx5_13
  NIC14: mlx5_14
  NIC15: mlx5_15
  NIC16: mlx5_16
  NIC17: mlx5_17
  NIC18: mlx5_18
  NIC19: mlx5_19
  NIC20: mlx5_20
  NIC21: mlx5_21
  NIC22: mlx5_22
  NIC23: mlx5_23
  NIC24: mlx5_24
  NIC25: mlx5_25
  NIC26: mlx5_26
  NIC27: mlx5_27
  NIC28: mlx5_28
  NIC29: mlx5_29
  NIC30: mlx5_30
  NIC31: mlx5_31
  NIC32: mlx5_bond_0
  NIC33: mlx5_bond_1
  NIC34: mlx5_bond_2
  NIC35: mlx5_bond_3

Error traceback

No response

lvhan028 commented 1 month ago

Don't run just a single request; the INT4 GEMM goes through a tuning process. Try testing with a few thousand requests.
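A minimal harness along the lines suggested above, assuming a zero-argument callable that sends one request (`benchmark` and `run_once` are hypothetical names, not lmdeploy API). The first `warmup` calls are run but not timed, so one-off costs such as kernel tuning do not dominate the average; the counts in the usage line are illustrative, not values recommended by the maintainers:

```python
import statistics
import time
from typing import Callable


def benchmark(run_once: Callable[[], object], warmup: int, iters: int) -> dict:
    """Call run_once `warmup` times untimed, then `iters` times timed.

    Returns the mean and median latency of the timed calls, in seconds."""
    if iters < 1:
        raise ValueError("iters must be >= 1")
    for _ in range(warmup):
        run_once()  # warm-up calls: results and timings are discarded
    samples = []
    for _ in range(iters):
        start = time.perf_counter()
        run_once()
        samples.append(time.perf_counter() - start)
    return {
        "mean_s": statistics.mean(samples),
        "median_s": statistics.median(samples),
        "samples": samples,
    }
```

For example, `benchmark(lambda: pipe(prompt, gen_config=gen_config), warmup=20, iters=1000)`, reusing the objects from the reproduction script. Reporting the median alongside the mean makes a few slow outlier requests easier to spot.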

guozhiyao commented 1 month ago


@lvhan028 I ran 10 requests on the non-quantized model: about 2 s per request on average, with 77 GB of GPU memory in use.


On the quantized model I have now run 1000 requests, and it still averages 22 s per request; GPU memory usage is also 77 GB.
