InternLM / lmdeploy

LMDeploy is a toolkit for compressing, deploying, and serving LLMs.
https://lmdeploy.readthedocs.io/en/latest/
Apache License 2.0

[Bug] TurboMind backend GPU memory usage doubles #1758

Closed QwertyJack closed 3 months ago

QwertyJack commented 3 months ago

Describe the bug

Running MiniCPM-Llama3-V-2_5 with the latest main branch, both the original model and the AWQ 4-bit quantized version show roughly doubled GPU memory usage.

Reproduction

Run the latest main branch with MiniCPM-Llama3-V-2_5 or its AWQ 4-bit quantized version; both show the doubled GPU memory usage:

# quant the model
$ lmdeploy lite auto_awq /data/models/MiniCPM-Llama3-V-2_5 --work-dir /data/models/MiniCPM-Llama3-V-2_5-awq

# run the server
$ lmdeploy serve api_server /data/models/MiniCPM-Llama3-V-2_5-awq --cache-max-entry-count 0.001 --model-format awq
...
INFO:     Started server process [2525483]
INFO:     Waiting for application startup.
INFO:     Application startup complete.
INFO:     Uvicorn running on http://0.0.0.0:23333 (Press CTRL+C to quit)

# check model size and GPU memory usage
$ du -sh /data/model/MiniCPM-Llama3-V-2_5-awq
6.4G    /data/model/MiniCPM-Llama3-V-2_5-awq

$ nvidia-smi
Tue Jun 11 13:23:44 2024
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 550.54.15              Driver Version: 550.54.15      CUDA Version: 12.4     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  Tesla T4                       Off |   00000000:87:00.0 Off |                    0 |
| N/A   65C    P0             31W /   70W |   13590MiB /  15360MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+

+-----------------------------------------------------------------------------------------+
| Processes:                                                                              |
|  GPU   GI   CI        PID   Type   Process name                              GPU Memory |
|        ID   ID                                                               Usage      |
|=========================================================================================|
|    0   N/A  N/A   2541817      C   ...conda/envs/lmdeploy-test/bin/python      13588MiB |
+-----------------------------------------------------------------------------------------+

Inference itself works fine, but the 6.4 GB AWQ checkpoint ends up occupying ~13.6 GiB of GPU memory, roughly double the weight size, even with --cache-max-entry-count 0.001.
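For reference, this is how I exercise the server; a minimal sketch assuming the OpenAI-compatible /v1 routes that api_server exposes (the prompt and max_tokens are arbitrary):

    import requests

    base = "http://0.0.0.0:23333"
    # api_server serves OpenAI-compatible routes; fetch the served model name first
    model = requests.get(f"{base}/v1/models").json()["data"][0]["id"]
    resp = requests.post(
        f"{base}/v1/chat/completions",
        json={
            "model": model,
            "messages": [{"role": "user", "content": "Say hello in one sentence."}],
            "max_tokens": 32,
        },
    )
    print(resp.json()["choices"][0]["message"]["content"])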

Environment

$ lmdeploy check_env
sys.platform: linux
Python: 3.11.9 (main, Apr 19 2024, 16:48:06) [GCC 11.2.0]
CUDA available: True
MUSA available: False
numpy_random_seed: 2147483648
GPU 0: Tesla T4
CUDA_HOME: /usr/local/cuda
NVCC: Cuda compilation tools, release 12.4, V12.4.131
GCC: gcc (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0
PyTorch: 2.2.2+cu121
PyTorch compiling details: PyTorch built with:
  - GCC 9.3
  - C++ Version: 201703
  - Intel(R) oneAPI Math Kernel Library Version 2022.2-Product Build 20220804 for Intel(R) 64 architecture applications
  - Intel(R) MKL-DNN v3.3.2 (Git Hash 2dc95a2ad0841e29db8b22fbccaf3e5da7992b01)
  - OpenMP 201511 (a.k.a. OpenMP 4.5)
  - LAPACK is enabled (usually provided by MKL)
  - NNPACK is enabled
  - CPU capability usage: AVX512
  - CUDA Runtime 12.1
  - NVCC architecture flags: -gencode;arch=compute_50,code=sm_50;-gencode;arch=compute_60,code=sm_60;-gencode;arch=compute_70,code=sm_70;-gencode;arch=compute_75,code=sm_75;-gencode;arch=compute_80,code=sm_80;-gencode;arch=compute_86,code=sm_86;-gencode;arch=compute_90,code=sm_90
  - CuDNN 8.9.2
  - Magma 2.6.1
  - Build settings: BLAS_INFO=mkl, BUILD_TYPE=Release, CUDA_VERSION=12.1, CUDNN_VERSION=8.9.2, CXX_COMPILER=/opt/rh/devtoolset-9/root/usr/bin/c++, CXX_FLAGS= -D_GLIBCXX_USE_CXX11_ABI=0 -fabi-version=11 -fvisibility-inlines-hidden -DUSE_PTHREADPOOL -DNDEBUG -DUSE_KINETO -DLIBKINETO_NOROCTRACER -DUSE_FBGEMM -DUSE_QNNPACK -DUSE_PYTORCH_QNNPACK -DUSE_XNNPACK -DSYMBOLICATE_MOBILE_DEBUG_HANDLE -O2 -fPIC -Wall -Wextra -Werror=return-type -Werror=non-virtual-dtor -Werror=bool-operation -Wnarrowing -Wno-missing-field-initializers -Wno-type-limits -Wno-array-bounds -Wno-unknown-pragmas -Wno-unused-parameter -Wno-unused-function -Wno-unused-result -Wno-strict-overflow -Wno-strict-aliasing -Wno-stringop-overflow -Wsuggest-override -Wno-psabi -Wno-error=pedantic -Wno-error=old-style-cast -Wno-missing-braces -fdiagnostics-color=always -faligned-new -Wno-unused-but-set-variable -Wno-maybe-uninitialized -fno-math-errno -fno-trapping-math -Werror=format -Wno-stringop-overflow, LAPACK_INFO=mkl, PERF_WITH_AVX=1, PERF_WITH_AVX2=1, PERF_WITH_AVX512=1, TORCH_VERSION=2.2.2, USE_CUDA=ON, USE_CUDNN=ON, USE_EXCEPTION_PTR=1, USE_GFLAGS=OFF, USE_GLOG=OFF, USE_MKL=ON, USE_MKLDNN=ON, USE_MPI=OFF, USE_NCCL=1, USE_NNPACK=ON, USE_OPENMP=ON, USE_ROCM=OFF, USE_ROCM_KERNEL_ASSERT=OFF,

TorchVision: 0.17.2+cu121
LMDeploy: 0.4.2+d25b5c6
transformers: 4.41.1
gradio: Not Found
fastapi: 0.111.0
pydantic: 2.7.1
triton: 2.2.0

Error traceback

No response

lzhangzz commented 3 months ago

By default, 80% of the GPU memory that remains free after the model weights are loaded is reserved for the KV cache. You can adjust this setting according to the documentation.
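For reference, the same knob is also available from the Python API; a minimal sketch, assuming the TurbomindEngineConfig interface described in the LMDeploy docs:

    from lmdeploy import pipeline, TurbomindEngineConfig

    # reserve only 20% of the free GPU memory (after weights are loaded) for the KV cache
    backend_config = TurbomindEngineConfig(model_format='awq',
                                           cache_max_entry_count=0.2)
    pipe = pipeline('/data/models/MiniCPM-Llama3-V-2_5-awq',
                    backend_config=backend_config)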

QwertyJack commented 3 months ago

I already specified --cache-max-entry-count 0.001.

irexyc commented 3 months ago

Try adding the following here (it requires import dataclasses):

    # reset every field to None so a later update() cannot overwrite
    # user-supplied values (e.g. cache_max_entry_count) with dataclass defaults
    for field in dataclasses.fields(TurbomindModelConfig):
        setattr(config, field.name, None)

The cause should be that only model_arch, session_len, weight_type and group_size in this cfg are actually valid; when it is used to update the engine config, the default cache-max-entry-count of 0.8 overwrites the 0.001 that was passed in.
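To make the failure mode concrete, here is an illustrative, self-contained sketch; the field names come from this thread, but the update() helper and the values are assumptions, not the actual lmdeploy source:

    import dataclasses
    from typing import Optional

    @dataclasses.dataclass
    class TurbomindModelConfig:
        model_arch: Optional[str] = None
        session_len: Optional[int] = None
        weight_type: Optional[str] = None
        group_size: int = 0
        cache_max_entry_count: float = 0.8  # dataclass default

    def update(dst, src):
        # copy every non-None field of src onto dst
        for f in dataclasses.fields(dst):
            value = getattr(src, f.name)
            if value is not None:
                setattr(dst, f.name, value)

    # config carrying the user's --cache-max-entry-count 0.001
    engine_cfg = TurbomindModelConfig(cache_max_entry_count=0.001)

    # config derived while loading the model: only four fields are meaningful,
    # but cache_max_entry_count still sits at its 0.8 default (values illustrative)
    model_cfg = TurbomindModelConfig(model_arch='MiniCPMV', session_len=8192,
                                     weight_type='int4', group_size=128)

    update(engine_cfg, model_cfg)
    print(engine_cfg.cache_max_entry_count)  # 0.8 -- the user's 0.001 is gone

    # with the suggested fix (reset all fields to None before filling in the four
    # valid ones), the 0.8 default is never copied and 0.001 survives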

If it's convenient, please open a PR to fix this.

QwertyJack commented 3 months ago

After adding the setattr change and testing on MiniCPM-V-2.5, GPU memory usage drops noticeably, as expected.

QwertyJack commented 3 months ago

Fixed by #1778. Closing now. Many thanks to @irexyc