InternLM / lmdeploy

LMDeploy is a toolkit for compressing, deploying, and serving LLMs.
https://lmdeploy.readthedocs.io/en/latest/
Apache License 2.0

[Bug] Qwen2-VL uses too much GPU memory, causing OOM #2565

Open cmpute opened 4 days ago

cmpute commented 4 days ago

Describe the bug

In theory, Qwen2-VL 7B should fit within 80 GB of GPU memory, but in actual deployment, inference runs out of memory.

Reproduction

lmdeploy serve api_server ../Qwen2-VL-7B-Instruct --server-port 12345
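A minimal sketch of one possible mitigation (not part of the original report): lmdeploy's api_server exposes --cache-max-entry-count, the fraction of free GPU memory reserved for the KV cache (default 0.8). Lowering it leaves more headroom for the vision encoder's activations, though whether that alone avoids this particular OOM is untested here:

lmdeploy serve api_server ../Qwen2-VL-7B-Instruct --server-port 12345 --cache-max-entry-count 0.5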

Environment

sys.platform: linux
Python: 3.10.14 (main, May  6 2024, 19:42:50) [GCC 11.2.0]
CUDA available: True
MUSA available: False
numpy_random_seed: 2147483648
GPU 0,1,2,3,4,5,6,7: NVIDIA A100-SXM4-80GB
CUDA_HOME: /usr/local/cuda
NVCC: Cuda compilation tools, release 12.1, V12.1.105
GCC: gcc (Ubuntu 9.4.0-1ubuntu1~20.04.3) 9.4.0
PyTorch: 2.2.2+cu121
PyTorch compiling details: PyTorch built with:
  - GCC 9.3
  - C++ Version: 201703
  - Intel(R) oneAPI Math Kernel Library Version 2022.2-Product Build 20220804 for Intel(R) 64 architecture applications
  - Intel(R) MKL-DNN v3.3.2 (Git Hash 2dc95a2ad0841e29db8b22fbccaf3e5da7992b01)
  - OpenMP 201511 (a.k.a. OpenMP 4.5)
  - LAPACK is enabled (usually provided by MKL)
  - NNPACK is enabled
  - CPU capability usage: AVX2
  - CUDA Runtime 12.1
  - NVCC architecture flags: -gencode;arch=compute_50,code=sm_50;-gencode;arch=compute_60,code=sm_60;-gencode;arch=compute_70,code=sm_70;-gencode;arch=compute_75,code=sm_75;-gencode;arch=compute_80,code=sm_80;-gencode;arch=compute_86,code=sm_86;-gencode;arch=compute_90,code=sm_90
  - CuDNN 8.9.7  (built against CUDA 12.2)
    - Built with CuDNN 8.9.2
  - Magma 2.6.1
  - Build settings: BLAS_INFO=mkl, BUILD_TYPE=Release, CUDA_VERSION=12.1, CUDNN_VERSION=8.9.2, CXX_COMPILER=/opt/rh/devtoolset-9/root/usr/bin/c++, CXX_FLAGS= -D_GLIBCXX_USE_CXX11_ABI=0 -fabi-version=11 -fvisibility-inlines-hidden -DUSE_PTHREADPOOL -DNDEBUG -DUSE_KINETO -DLIBKINETO_NOROCTRACER -DUSE_FBGEMM -DUSE_QNNPACK -DUSE_PYTORCH_QNNPACK -DUSE_XNNPACK -DSYMBOLICATE_MOBILE_DEBUG_HANDLE -O2 -fPIC -Wall -Wextra -Werror=return-type -Werror=non-virtual-dtor -Werror=bool-operation -Wnarrowing -Wno-missing-field-initializers -Wno-type-limits -Wno-array-bounds -Wno-unknown-pragmas -Wno-unused-parameter -Wno-unused-function -Wno-unused-result -Wno-strict-overflow -Wno-strict-aliasing -Wno-stringop-overflow -Wsuggest-override -Wno-psabi -Wno-error=pedantic -Wno-error=old-style-cast -Wno-missing-braces -fdiagnostics-color=always -faligned-new -Wno-unused-but-set-variable -Wno-maybe-uninitialized -fno-math-errno -fno-trapping-math -Werror=format -Wno-stringop-overflow, LAPACK_INFO=mkl, PERF_WITH_AVX=1, PERF_WITH_AVX2=1, PERF_WITH_AVX512=1, TORCH_VERSION=2.2.2, USE_CUDA=ON, USE_CUDNN=ON, USE_EXCEPTION_PTR=1, USE_GFLAGS=OFF, USE_GLOG=OFF, USE_MKL=ON, USE_MKLDNN=ON, USE_MPI=OFF, USE_NCCL=1, USE_NNPACK=ON, USE_OPENMP=ON, USE_ROCM=OFF, USE_ROCM_KERNEL_ASSERT=OFF,

TorchVision: 0.17.2+cu121
LMDeploy: 0.6.1+
transformers: 4.45.2
gradio: 4.41.0
fastapi: 0.112.1
pydantic: 2.8.2
triton: 2.2.0

Error traceback

2024-10-09 04:07:10,136 - lmdeploy - WARNING - archs.py:53 - Fallback to pytorch engine because `../Qwen2-VL-7B-Instruct` not supported by turbomind engine.
`Qwen2VLRotaryEmbedding` can now be fully parameterized by passing the model config through the `config` argument. All other arguments will be removed in v4.46
HINT:    Please open http://0.0.0.0:12345 in a browser for detailed api usage!!!
INFO:     Started server process [6550]
INFO:     Waiting for application startup.
INFO:     Application startup complete.
INFO:     Uvicorn running on http://0.0.0.0:12345 (Press CTRL+C to quit)
INFO:     127.0.0.1:35220 - "GET /v1/models HTTP/1.1" 200 OK
Exception in callback _raise_exception_on_finish(<Future finis...-variables)')>) at /home/ma-user/.conda/envs/lmdeploy/lib/python3.10/site-packages/lmdeploy/vl/engine.py:20
handle: <Handle _raise_exception_on_finish(<Future finis...-variables)')>) at /home/ma-user/.conda/envs/lmdeploy/lib/python3.10/site-packages/lmdeploy/vl/engine.py:20>
Traceback (most recent call last):
  File "/home/ma-user/.conda/envs/lmdeploy/lib/python3.10/asyncio/events.py", line 80, in _run
    self._context.run(self._callback, *self._args)
  File "/home/ma-user/.conda/envs/lmdeploy/lib/python3.10/site-packages/lmdeploy/vl/engine.py", line 27, in _raise_exception_on_finish
    raise e
  File "/home/ma-user/.conda/envs/lmdeploy/lib/python3.10/site-packages/lmdeploy/vl/engine.py", line 23, in _raise_exception_on_finish
    task.result()
  File "/home/ma-user/.conda/envs/lmdeploy/lib/python3.10/concurrent/futures/thread.py", line 58, in run
    result = self.fn(*self.args, **self.kwargs)
  File "/home/ma-user/.conda/envs/lmdeploy/lib/python3.10/site-packages/lmdeploy/vl/engine.py", line 169, in forward
    outputs = self.model.forward(*func_inputs)
  File "/home/ma-user/.conda/envs/lmdeploy/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
  File "/home/ma-user/.conda/envs/lmdeploy/lib/python3.10/site-packages/lmdeploy/vl/model/qwen2.py", line 102, in forward
    image_embeds = self.model.visual(pixel_values,
  File "/home/ma-user/.conda/envs/lmdeploy/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/home/ma-user/.conda/envs/lmdeploy/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1520, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/ma-user/.conda/envs/lmdeploy/lib/python3.10/site-packages/transformers/models/qwen2_vl/modeling_qwen2_vl.py", line 1128, in forward
    hidden_states = blk(hidden_states, cu_seqlens=cu_seqlens, rotary_pos_emb=rotary_pos_emb)
  File "/home/ma-user/.conda/envs/lmdeploy/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/home/ma-user/.conda/envs/lmdeploy/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1520, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/ma-user/.conda/envs/lmdeploy/lib/python3.10/site-packages/transformers/models/qwen2_vl/modeling_qwen2_vl.py", line 431, in forward
    hidden_states = hidden_states + self.attn(
  File "/home/ma-user/.conda/envs/lmdeploy/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/home/ma-user/.conda/envs/lmdeploy/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1520, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/ma-user/.conda/envs/lmdeploy/lib/python3.10/site-packages/transformers/models/qwen2_vl/modeling_qwen2_vl.py", line 404, in forward
    attn_output = F.scaled_dot_product_attention(q, k, v, attention_mask, dropout_p=0.0)
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 53.06 GiB. GPU 0 has a total capacity of 79.35 GiB of which 12.84 GiB is free. Process 89800 has 66.50 GiB memory in use. Of the allocated memory 65.56 GiB is allocated by PyTorch, and 421.30 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation.  See documentation for Memory Management  (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)
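The 53.06 GiB request is consistent with scaled_dot_product_attention falling back to its math path and materializing the full heads × N × N score matrix over the visual tokens. A back-of-envelope sketch, assuming fp16 scores and the 16 vision-attention heads from Qwen2-VL's published vision config (both are assumptions, not values taken from this log):

# Estimate how many visual tokens N would produce a 53.06 GiB score tensor.
# Assumed: 16 heads, 2 bytes per element (fp16); if SDPA upcasts the scores
# to fp32, halve bytes_per_elem's effect and N drops to ~29,800.
heads, bytes_per_elem = 16, 2
alloc_bytes = 53.06 * 2**30                      # figure from the traceback
n_tokens = (alloc_bytes / (heads * bytes_per_elem)) ** 0.5
print(f"~{n_tokens:,.0f} visual tokens")         # roughly 42,000 patches

Tens of thousands of visual tokens points to a very high-resolution input image, which matches the reporter's later comment.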
jianliao commented 2 days ago

@cmpute What resolution are your images? Switching to smaller images might be enough.

I checked Qwen's documentation on HF, which explicitly mentions that the model can apparently auto-resize images via two preset configuration parameters. How can I pass launch parameters like these in lmdeploy? My current launch script is as follows:

lmdeploy serve api_server --backend pytorch Qwen/Qwen2-VL-2B-Instruct
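A client-side workaround, independent of any lmdeploy flag, is to cap the pixel count before the image ever reaches the server, mirroring the resize behaviour the Qwen docs describe (presumably the min_pixels/max_pixels processor settings). The budget value and helper name below are hypothetical, a minimal sketch only:

import math
from PIL import Image

MAX_PIXELS = 1280 * 28 * 28  # hypothetical budget, in the spirit of Qwen's max_pixels

def shrink_to_budget(path: str, max_pixels: int = MAX_PIXELS) -> Image.Image:
    # Downscale so width * height <= max_pixels, preserving aspect ratio.
    img = Image.open(path)
    w, h = img.size
    if w * h <= max_pixels:
        return img
    scale = math.sqrt(max_pixels / (w * h))
    return img.resize((max(1, int(w * scale)), max(1, int(h * scale))))

Sending the downscaled image to the api_server keeps the number of visual tokens, and hence the attention memory, bounded regardless of the original resolution.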
cmpute commented 1 day ago

@cmpute What resolution are your images? Switching to smaller images might be enough.

I checked Qwen's documentation on HF, which explicitly mentions that the model can apparently auto-resize images via two preset configuration parameters. How can I pass launch parameters like these in lmdeploy? My current launch script is as follows:

lmdeploy serve api_server --backend pytorch Qwen/Qwen2-VL-2B-Instruct

The images are quite large, so that may well be it. I'll give it a try when I get a chance.