InternLM / lmdeploy

LMDeploy is a toolkit for compressing, deploying, and serving LLMs.
https://lmdeploy.readthedocs.io/en/latest/
Apache License 2.0

[Bug] Qwen2-VL uses too much GPU memory, causing OOM #2565

Open cmpute opened 4 days ago

cmpute commented 4 days ago

Describe the bug

In theory, Qwen2-VL 7B should fit within 80 GB of GPU memory, but in actual deployment, inference runs out of memory.

Reproduction

lmdeploy serve api_server ../Qwen2-VL-7B-Instruct --server-port 12345
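A minimal sketch of one possible mitigation (not part of the original report): lmdeploy's api_server exposes --cache-max-entry-count, the fraction of free GPU memory reserved for the KV cache (default 0.8). Lowering it leaves more headroom for the vision encoder's activations, though whether that alone avoids this particular OOM is untested here:

lmdeploy serve api_server ../Qwen2-VL-7B-Instruct --server-port 12345 --cache-max-entry-count 0.5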

Environment

sys.platform: linux
Python: 3.10.14 (main, May  6 2024, 19:42:50) [GCC 11.2.0]
CUDA available: True
MUSA available: False
numpy_random_seed: 2147483648
GPU 0,1,2,3,4,5,6,7: NVIDIA A100-SXM4-80GB
CUDA_HOME: /usr/local/cuda
NVCC: Cuda compilation tools, release 12.1, V12.1.105
GCC: gcc (Ubuntu 9.4.0-1ubuntu1~20.04.3) 9.4.0
PyTorch: 2.2.2+cu121
PyTorch compiling details: PyTorch built with:
  - GCC 9.3
  - C++ Version: 201703
  - Intel(R) oneAPI Math Kernel Library Version 2022.2-Product Build 20220804 for Intel(R) 64 architecture applications
  - Intel(R) MKL-DNN v3.3.2 (Git Hash 2dc95a2ad0841e29db8b22fbccaf3e5da7992b01)
  - OpenMP 201511 (a.k.a. OpenMP 4.5)
  - LAPACK is enabled (usually provided by MKL)
  - NNPACK is enabled
  - CPU capability usage: AVX2
  - CUDA Runtime 12.1
  - NVCC architecture flags: -gencode;arch=compute_50,code=sm_50;-gencode;arch=compute_60,code=sm_60;-gencode;arch=compute_70,code=sm_70;-gencode;arch=compute_75,code=sm_75;-gencode;arch=compute_80,code=sm_80;-gencode;arch=compute_86,code=sm_86;-gencode;arch=compute_90,code=sm_90
  - CuDNN 8.9.7  (built against CUDA 12.2)
    - Built with CuDNN 8.9.2
  - Magma 2.6.1
  - Build settings: BLAS_INFO=mkl, BUILD_TYPE=Release, CUDA_VERSION=12.1, CUDNN_VERSION=8.9.2, CXX_COMPILER=/opt/rh/devtoolset-9/root/usr/bin/c++, CXX_FLAGS= -D_GLIBCXX_USE_CXX11_ABI=0 -fabi-version=11 -fvisibility-inlines-hidden -DUSE_PTHREADPOOL -DNDEBUG -DUSE_KINETO -DLIBKINETO_NOROCTRACER -DUSE_FBGEMM -DUSE_QNNPACK -DUSE_PYTORCH_QNNPACK -DUSE_XNNPACK -DSYMBOLICATE_MOBILE_DEBUG_HANDLE -O2 -fPIC -Wall -Wextra -Werror=return-type -Werror=non-virtual-dtor -Werror=bool-operation -Wnarrowing -Wno-missing-field-initializers -Wno-type-limits -Wno-array-bounds -Wno-unknown-pragmas -Wno-unused-parameter -Wno-unused-function -Wno-unused-result -Wno-strict-overflow -Wno-strict-aliasing -Wno-stringop-overflow -Wsuggest-override -Wno-psabi -Wno-error=pedantic -Wno-error=old-style-cast -Wno-missing-braces -fdiagnostics-color=always -faligned-new -Wno-unused-but-set-variable -Wno-maybe-uninitialized -fno-math-errno -fno-trapping-math -Werror=format -Wno-stringop-overflow, LAPACK_INFO=mkl, PERF_WITH_AVX=1, PERF_WITH_AVX2=1, PERF_WITH_AVX512=1, TORCH_VERSION=2.2.2, USE_CUDA=ON, USE_CUDNN=ON, USE_EXCEPTION_PTR=1, USE_GFLAGS=OFF, USE_GLOG=OFF, USE_MKL=ON, USE_MKLDNN=ON, USE_MPI=OFF, USE_NCCL=1, USE_NNPACK=ON, USE_OPENMP=ON, USE_ROCM=OFF, USE_ROCM_KERNEL_ASSERT=OFF,

TorchVision: 0.17.2+cu121
LMDeploy: 0.6.1+
transformers: 4.45.2
gradio: 4.41.0
fastapi: 0.112.1
pydantic: 2.8.2
triton: 2.2.0

Error traceback

2024-10-09 04:07:10,136 - lmdeploy - WARNING - archs.py:53 - Fallback to pytorch engine because `../Qwen2-VL-7B-Instruct` not supported by turbomind engine.
`Qwen2VLRotaryEmbedding` can now be fully parameterized by passing the model config through the `config` argument. All other arguments will be removed in v4.46
HINT:    Please open http://0.0.0.0:12345 in a browser for detailed api usage!!!
INFO:     Started server process [6550]
INFO:     Waiting for application startup.
INFO:     Application startup complete.
INFO:     Uvicorn running on http://0.0.0.0:12345 (Press CTRL+C to quit)
INFO:     127.0.0.1:35220 - "GET /v1/models HTTP/1.1" 200 OK
Exception in callback _raise_exception_on_finish(<Future finis...-variables)')>) at /home/ma-user/.conda/envs/lmdeploy/lib/python3.10/site-packages/lmdeploy/vl/engine.py:20
handle: <Handle _raise_exception_on_finish(<Future finis...-variables)')>) at /home/ma-user/.conda/envs/lmdeploy/lib/python3.10/site-packages/lmdeploy/vl/engine.py:20>
Traceback (most recent call last):
  File "/home/ma-user/.conda/envs/lmdeploy/lib/python3.10/asyncio/events.py", line 80, in _run
    self._context.run(self._callback, *self._args)
  File "/home/ma-user/.conda/envs/lmdeploy/lib/python3.10/site-packages/lmdeploy/vl/engine.py", line 27, in _raise_exception_on_finish
    raise e
  File "/home/ma-user/.conda/envs/lmdeploy/lib/python3.10/site-packages/lmdeploy/vl/engine.py", line 23, in _raise_exception_on_finish
    task.result()
  File "/home/ma-user/.conda/envs/lmdeploy/lib/python3.10/concurrent/futures/thread.py", line 58, in run
    result = self.fn(*self.args, **self.kwargs)
  File "/home/ma-user/.conda/envs/lmdeploy/lib/python3.10/site-packages/lmdeploy/vl/engine.py", line 169, in forward
    outputs = self.model.forward(*func_inputs)
  File "/home/ma-user/.conda/envs/lmdeploy/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
  File "/home/ma-user/.conda/envs/lmdeploy/lib/python3.10/site-packages/lmdeploy/vl/model/qwen2.py", line 102, in forward
    image_embeds = self.model.visual(pixel_values,
  File "/home/ma-user/.conda/envs/lmdeploy/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/home/ma-user/.conda/envs/lmdeploy/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1520, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/ma-user/.conda/envs/lmdeploy/lib/python3.10/site-packages/transformers/models/qwen2_vl/modeling_qwen2_vl.py", line 1128, in forward
    hidden_states = blk(hidden_states, cu_seqlens=cu_seqlens, rotary_pos_emb=rotary_pos_emb)
  File "/home/ma-user/.conda/envs/lmdeploy/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/home/ma-user/.conda/envs/lmdeploy/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1520, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/ma-user/.conda/envs/lmdeploy/lib/python3.10/site-packages/transformers/models/qwen2_vl/modeling_qwen2_vl.py", line 431, in forward
    hidden_states = hidden_states + self.attn(
  File "/home/ma-user/.conda/envs/lmdeploy/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/home/ma-user/.conda/envs/lmdeploy/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1520, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/ma-user/.conda/envs/lmdeploy/lib/python3.10/site-packages/transformers/models/qwen2_vl/modeling_qwen2_vl.py", line 404, in forward
    attn_output = F.scaled_dot_product_attention(q, k, v, attention_mask, dropout_p=0.0)
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 53.06 GiB. GPU 0 has a total capacity of 79.35 GiB of which 12.84 GiB is free. Process 89800 has 66.50 GiB memory in use. Of the allocated memory 65.56 GiB is allocated by PyTorch, and 421.30 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation.  See documentation for Memory Management  (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)
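The 53.06 GiB request is consistent with scaled_dot_product_attention falling back to its math path and materializing the full heads × N × N score matrix over the visual tokens. A back-of-envelope sketch, assuming fp16 scores and the 16 vision-attention heads from Qwen2-VL's published vision config (both are assumptions, not values taken from this log):

# Estimate how many visual tokens N would produce a 53.06 GiB score tensor.
# Assumed: 16 heads, 2 bytes per element (fp16); if SDPA upcasts the scores
# to fp32, halve bytes_per_elem's effect and N drops to ~29,800.
heads, bytes_per_elem = 16, 2
alloc_bytes = 53.06 * 2**30                      # figure from the traceback
n_tokens = (alloc_bytes / (heads * bytes_per_elem)) ** 0.5
print(f"~{n_tokens:,.0f} visual tokens")         # roughly 42,000 patches

Tens of thousands of visual tokens points to a very high-resolution input image, which matches the reporter's later comment.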
jianliao commented 2 days ago

@cmpute What resolution are your images? Switching to smaller images might be enough.

I checked Qwen's documentation on HF, which explicitly mentions that the model can apparently auto-resize images via two preset configuration parameters. How can I pass launch parameters like these in lmdeploy? My current launch script is as follows:

lmdeploy serve api_server --backend pytorch Qwen/Qwen2-VL-2B-Instruct
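A client-side workaround, independent of any lmdeploy flag, is to cap the pixel count before the image ever reaches the server, mirroring the resize behaviour the Qwen docs describe (presumably the min_pixels/max_pixels processor settings). The budget value and helper name below are hypothetical, a minimal sketch only:

import math
from PIL import Image

MAX_PIXELS = 1280 * 28 * 28  # hypothetical budget, in the spirit of Qwen's max_pixels

def shrink_to_budget(path: str, max_pixels: int = MAX_PIXELS) -> Image.Image:
    # Downscale so width * height <= max_pixels, preserving aspect ratio.
    img = Image.open(path)
    w, h = img.size
    if w * h <= max_pixels:
        return img
    scale = math.sqrt(max_pixels / (w * h))
    return img.resize((max(1, int(w * scale)), max(1, int(h * scale))))

Sending the downscaled image to the api_server keeps the number of visual tokens, and hence the attention memory, bounded regardless of the original resolution.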
cmpute commented 1 day ago

@cmpute What resolution are your images? Switching to smaller images might be enough.

I checked Qwen's documentation on HF, which explicitly mentions that the model can apparently auto-resize images via two preset configuration parameters. How can I pass launch parameters like these in lmdeploy? My current launch script is as follows:

lmdeploy serve api_server --backend pytorch Qwen/Qwen2-VL-2B-Instruct

The images are quite large, so that may well be it. I'll give it a try when I get a chance.