Checklist

[X] 1. I have searched related issues but cannot get the expected help.
[X] 2. The bug has not been fixed in the latest version.
[X] 3. Please note that if the bug-related issue you submitted lacks corresponding environment info and a minimal reproducible demo, it will be challenging for us to reproduce and resolve the issue, reducing the likelihood of receiving feedback.

Error traceback
2024-10-09 04:07:10,136 - lmdeploy - WARNING - archs.py:53 - Fallback to pytorch engine because `../Qwen2-VL-7B-Instruct` not supported by turbomind engine.
`Qwen2VLRotaryEmbedding` can now be fully parameterized by passing the model config through the `config` argument. All other arguments will be removed in v4.46
HINT: Please open http://0.0.0.0:12345 in a browser for detailed api usage!!!
HINT: Please open http://0.0.0.0:12345 in a browser for detailed api usage!!!
HINT: Please open http://0.0.0.0:12345 in a browser for detailed api usage!!!
INFO: Started server process [6550]
INFO: Waiting for application startup.
INFO: Application startup complete.
INFO: Uvicorn running on http://0.0.0.0:12345 (Press CTRL+C to quit)
INFO: 127.0.0.1:35220 - "GET /v1/models HTTP/1.1" 200 OK
Exception in callback _raise_exception_on_finish(<Future finis...-variables)')>) at /home/ma-user/.conda/envs/lmdeploy/lib/python3.10/site-packages/lmdeploy/vl/engine.py:20
handle: <Handle _raise_exception_on_finish(<Future finis...-variables)')>) at /home/ma-user/.conda/envs/lmdeploy/lib/python3.10/site-packages/lmdeploy/vl/engine.py:20>
Traceback (most recent call last):
File "/home/ma-user/.conda/envs/lmdeploy/lib/python3.10/asyncio/events.py", line 80, in _run
self._context.run(self._callback, *self._args)
File "/home/ma-user/.conda/envs/lmdeploy/lib/python3.10/site-packages/lmdeploy/vl/engine.py", line 27, in _raise_exception_on_finish
raise e
File "/home/ma-user/.conda/envs/lmdeploy/lib/python3.10/site-packages/lmdeploy/vl/engine.py", line 23, in _raise_exception_on_finish
task.result()
File "/home/ma-user/.conda/envs/lmdeploy/lib/python3.10/concurrent/futures/thread.py", line 58, in run
result = self.fn(*self.args, **self.kwargs)
File "/home/ma-user/.conda/envs/lmdeploy/lib/python3.10/site-packages/lmdeploy/vl/engine.py", line 169, in forward
outputs = self.model.forward(*func_inputs)
File "/home/ma-user/.conda/envs/lmdeploy/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
return func(*args, **kwargs)
File "/home/ma-user/.conda/envs/lmdeploy/lib/python3.10/site-packages/lmdeploy/vl/model/qwen2.py", line 102, in forward
image_embeds = self.model.visual(pixel_values,
File "/home/ma-user/.conda/envs/lmdeploy/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/home/ma-user/.conda/envs/lmdeploy/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1520, in _call_impl
return forward_call(*args, **kwargs)
File "/home/ma-user/.conda/envs/lmdeploy/lib/python3.10/site-packages/transformers/models/qwen2_vl/modeling_qwen2_vl.py", line 1128, in forward
hidden_states = blk(hidden_states, cu_seqlens=cu_seqlens, rotary_pos_emb=rotary_pos_emb)
File "/home/ma-user/.conda/envs/lmdeploy/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/home/ma-user/.conda/envs/lmdeploy/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1520, in _call_impl
return forward_call(*args, **kwargs)
File "/home/ma-user/.conda/envs/lmdeploy/lib/python3.10/site-packages/transformers/models/qwen2_vl/modeling_qwen2_vl.py", line 431, in forward
hidden_states = hidden_states + self.attn(
File "/home/ma-user/.conda/envs/lmdeploy/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/home/ma-user/.conda/envs/lmdeploy/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1520, in _call_impl
return forward_call(*args, **kwargs)
File "/home/ma-user/.conda/envs/lmdeploy/lib/python3.10/site-packages/transformers/models/qwen2_vl/modeling_qwen2_vl.py", line 404, in forward
attn_output = F.scaled_dot_product_attention(q, k, v, attention_mask, dropout_p=0.0)
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 53.06 GiB. GPU 0 has a total capacity of 79.35 GiB of which 12.84 GiB is free. Process 89800 has 66.50 GiB memory in use. Of the allocated memory 65.56 GiB is allocated by PyTorch, and 421.30 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)
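For context: the allocation that fails above is a single 53.06 GiB request inside F.scaled_dot_product_attention in the vision blocks, far more than the 12.84 GiB still free. A rough back-of-the-envelope sketch (Python) of where a number that size can come from is below; the head count and element size are illustrative assumptions, not values read from the log.

# Rough estimate of the (heads, seq, seq) attention-score matrix that
# scaled_dot_product_attention materializes when it falls back to the
# unfused path (an explicit attention_mask, as in the call above, can
# rule out the flash kernel). NUM_HEADS and BYTES_PER_ELEM are assumed
# ViT-style values, not numbers taken from the traceback.
NUM_HEADS = 16        # assumed head count of the Qwen2-VL vision encoder
BYTES_PER_ELEM = 2    # assumed bf16/fp16 activations

def attn_scores_gib(num_visual_tokens: int) -> float:
    """Approximate size in GiB of one (heads, seq, seq) score matrix."""
    return NUM_HEADS * num_visual_tokens ** 2 * BYTES_PER_ELEM / 1024 ** 3

for tokens in (4_000, 16_000, 40_000):
    print(f"{tokens:>6} visual tokens -> ~{attn_scores_gib(tokens):.1f} GiB")
# prints roughly 0.5 GiB, 7.6 GiB and 47.7 GiB respectively

If this reading is right, the memory needed by that one call grows quadratically with the number of visual tokens, so a single very large image (or many images/frames in one request) can exceed the free memory even though the 7B weights themselves fit easily in 80 GB; presumably most of the 66.50 GiB already in use is the weights plus the cache pre-allocated by the pytorch engine. The expandable_segments hint in the error message only mitigates fragmentation and would not help with a single 53.06 GiB request; capping the input resolution (and hence the visual token count) is the knob this sketch points at.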
Describe the bug

Qwen2-VL 7B should, in principle, fit within 80 GB of GPU memory, but inference hits an OOM during actual deployment.

Reproduction

lmdeploy serve api_server ../Qwen2-VL-7B-Instruct --server-port 12345

Environment