InternLM / lmdeploy

LMDeploy is a toolkit for compressing, deploying, and serving LLMs.
https://lmdeploy.readthedocs.io/en/latest/
Apache License 2.0

[Bug] When deploying InternVL-Chat-V1-5 on multiple GPUs, OutOfMemory occurs even though there is enough total GPU memory #1555

Closed: Halflifefa closed this issue 3 months ago

Halflifefa commented 4 months ago

Checklist

Describe the bug

When InternVL-Chat-V1-5 is deployed on multiple GPUs, GPU memory is not distributed evenly; most of it ends up on the first card. This causes two problems.

  1. During inference, the first card easily runs out of memory and raises an error, even though the other cards still have plenty of free memory.
  2. If another small model is already running on the first card, leaving it with, say, 20 GB free, the error still occurs.

The problem cannot be solved by adjusting cache-max-entry-count.

Reproduction

Please support making use of the remaining GPU memory on the other cards.

Environment

lmdeploy 0.4.1
torchvision 0.17.2
timm 0.9.16

lmdeploy serve api_server InternVL-Chat-V1-5 --server-port 23333 --tp 4 --cache-max-entry-count 0.2

```python
from openai import OpenAI

client = OpenAI(api_key='YOUR_API_KEY', base_url='http://0.0.0.0:23333/v1')
model_name = client.models.list().data[0].id
response = client.chat.completions.create(
    model=model_name,
    messages=[{
        'role':
        'user',
        'content': [{
            'type': 'text',
            'text': 'Describe the image please',
        }, {
            'type': 'image_url',
            'image_url': {
                'url':
                'https://raw.githubusercontent.com/open-mmlab/mmdeploy/main/tests/data/tiger.jpeg',
            },
        }],
    }],
    temperature=0.8,
    top_p=0.8)
print(response)
```

Error traceback

```Shell
HINT:    Please open http://0.0.0.0:23333 in a browser for detailed api usage!!!
HINT:    Please open http://0.0.0.0:23333 in a browser for detailed api usage!!!
HINT:    Please open http://0.0.0.0:23333 in a browser for detailed api usage!!!
INFO:     Started server process [202693]
INFO:     Waiting for application startup.
INFO:     Application startup complete.
INFO:     Uvicorn running on http://0.0.0.0:23333 (Press CTRL+C to quit)
INFO:     172.20.183.210:8636 - "GET / HTTP/1.1" 200 OK
INFO:     172.20.183.210:8636 - "GET /openapi.json HTTP/1.1" 200 OK
INFO:     127.0.0.1:41932 - "GET /v1/models HTTP/1.1" 200 OK
Exception in thread Thread-1 (_work_thread):
Traceback (most recent call last):
  File "/home/super/miniconda3/envs/lmdeploy/lib/python3.10/threading.py", line 1016, in _bootstrap_inner
    self.run()
  File "/home/super/miniconda3/envs/lmdeploy/lib/python3.10/threading.py", line 953, in run
    self._target(*self._args, **self._kwargs)
  File "/home/super/miniconda3/envs/lmdeploy/lib/python3.10/site-packages/lmdeploy/vl/engine.py", line 80, in _work_thread
    self.loop.run_until_complete(self._forward_loop())
  File "/home/super/miniconda3/envs/lmdeploy/lib/python3.10/asyncio/base_events.py", line 649, in run_until_complete
    return future.result()
  File "/home/super/miniconda3/envs/lmdeploy/lib/python3.10/site-packages/lmdeploy/vl/engine.py", line 96, in _forward_loop
    outputs = self.forward(inputs)
  File "/home/super/miniconda3/envs/lmdeploy/lib/python3.10/site-packages/lmdeploy/vl/engine.py", line 105, in forward
    outputs = self.model.forward(inputs)
  File "/home/super/miniconda3/envs/lmdeploy/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
  File "/home/super/miniconda3/envs/lmdeploy/lib/python3.10/site-packages/lmdeploy/vl/model/internvl.py", line 156, in forward
    return self._forward_func(images)
  File "/home/super/miniconda3/envs/lmdeploy/lib/python3.10/site-packages/lmdeploy/vl/model/internvl.py", line 137, in _forward_v1_5
    outputs = self.model.extract_feature(outputs)
  File "/home/super/.cache/huggingface/modules/transformers_modules/InternVL-Chat-V1-5/modeling_internvl_chat.py", line 250, in extract_feature
    vit_embeds = self.vision_model(
  File "/home/super/miniconda3/envs/lmdeploy/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/home/super/miniconda3/envs/lmdeploy/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1520, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/super/.cache/huggingface/modules/transformers_modules/InternVL-Chat-V1-5/modeling_intern_vit.py", line 411, in forward
    encoder_outputs = self.encoder(
  File "/home/super/miniconda3/envs/lmdeploy/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/home/super/miniconda3/envs/lmdeploy/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1520, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/super/.cache/huggingface/modules/transformers_modules/InternVL-Chat-V1-5/modeling_intern_vit.py", line 347, in forward
    layer_outputs = encoder_layer(
  File "/home/super/miniconda3/envs/lmdeploy/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/home/super/miniconda3/envs/lmdeploy/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1520, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/super/.cache/huggingface/modules/transformers_modules/InternVL-Chat-V1-5/modeling_intern_vit.py", line 289, in forward
    hidden_states = hidden_states + self.drop_path1(self.attn(self.norm1(hidden_states)) * self.ls1)
  File "/home/super/miniconda3/envs/lmdeploy/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/home/super/miniconda3/envs/lmdeploy/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1520, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/super/.cache/huggingface/modules/transformers_modules/InternVL-Chat-V1-5/modeling_intern_vit.py", line 246, in forward
    x = self._naive_attn(hidden_states) if not self.use_flash_attn else self._flash_attn(hidden_states)
  File "/home/super/.cache/huggingface/modules/transformers_modules/InternVL-Chat-V1-5/modeling_intern_vit.py", line 217, in _naive_attn
    k = self.k_norm(k.transpose(1, 2).flatten(-2, -1)).view(B_, N_, H_, D_).transpose(1, 2)
  File "/home/super/miniconda3/envs/lmdeploy/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/home/super/miniconda3/envs/lmdeploy/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1520, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/super/.cache/huggingface/modules/transformers_modules/InternVL-Chat-V1-5/modeling_intern_vit.py", line 113, in forward
    variance = hidden_states.pow(2).mean(-1, keepdim=True)
  File "/home/super/miniconda3/envs/lmdeploy/lib/python3.10/site-packages/torch/utils/_device.py", line 77, in __torch_function__
    return func(*args, **kwargs)
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 88.00 MiB. GPU 0 has a total capacity of 23.65 GiB of which 29.81 MiB is free. Including non-PyTorch memory, this process has 23.61 GiB memory in use. Of the allocated memory 10.91 GiB is allocated by PyTorch, and 290.35 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation.  See documentation for Memory Management  (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)

|   0  NVIDIA GeForce RTX 4090        Off | 00000000:52:00.0 Off |                  Off |
| 30%   36C    P2              55W / 450W |  24187MiB / 24564MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
|   1  NVIDIA GeForce RTX 4090        Off | 00000000:56:00.0 Off |                  Off |
| 30%   31C    P8              12W / 450W |  12641MiB / 24564MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
|   2  NVIDIA GeForce RTX 4090        Off | 00000000:D1:00.0 Off |                  Off |
| 30%   32C    P8              16W / 450W |  12641MiB / 24564MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
|   3  NVIDIA GeForce RTX 4090        Off | 00000000:D5:00.0 Off |                  Off |
| 30%   34C    P8              23W / 450W |  12701MiB / 24564MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
```

irexyc commented 4 months ago

A vision-language model like this has two parts: the vision model and the LLM. The LLM part is split evenly across the GPUs, but the vision model sits entirely on GPU 0. Splitting the vision model evenly is this month's work; it may take one to two weeks.

Halflifefa commented 4 months ago

> A vision-language model like this has two parts: the vision model and the LLM. The LLM part is split evenly across the GPUs, but the vision model sits entirely on GPU 0. Splitting the vision model evenly is this month's work; it may take one to two weeks.

Okay. This problem presumably shows up because InternVL-Chat-V1.5's vision module is quite large; qwen-vl's vision module is much smaller.

BIGBALLON commented 4 months ago

@irexyc I tested running InternVL-1.5-Chat-Int8 on a single card of a 4x V100 (32 GB) machine and it runs out of memory; is this a similar problem? But it seems that if the text part were split evenly, 32 GB x 4 should be enough. Any guidance or suggestions? Thanks.

irexyc commented 4 months ago

@BIGBALLON

InternVL-1.5-Chat-Int8 is not supported. Quantization in lmdeploy, as well as even splitting of the vision model, will be supported this month. For now, please use the unquantized floating-point model.

Take a look at the cache_max_entry_count parameter. With the pipeline interface it can be set like this: `pipe = pipeline('model_path', backend_config=TurbomindEngineConfig(tp=4, cache_max_entry_count=0.5))`; for the server, check the corresponding command-line option. You can also try setting quant_policy to 4 or 8 to quantize the KV cache.
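
For reference, here is a minimal sketch of that pipeline call (the model path, the image URL, and quant_policy=8 are illustrative choices, not values confirmed in this thread):

```python
from lmdeploy import pipeline, TurbomindEngineConfig
from lmdeploy.vl import load_image

# Placeholder model path; point it at your local InternVL-Chat-V1-5 checkpoint.
pipe = pipeline(
    'OpenGVLab/InternVL-Chat-V1-5',
    backend_config=TurbomindEngineConfig(
        tp=4,                       # tensor parallelism across 4 GPUs
        cache_max_entry_count=0.5,  # fraction of free GPU memory used for the KV cache
        quant_policy=8,             # optional KV-cache quantization (4 or 8)
    ))

image = load_image('https://raw.githubusercontent.com/open-mmlab/mmdeploy/main/tests/data/tiger.jpeg')
response = pipe(('Describe the image please', image))
print(response)
```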

InternVL-1.5-Chat's vision part takes about 12 GB and the LLM part about 42 GB. If the LLM part is split evenly, the weights use 12 + 11 = 23 GB on GPU 0 and 11 GB on each of the other cards, leaving about 9 GB free on GPU 0. The KV cache is then allocated as a fraction (cache_max_entry_count) of the free memory; with 0.5, GPU 0 ends up at roughly 27.5 GB (23 + 0.5 × 9) and the other cards at roughly 15.5 GB (11 + 0.5 × 9). Some extra buffer is also needed at runtime, and if the vision model processes more than one image per batch the memory usage grows noticeably; you can change that batch size to 1.
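
As a rough back-of-the-envelope check of those numbers (plain Python, not lmdeploy code; the GB figures, and the assumption that every rank allocates the same KV-cache slice bounded by GPU 0's free memory, are taken from the estimate above):

```python
# Rough per-GPU budget for InternVL-Chat-V1-5 with tp=4 on 32 GB cards,
# using the approximate figures from the comment above.
tp = 4
card_total_gb = 32.0
vision_gb = 12.0              # vision tower, loaded entirely on GPU 0
llm_per_card_gb = 11.0        # ~42 GB of LLM weights split over 4 ranks
cache_max_entry_count = 0.5   # fraction of free memory given to the KV cache

weights = [vision_gb + llm_per_card_gb] + [llm_per_card_gb] * (tp - 1)

# Assumption: every rank allocates the same amount of KV cache, so the budget
# is bounded by the card with the least free memory (GPU 0 here).
kv_per_card = cache_max_entry_count * min(card_total_gb - w for w in weights)

for gpu, w in enumerate(weights):
    print(f'GPU {gpu}: weights ~{w:.1f} GB, with KV cache ~{w + kv_per_card:.1f} GB')
# GPU 0:    weights ~23.0 GB, with KV cache ~27.5 GB
# GPUs 1-3: weights ~11.0 GB, with KV cache ~15.5 GB
```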

BIGBALLON commented 4 months ago

@irexyc Thanks for the explanation, got it.

Looking forward to the quantized models and the vision-model splitting feature! Great work!

lvhan028 commented 3 months ago

It is a similar issue to #1563. I am going to close this issue; let's keep tracking it in #1563.

WangJianQ-cmd commented 3 months ago

> A vision-language model like this has two parts: the vision model and the LLM. The LLM part is split evenly across the GPUs, but the vision model sits entirely on GPU 0. Splitting the vision model evenly is this month's work; it may take one to two weeks.

Will this be fixed in the model itself, or on the lmdeploy deployment side?

irexyc commented 3 months ago

@WangJianQ-cmd

It will be fixed on the lmdeploy side; you can follow https://github.com/InternLM/lmdeploy/pull/1591

serser commented 2 months ago

May I ask whether dynamic high-resolution is supported?

irexyc commented 2 months ago

@serser

For now it simply reuses the preprocessing from the demo.