InternLM / lmdeploy

LMDeploy is a toolkit for compressing, deploying, and serving LLMs.
https://lmdeploy.readthedocs.io/en/latest/
Apache License 2.0

[Bug] lmdeploy pipe deployment of internvl2-40B: batch_infer error #2271

Open hitzhu opened 1 month ago

hitzhu commented 1 month ago


Describe the bug

batch_infer fails, but running inference on a single prompt works fine.


Reproduction

As titled; see the error traceback below.
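A minimal sketch of the batch call implied by the traceback (the reporter's test2.py was not shared; the model path, engine settings, and prompts below are assumptions):

from lmdeploy import pipeline, TurbomindEngineConfig
from lmdeploy.vl import load_image

# Assumed local model path and engine settings; adjust to your setup.
pipe = pipeline('./models/InternVL2-40B',
                backend_config=TurbomindEngineConfig(tp=2, session_len=8192))

image = load_image('https://raw.githubusercontent.com/open-mmlab/mmdeploy/main/demo/resources/det.jpg')

# A single request completes normally:
print(pipe(('describe this image', image)).text)

# A batch of requests is what triggers the OverflowError:
batch_input = [('describe this image', image)] * 4
for out in pipe(batch_input):
    print(out.text)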

Environment

lmdeploy==0.5.3
A100*2

Error traceback

main()
File "/checkpoint/binary/train_package/./test2.py", line 47, in main
batch_out = batch_infer(batch_input)
File "/checkpoint/binary/train_package/./test2.py", line 23, in batch_infer
batch_out = pipe(batch_input)
File "/root/.local/lib/python3.10/site-packages/lmdeploy/serve/vl_async_engine.py", line 123, in __call__
return super().__call__(prompts, **kwargs)
File "/root/.local/lib/python3.10/site-packages/lmdeploy/serve/async_engine.py", line 305, in __call__
return self.batch_infer(prompts,
File "/root/.local/lib/python3.10/site-packages/lmdeploy/serve/vl_async_engine.py", line 109, in batch_infer
return super().batch_infer(prompts, **kwargs)
File "/root/.local/lib/python3.10/site-packages/lmdeploy/serve/async_engine.py", line 429, in batch_infer
_get_event_loop().run_until_complete(gather())
File "/opt/conda/envs/python3.10.13/lib/python3.10/asyncio/base_events.py", line 649, in run_until_complete
return future.result()
File "/root/.local/lib/python3.10/site-packages/lmdeploy/serve/async_engine.py", line 426, in gather
await asyncio.gather(
File "/root/.local/lib/python3.10/site-packages/lmdeploy/serve/async_engine.py", line 411, in _inner_call
async for out in generator:
File "/root/.local/lib/python3.10/site-packages/lmdeploy/serve/async_engine.py", line 635, in generate
response, state = self.tokenizer.detokenize_incrementally(
File "/root/.local/lib/python3.10/site-packages/lmdeploy/tokenizer.py", line 642, in detokenize_incrementally
return self.model.detokenize_incrementally(
File "/root/.local/lib/python3.10/site-packages/lmdeploy/tokenizer.py", line 460, in detokenize_incrementally
new_tokens = tokenizer.convert_ids_to_tokens(
File "/root/.local/lib/python3.10/site-packages/transformers/tokenization_utils_fast.py", line 412, in convert_ids_to_tokens
tokens.append(self._tokenizer.id_to_token(index))
OverflowError: out of range integral type conversion attempted
irexyc commented 1 month ago

Please provide a way to reproduce this. Also, is internvl2-40B the official model or one you fine-tuned?

whalefa1I commented 2 weeks ago

Please provide a way to reproduce this. Also, is internvl2-40B the official model or one you fine-tuned?

The official model; the error is identical. Launch command:

lmdeploy serve api_server ./models/InternVL2-40B --backend turbomind --server-port 23333 --session-len 8192 --model-name internvl2-internlm2 --log-level INFO --tp 2

Client script:

import requests
query = '''整体概括一下图片表现的内容'''  # "Summarize the overall content of the image"
url = 'http://0.0.0.0:23333/v1/chat/completions'
headers = {'Content-Type': 'application/json'}  # headers used in the request below
data = {
    "model": "internvl2-internlm2",
    "messages": [
        {
            "role": "user",
            "content": [
                {
                    "type": "text",
                    "text": query
                },
                {
                    "type": "image_url",
                    "image_url": {
                        "url": "https://raw.githubusercontent.com/open-mmlab/mmdeploy/main/demo/resources/det.jpg"
                    }
                }
            ]
        }
    ]
}

response = requests.post(url, headers=headers, json=data)
print(response.json())

With --tp 1 it OOMs on an A800.

irexyc commented 2 weeks ago

@whalefa1I

After the model is launched, GPU memory usage grows slightly with the input (it depends on the batch size and session length). You can set --cache-max-entry-count 0.5 to reduce the KV cache's share of GPU memory and reserve more buffer headroom.
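Applied to the launch command above, that would look like this (a sketch; flag spelling per the lmdeploy 0.5.x CLI):

lmdeploy serve api_server ./models/InternVL2-40B --backend turbomind --server-port 23333 --session-len 8192 --model-name internvl2-internlm2 --log-level INFO --tp 2 --cache-max-entry-count 0.5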

lvhan028 commented 2 weeks ago

A 40B model needs about 80 GB of GPU memory, so a single A800 definitely won't work; two cards are appropriate. From the log above, it looks like an index exceeded the vocabulary size. @irexyc can you reproduce it?

tokens.append(self._tokenizer.id_to_token(index))
OverflowError: out of range integral type conversion attempted
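For reference, this message comes from the Rust-backed fast tokenizer: id_to_token only accepts ids that fit in its unsigned integer range, so a negative or otherwise corrupted token id fails the conversion with exactly this error. A minimal illustration with a stand-in fast tokenizer (gpt2 here, not the InternVL2 tokenizer):

from transformers import AutoTokenizer

# Any fast tokenizer shows the same behavior; gpt2 is only a stand-in.
tok = AutoTokenizer.from_pretrained('gpt2', use_fast=True)

# The Rust backend's id_to_token expects an unsigned integer id, so a
# negative id cannot be converted and raises:
#   OverflowError: out of range integral type conversion attempted
tok.convert_ids_to_tokens([-1])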