InternLM / lmdeploy

LMDeploy is a toolkit for compressing, deploying, and serving LLMs.
https://lmdeploy.readthedocs.io/en/latest/
Apache License 2.0

[Bug] lmdeploy pipe deployment of internvl2-40B: batch_infer error #2271

Open hitzhu opened 1 month ago

hitzhu commented 1 month ago


Describe the bug

batch_infer fails, but running inference on a single prompt works fine.


Reproduction

As titled; see the error traceback below.
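A minimal sketch of the batch call implied by the traceback (the reporter's test2.py was not shared; the model path, engine settings, and prompts below are assumptions):

from lmdeploy import pipeline, TurbomindEngineConfig
from lmdeploy.vl import load_image

# Assumed local model path and engine settings; adjust to your setup.
pipe = pipeline('./models/InternVL2-40B',
                backend_config=TurbomindEngineConfig(tp=2, session_len=8192))

image = load_image('https://raw.githubusercontent.com/open-mmlab/mmdeploy/main/demo/resources/det.jpg')

# A single request completes normally:
print(pipe(('describe this image', image)).text)

# A batch of requests is what triggers the OverflowError:
batch_input = [('describe this image', image)] * 4
for out in pipe(batch_input):
    print(out.text)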

Environment

lmdeploy==0.5.3
A100*2

Error traceback

main()
File "/checkpoint/binary/train_package/./test2.py", line 47, in main
batch_out = batch_infer(batch_input)
File "/checkpoint/binary/train_package/./test2.py", line 23, in batch_infer
batch_out = pipe(batch_input)
File "/root/.local/lib/python3.10/site-packages/lmdeploy/serve/vl_async_engine.py", line 123, in __call__
return super().__call__(prompts, **kwargs)
File "/root/.local/lib/python3.10/site-packages/lmdeploy/serve/async_engine.py", line 305, in __call__
return self.batch_infer(prompts,
File "/root/.local/lib/python3.10/site-packages/lmdeploy/serve/vl_async_engine.py", line 109, in batch_infer
return super().batch_infer(prompts, **kwargs)
File "/root/.local/lib/python3.10/site-packages/lmdeploy/serve/async_engine.py", line 429, in batch_infer
_get_event_loop().run_until_complete(gather())
File "/opt/conda/envs/python3.10.13/lib/python3.10/asyncio/base_events.py", line 649, in run_until_complete
return future.result()
File "/root/.local/lib/python3.10/site-packages/lmdeploy/serve/async_engine.py", line 426, in gather
await asyncio.gather(
File "/root/.local/lib/python3.10/site-packages/lmdeploy/serve/async_engine.py", line 411, in _inner_call
async for out in generator:
File "/root/.local/lib/python3.10/site-packages/lmdeploy/serve/async_engine.py", line 635, in generate
response, state = self.tokenizer.detokenize_incrementally(
File "/root/.local/lib/python3.10/site-packages/lmdeploy/tokenizer.py", line 642, in detokenize_incrementally
return self.model.detokenize_incrementally(
File "/root/.local/lib/python3.10/site-packages/lmdeploy/tokenizer.py", line 460, in detokenize_incrementally
new_tokens = tokenizer.convert_ids_to_tokens(
File "/root/.local/lib/python3.10/site-packages/transformers/tokenization_utils_fast.py", line 412, in convert_ids_to_tokens
tokens.append(self._tokenizer.id_to_token(index))
OverflowError: out of range integral type conversion attempted
irexyc commented 1 month ago

Please provide a way to reproduce this. Also, is internvl2-40B the official model or one you fine-tuned?

whalefa1I commented 2 weeks ago

Please provide a way to reproduce this. Also, is internvl2-40B the official model or one you fine-tuned?

The official model; the error is identical. Launch command:

lmdeploy serve api_server ./models/InternVL2-40B --backend turbomind --server-port 23333 --session-len 8192 --model-name internvl2-internlm2 --log-level INFO --tp 2

Client script:

import requests
query = '''整体概括一下图片表现的内容'''  # "Summarize the overall content of the image"
url = 'http://0.0.0.0:23333/v1/chat/completions'
headers = {'Content-Type': 'application/json'}  # headers used in the request below
data = {
    "model": "internvl2-internlm2",
    "messages": [
        {
            "role": "user",
            "content": [
                {
                    "type": "text",
                    "text": query
                },
                {
                    "type": "image_url",
                    "image_url": {
                        "url": "https://raw.githubusercontent.com/open-mmlab/mmdeploy/main/demo/resources/det.jpg"
                    }
                }
            ]
        }
    ]
}

response = requests.post(url, headers=headers, json=data)
print(response.json())

With --tp 1 it OOMs on an A800.

irexyc commented 2 weeks ago

@whalefa1I

After the model is launched, GPU memory usage grows slightly with the input (it depends on the batch size and session length). You can set --cache-max-entry-count 0.5 to reduce the KV cache's share of GPU memory and reserve more buffer headroom.
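Applied to the launch command above, that would look like this (a sketch; flag spelling per the lmdeploy 0.5.x CLI):

lmdeploy serve api_server ./models/InternVL2-40B --backend turbomind --server-port 23333 --session-len 8192 --model-name internvl2-internlm2 --log-level INFO --tp 2 --cache-max-entry-count 0.5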

lvhan028 commented 2 weeks ago

A 40B model needs about 80 GB of GPU memory, so a single A800 definitely won't work; two cards are appropriate. From the log above, it looks like an index exceeded the vocabulary size. @irexyc can you reproduce it?

tokens.append(self._tokenizer.id_to_token(index))
OverflowError: out of range integral type conversion attempted
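For reference, this message comes from the Rust-backed fast tokenizer: id_to_token only accepts ids that fit in its unsigned integer range, so a negative or otherwise corrupted token id fails the conversion with exactly this error. A minimal illustration with a stand-in fast tokenizer (gpt2 here, not the InternVL2 tokenizer):

from transformers import AutoTokenizer

# Any fast tokenizer shows the same behavior; gpt2 is only a stand-in.
tok = AutoTokenizer.from_pretrained('gpt2', use_fast=True)

# The Rust backend's id_to_token expects an unsigned integer id, so a
# negative id cannot be converted and raises:
#   OverflowError: out of range integral type conversion attempted
tok.convert_ids_to_tokens([-1])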