Open hitzhu opened 1 month ago
Please provide a way to reproduce this. Also, is internvl2-40B the official model or one you fine-tuned?
It is the official model, and the error is the same. Launch command:
lmdeploy serve api_server ./models/InternVL2-40B --backend turbomind --server-port 23333 --session-len 8192 --model-name internvl2-internlm2 --log-level INFO --tp 2
Client script:
import requests

query = '整体概括一下图片表现的内容'  # "Give an overall summary of what the image shows"
url = 'http://0.0.0.0:23333/v1/chat/completions'
# `headers` was not defined in the posted snippet; a plain JSON header is assumed
headers = {'Content-Type': 'application/json'}
data = {
    "model": "internvl2-internlm2",
    "messages": [
        {
            "role": "user",
            "content": [
                {
                    "type": "text",
                    "text": query
                },
                {
                    "type": "image_url",
                    "image_url": {
                        "url": "https://raw.githubusercontent.com/open-mmlab/mmdeploy/main/demo/resources/det.jpg"
                    }
                }
            ]
        }
    ]
}
response = requests.post(url, headers=headers, json=data)
print(response.json())
With --tp 1 it OOMs on an A800.
@whalefa1I
After the model starts, GPU memory grows slightly with the input (it depends on batch size and session length). You can set --cache_max_entry_count 0.5 to reduce the KV cache memory footprint and reserve more buffer.
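For example, the launch command from earlier in the thread with the suggested KV cache setting applied (the 0.5 value is the one suggested above; the exact flag spelling may differ between lmdeploy versions):

```shell
lmdeploy serve api_server ./models/InternVL2-40B \
    --backend turbomind \
    --server-port 23333 \
    --session-len 8192 \
    --model-name internvl2-internlm2 \
    --log-level INFO \
    --tp 2 \
    --cache_max_entry_count 0.5
```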
A 40B model needs about 80 GB of memory, so a single A800 definitely won't work; 2 GPUs are appropriate. From the earlier log, it looks like an index exceeded the vocabulary size. @irexyc can you reproduce this?
tokens.append(self._tokenizer.id_to_token(index))
OverflowError: out of range integral type conversion attempted
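This particular OverflowError is raised when the token id handed to the fast tokenizer's id_to_token cannot be converted to the unsigned 32-bit integer the underlying Rust tokenizer expects, e.g. a negative or absurdly large index, before any vocabulary lookup even happens. A minimal sketch of a defensive filter (the helper name and the stub tokenizer are hypothetical, not lmdeploy/transformers code):

```python
# Hypothetical guard: drop ids that cannot be represented as an unsigned
# 32-bit integer before calling a fast tokenizer's convert_ids_to_tokens.
U32_MAX = 2**32 - 1

def safe_convert_ids_to_tokens(tokenizer, ids):
    # Keep only ids in [0, U32_MAX]; anything outside would trigger
    # "OverflowError: out of range integral type conversion attempted".
    valid = [i for i in ids if 0 <= i <= U32_MAX]
    return tokenizer.convert_ids_to_tokens(valid)

class StubTokenizer:
    """Stands in for a HF fast tokenizer; real vocab lookup omitted."""
    def convert_ids_to_tokens(self, ids):
        return [f"<tok_{i}>" for i in ids]

# A negative id and an id above the u32 range are filtered out.
print(safe_convert_ids_to_tokens(StubTokenizer(), [5, -1, 2**40, 7]))
# → ['<tok_5>', '<tok_7>']
```

Note this only prevents the crash; if the model is emitting ids outside the vocabulary in the first place, the root cause still needs fixing upstream.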
Checklist
Describe the bug
batch_infer fails, but inference on a single prompt works fine.
    main()
  File "/checkpoint/binary/train_package/./test2.py", line 47, in main
    batch_out = batch_infer(batch_input)
  File "/checkpoint/binary/train_package/./test2.py", line 23, in batch_infer
    batch_out = pipe(batch_input)
  File "/root/.local/lib/python3.10/site-packages/lmdeploy/serve/vl_async_engine.py", line 123, in __call__
    return super().__call__(prompts, **kwargs)
  File "/root/.local/lib/python3.10/site-packages/lmdeploy/serve/async_engine.py", line 305, in __call__
    return self.batch_infer(prompts,
  File "/root/.local/lib/python3.10/site-packages/lmdeploy/serve/vl_async_engine.py", line 109, in batch_infer
    return super().batch_infer(prompts, **kwargs)
  File "/root/.local/lib/python3.10/site-packages/lmdeploy/serve/async_engine.py", line 429, in batch_infer
    _get_event_loop().run_until_complete(gather())
  File "/opt/conda/envs/python3.10.13/lib/python3.10/asyncio/base_events.py", line 649, in run_until_complete
    return future.result()
  File "/root/.local/lib/python3.10/site-packages/lmdeploy/serve/async_engine.py", line 426, in gather
    await asyncio.gather(
  File "/root/.local/lib/python3.10/site-packages/lmdeploy/serve/async_engine.py", line 411, in _inner_call
    async for out in generator:
  File "/root/.local/lib/python3.10/site-packages/lmdeploy/serve/async_engine.py", line 635, in generate
    response, state = self.tokenizer.detokenize_incrementally(
  File "/root/.local/lib/python3.10/site-packages/lmdeploy/tokenizer.py", line 642, in detokenize_incrementally
    return self.model.detokenize_incrementally(
  File "/root/.local/lib/python3.10/site-packages/lmdeploy/tokenizer.py", line 460, in detokenize_incrementally
    new_tokens = tokenizer.convert_ids_to_tokens(
  File "/root/.local/lib/python3.10/site-packages/transformers/tokenization_utils_fast.py", line 412, in convert_ids_to_tokens
    tokens.append(self._tokenizer.id_to_token(index))
OverflowError: out of range integral type conversion attempted
Reproduction
See the description above ("rt", i.e. as stated in the title).
Environment
Error traceback