model_source: workspace
01/19 15:19:51 - turbomind - WARNING - loading from workspace, ignore args ['cache_max_entry_count', 'tp'] please use TurbomindEngineConfig or modify config.ini
01/19 15:19:53 - lmdeploy - WARNING - Can not find tokenizer.json. It may take long time to initialize the tokenizer.
model_config:
[llama]
model_name = internlm2-chat-7b
tensor_para_size = 1
head_num = 32
kv_head_num = 8
vocab_size = 92544
num_layer = 32
inter_size = 14336
norm_eps = 1e-05
attn_bias = 0
start_id = 1
end_id = 2
session_len = 32776
weight_type = fp16
rotary_embedding = 128
rope_theta = 1000000.0
size_per_head = 128
group_size = 0
max_batch_size = 64
max_context_token_num = 1
step_length = 1
cache_max_entry_count = 0.5
cache_block_seq_len = 128
cache_chunk_size = -1
num_tokens_per_iter = 0
max_prefill_iters = 1
extra_tokens_per_iter = 0
use_context_fmha = 1
quant_policy = 0
max_position_embeddings = 32768
rope_scaling_factor = 0.0
use_logn_attn = 0
Input chat template with model_name is None. Forcing to use internlm2-chat-7b
[WARNING] gemm_config.in is not found; using default GEMM algo
Exception in thread Thread-2 (_create_model_instance):
Traceback (most recent call last):
File "/share/conda_envs/internlm-base/lib/python3.10/threading.py", line 1016, in _bootstrap_inner
self.run()
File "/share/conda_envs/internlm-base/lib/python3.10/threading.py", line 953, in run
self._target(*self._args, **self._kwargs)
File "/root/.local/lib/python3.10/site-packages/lmdeploy/turbomind/turbomind.py", line 486, in _create_model_instance
model_inst = self.tm_model.model_comm.create_model_instance(
RuntimeError: [TM][ERROR] CUDA runtime error: out of memory /lmdeploy/src/turbomind/utils/allocator.h:231
session 1
double enter to end input >>> 你好
[UNUSED_TOKEN_146]system
You are an AI assistant whose name is InternLM (书生·浦语).
- InternLM (书生·浦语) is a conversational language model that is developed by Shanghai AI Laboratory (上海人工智能实验室). It is designed to be helpful, honest, and harmless.
- InternLM (书生·浦语) can understand and communicate fluently in the language chosen by the user such as English and 中文.
[UNUSED_TOKEN_145]
[UNUSED_TOKEN_146]user
你好[UNUSED_TOKEN_145]
[UNUSED_TOKEN_146]assistant
Traceback (most recent call last):
File "/root/.local/bin/lmdeploy", line 8, in <module>
sys.exit(run())
File "/root/.local/lib/python3.10/site-packages/lmdeploy/cli/entrypoint.py", line 18, in run
args.run(args)
File "/root/.local/lib/python3.10/site-packages/lmdeploy/cli/chat.py", line 86, in turbomind
main(**kwargs)
File "/root/.local/lib/python3.10/site-packages/lmdeploy/turbomind/chat.py", line 117, in main
for outputs in generator.stream_infer(
File "/root/.local/lib/python3.10/site-packages/lmdeploy/turbomind/turbomind.py", line 798, in stream_infer
self.model_insts[0].register_callback(self._forward_callback)
AttributeError: 'NoneType' object has no attribute 'register_callback'
Checklist
Describe the bug
With version 0.1.0, 14 GB of VRAM was enough, but after updating lmdeploy today, both 1 and internlm2-chat-7b run out of GPU memory when converted to turbomind with the same command and used for chat.
Reproduction
Environment
Error traceback
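The warning at the top of the log says that when loading from a workspace, engine arguments such as `cache_max_entry_count` are ignored and `config.ini` must be edited instead; the dump above shows `cache_max_entry_count = 0.5` under `[llama]`. A minimal sketch of lowering that value with Python's stdlib `configparser` (the workspace layout, the helper name, and the target fraction 0.2 are assumptions, not values from this report):

```python
import configparser
import os
import tempfile

def set_kv_cache_fraction(config_path: str, fraction: float) -> None:
    """Lower cache_max_entry_count in a turbomind workspace config.ini."""
    cfg = configparser.ConfigParser()
    cfg.read(config_path)
    # The turbomind model config lives under the [llama] section (see the dump above).
    cfg["llama"]["cache_max_entry_count"] = str(fraction)
    with open(config_path, "w") as f:
        cfg.write(f)

# Demo on a stand-in file; for a real workspace the config is typically at
# ./workspace/triton_models/weights/config.ini (assumed layout, verify locally).
demo = os.path.join(tempfile.mkdtemp(), "config.ini")
with open(demo, "w") as f:
    f.write("[llama]\ncache_max_entry_count = 0.5\n")
set_kv_cache_fraction(demo, 0.2)  # e.g. 20% of free VRAM for KV cache instead of 50%
```

For models loaded from a Hugging Face path rather than a workspace, the same setting can instead be passed through `TurbomindEngineConfig`, as the warning itself suggests.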