InternLM / lmdeploy

LMDeploy is a toolkit for compressing, deploying, and serving LLMs.
https://lmdeploy.readthedocs.io/en/latest/
Apache License 2.0

[Bug] After upgrading lmdeploy, the `chat turbomind` command runs out of GPU memory #1001

Closed Hongru0306 closed 9 months ago

Hongru0306 commented 9 months ago

Describe the bug

(screenshot of the out-of-memory error attached)

Reproduction

lmdeploy convert internlm2-chat-7b  /root/share/temp/model_repos/internlm2-chat-7b/
lmdeploy chat turbomind ./workspace

Environment

Levenshtein               0.23.0
lit                       17.0.6
lmdeploy                  0.2.0

Error traceback

model_source: workspace
01/19 15:19:51 - turbomind - WARNING - loading from workspace, ignore args ['cache_max_entry_count', 'tp'] please use TurbomindEngineConfig or modify config.ini
01/19 15:19:53 - lmdeploy - WARNING - Can not find tokenizer.json. It may take long time to initialize the tokenizer.
model_config:

[llama]
model_name = internlm2-chat-7b
tensor_para_size = 1
head_num = 32
kv_head_num = 8
vocab_size = 92544
num_layer = 32
inter_size = 14336
norm_eps = 1e-05
attn_bias = 0
start_id = 1
end_id = 2
session_len = 32776
weight_type = fp16
rotary_embedding = 128
rope_theta = 1000000.0
size_per_head = 128
group_size = 0
max_batch_size = 64
max_context_token_num = 1
step_length = 1
cache_max_entry_count = 0.5
cache_block_seq_len = 128
cache_chunk_size = -1
num_tokens_per_iter = 0
max_prefill_iters = 1
extra_tokens_per_iter = 0
use_context_fmha = 1
quant_policy = 0
max_position_embeddings = 32768
rope_scaling_factor = 0.0
use_logn_attn = 0

Input chat template with model_name is None. Forcing to use internlm2-chat-7b
[WARNING] gemm_config.in is not found; using default GEMM algo
Exception in thread Thread-2 (_create_model_instance):
Traceback (most recent call last):
  File "/share/conda_envs/internlm-base/lib/python3.10/threading.py", line 1016, in _bootstrap_inner
    self.run()
  File "/share/conda_envs/internlm-base/lib/python3.10/threading.py", line 953, in run
    self._target(*self._args, **self._kwargs)
  File "/root/.local/lib/python3.10/site-packages/lmdeploy/turbomind/turbomind.py", line 486, in _create_model_instance
    model_inst = self.tm_model.model_comm.create_model_instance(
RuntimeError: [TM][ERROR] CUDA runtime error: out of memory /lmdeploy/src/turbomind/utils/allocator.h:231 

session 1

double enter to end input >>> 你好

[UNUSED_TOKEN_146]system
You are an AI assistant whose name is InternLM (书生·浦语).
- InternLM (书生·浦语) is a conversational language model that is developed by Shanghai AI Laboratory (上海人工智能实验室). It is designed to be helpful, honest, and harmless.
- InternLM (书生·浦语) can understand and communicate fluently in the language chosen by the user such as English and 中文.
[UNUSED_TOKEN_145]
[UNUSED_TOKEN_146]user
你好[UNUSED_TOKEN_145]
[UNUSED_TOKEN_146]assistant
 Traceback (most recent call last):
  File "/root/.local/bin/lmdeploy", line 8, in <module>
    sys.exit(run())
  File "/root/.local/lib/python3.10/site-packages/lmdeploy/cli/entrypoint.py", line 18, in run
    args.run(args)
  File "/root/.local/lib/python3.10/site-packages/lmdeploy/cli/chat.py", line 86, in turbomind
    main(**kwargs)
  File "/root/.local/lib/python3.10/site-packages/lmdeploy/turbomind/chat.py", line 117, in main
    for outputs in generator.stream_infer(
  File "/root/.local/lib/python3.10/site-packages/lmdeploy/turbomind/turbomind.py", line 798, in stream_infer
    self.model_insts[0].register_callback(self._forward_callback)
AttributeError: 'NoneType' object has no attribute 'register_callback'
lvhan028 commented 9 months ago

Try lowering `cache_max_entry_count`. What GPU are you using?

Hongru0306 commented 9 months ago

Try lowering `cache_max_entry_count`. What GPU are you using?

I'm using the 1/4 100 GPU from the hands-on camp. Setting it to 0.1 fixed it, thanks for the reply!
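Since the model here was converted into a workspace, the fix above amounts to rewriting one value in the workspace's `config.ini` (the `[llama]` section dumped in the log). A minimal sketch of doing that programmatically; the exact path inside the workspace is an assumption based on lmdeploy's converter layout, not confirmed in this thread:

```python
# Sketch: lower the fraction of free GPU memory that turbomind
# reserves for the k/v cache by editing the workspace's config.ini.
import configparser

def set_cache_ratio(cfg_path: str, ratio: float = 0.1) -> None:
    """Rewrite cache_max_entry_count in a turbomind config.ini."""
    cfg = configparser.ConfigParser()
    cfg.read(cfg_path)
    # The [llama] section matches the model_config dumped in the log.
    cfg["llama"]["cache_max_entry_count"] = str(ratio)
    with open(cfg_path, "w") as f:
        cfg.write(f)

# Assumed location inside the converted workspace:
# set_cache_ratio("./workspace/triton_models/weights/config.ini", 0.1)
```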

miaoerduo commented 9 months ago

Try lowering `cache_max_entry_count`. What GPU are you using?

It worked, thanks.

hhnshiyi commented 9 months ago

`cache_max_entry_count` — which file do I adjust this parameter in?
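Judging from the warning printed at startup ("loading from workspace, ignore args ['cache_max_entry_count', 'tp'] please use TurbomindEngineConfig or modify config.ini"), when chatting from a converted workspace the value lives in the `[llama]` section of the workspace's `config.ini`; when loading a model directly, the same knob is exposed as `cache_max_entry_count` on `TurbomindEngineConfig`. A minimal fragment of the config edit (the path is an assumption, not confirmed in this thread):

```ini
; assumed path: ./workspace/triton_models/weights/config.ini
[llama]
; was 0.5 in the log above; lower it to shrink the k/v cache
cache_max_entry_count = 0.1
```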