Open bltcn opened 3 months ago
May try to "--max-batch-size 1" If it doesn't work, you may go for vLLM. It will take a while to optimize memory in LMDeploy. Don't let it to block your work
May try to "--max-batch-size 1" If it doesn't work, you may go for vLLM. It will take a while to optimize memory in LMDeploy. Don't let it to block your work
2024-08-01 03:24:58,259 - lmdeploy - INFO - input backend=turbomind, backend_config=TurbomindEngineConfig(model_name='qwen2-72b-instruct', model_format='hf', tp=8, session_len=24000, max_batch_size=1, cache_max_entry_count=0.01, cache_block_seq_len=64, enable_prefix_caching=False, quant_policy=0, rope_scaling_factor=0.0, use_logn_attn=False, download_dir=None, revision=None, max_prefill_token_num=8192, num_tokens_per_iter=0, max_prefill_iters=1)
2024-08-01 03:24:58,260 - lmdeploy - INFO - input chat_template_config=None
2024-08-01 03:24:58,540 - lmdeploy - INFO - updated chat_template_onfig=ChatTemplateConfig(model_name='qwen', system=None, meta_instruction=None, eosys=None, user=None, eoh=None, assistant=None, eoa=None, separator=None, capability=None, stop_words=None)
2024-08-01 03:24:58,540 - lmdeploy - INFO - model_source: hf_model
2024-08-01 03:24:58,540 - lmdeploy - WARNING - model_name is deprecated in TurbomindEngineConfig and has no effect
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
Device does not support bfloat16. Set float16 forcefully
2024-08-01 03:24:58,882 - lmdeploy - INFO - model_config:
[llama]
model_name = qwen
model_arch = Qwen2ForCausalLM
tensor_para_size = 8
head_num = 64
kv_head_num = 8
vocab_size = 152064
num_layer = 80
inter_size = 29568
norm_eps = 1e-06
attn_bias = 1
start_id = 151643
end_id = 151645
session_len = 24000
weight_type = fp16
rotary_embedding = 128
rope_theta = 1000000.0
size_per_head = 128
group_size = 0
max_batch_size = 1
max_context_token_num = 1
step_length = 1
cache_max_entry_count = 0.01
cache_block_seq_len = 64
cache_chunk_size = -1
enable_prefix_caching = False
num_tokens_per_iter = 8192
max_prefill_iters = 3
extra_tokens_per_iter = 0
use_context_fmha = 1
quant_policy = 0
max_position_embeddings = 32768
rope_scaling_factor = 0.0
use_dynamic_ntk = 0
use_logn_attn = 0
lora_policy =
lora_r = 0
lora_scale = 0.0
lora_max_wo_r = 0
lora_rank_pattern =
lora_scale_pattern =
[TM][WARNING] [LlamaTritonModel] max_context_token_num = 24000.
2024-08-01 03:25:02,417 - lmdeploy - WARNING - get 4643 model params
Convert to turbomind format: 0%| | 0/80 [00:00<?, ?it/s]
Traceback (most recent call last):
File "/opt/py38/bin/lmdeploy", line 33, in
Checklist
Describe the bug
8× RTX 2080 Ti (the 22 GB modded-VRAM version). Qwen2-72B-Instruct runs fine under vLLM, but it doesn't work with this system; it always fails with an out-of-memory error.
Reproduction
The command line is as follows: CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 lmdeploy serve api_server --model-name qwen2-72b-instruct --allow-origins * --tp 8 --log-level INFO --session-len 24000 --cache-max-entry-count 0.01 --model-format hf /root/.cache/Qwen2-72B-Instruct/
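If the server needs to be launched from Python rather than the shell, the sketch below mirrors the CLI command above; note that `lmdeploy.serve.openai.api_server.serve` and its keyword names are assumptions based on this LMDeploy version and may differ in other releases.

```python
# Sketch of the same launch via the Python entry point instead of the CLI.
# NOTE: the serve() helper and its keyword names are assumptions and may
# vary between LMDeploy versions; the CLI above is the reported repro.
from lmdeploy import TurbomindEngineConfig
from lmdeploy.serve.openai.api_server import serve

serve(
    '/root/.cache/Qwen2-72B-Instruct/',
    model_name='qwen2-72b-instruct',
    backend_config=TurbomindEngineConfig(
        tp=8,
        session_len=24000,
        cache_max_entry_count=0.01,
        max_batch_size=1,  # workaround suggested in the reply above
    ),
)
```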
Environment
Error traceback
(see the log output quoted above)