InternLM / lmdeploy

LMDeploy is a toolkit for compressing, deploying, and serving LLMs.
https://lmdeploy.readthedocs.io/en/latest/
Apache License 2.0

[Bug] qwen1.5-14b-awq service deployment error #1532

Open Vincent131499 opened 3 months ago

Vincent131499 commented 3 months ago

Describe the bug

I generated an AWQ model with `lmdeploy lite auto_awq`, then deployed the server with tp=2, and the following error occurred:

```
2024-04-30 08:25:44,179 - lmdeploy - INFO - input backend=turbomind, backend_config=TurbomindEngineConfig(model_name='qwen', model_format='awq', tp=2, session_len=16384, max_batch_size=4, cache_max_entry_count=0.8, cache_block_seq_len=64, quant_policy=8, rope_scaling_factor=0.0, use_logn_attn=False, download_dir=None, revision=None, max_prefill_token_num=8192, num_tokens_per_iter=0, max_prefill_iters=1)
2024-04-30 08:25:44,179 - lmdeploy - INFO - input chat_template_config=None
2024-04-30 08:25:45,243 - lmdeploy - INFO - updated chat_template_onfig=ChatTemplateConfig(model_name='qwen', system=None, meta_instruction=None, eosys=None, user=None, eoh=None, assistant=None, eoa=None, separator=None, capability=None, stop_words=None)
2024-04-30 08:25:45,243 - lmdeploy - WARNING - model_source: hf_model
2024-04-30 08:25:45,243 - lmdeploy - WARNING - model_name is deprecated in TurbomindEngineConfig and has no effect
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
2024-04-30 08:25:46,965 - lmdeploy - WARNING - model_config:
```

```
[llama]
model_name = qwen
tensor_para_size = 2
head_num = 40
kv_head_num = 40
vocab_size = 152064
num_layer = 40
inter_size = 13696
norm_eps = 1e-06
attn_bias = 1
start_id = 151643
end_id = 151645
session_len = 16384
weight_type = int4
rotary_embedding = 128
rope_theta = 1000000.0
size_per_head = 128
group_size = 128
max_batch_size = 4
max_context_token_num = 1
step_length = 1
cache_max_entry_count = 0.8
cache_block_seq_len = 64
cache_chunk_size = -1
num_tokens_per_iter = 8192
max_prefill_iters = 2
extra_tokens_per_iter = 0
use_context_fmha = 1
quant_policy = 8
max_position_embeddings = 32768
rope_scaling_factor = 0.0
use_dynamic_ntk = 0
use_logn_attn = 0
lora_policy =
lora_r = 0
lora_scale = 0.0
lora_max_wo_r = 0
lora_rank_pattern =
lora_scale_pattern =
```

```
[TM][INFO] Set logger level by INFO
[TM][WARNING] [LlamaTritonModel] max_context_token_num = 16384.
[TM][INFO] Set logger level by INFO
[TM][INFO] Set logger level by INFO
2024-04-30 08:25:47,770 - lmdeploy - WARNING - get 883 model params
Convert to turbomind format:   0%|          | 0/40 [00:00<?, ?it/s]
Traceback (most recent call last):
  File "/opt/py38/bin/lmdeploy", line 11, in <module>
    load_entry_point('lmdeploy', 'console_scripts', 'lmdeploy')()
  File "/opt/lmdeploy/lmdeploy/cli/entrypoint.py", line 37, in run
    args.run(args)
  File "/opt/lmdeploy/lmdeploy/cli/serve.py", line 283, in api_server
    run_api_server(args.model_path,
  File "/opt/lmdeploy/lmdeploy/serve/openai/api_server.py", line 1217, in serve
    VariableInterface.async_engine = pipeline_class(
  File "/opt/lmdeploy/lmdeploy/serve/async_engine.py", line 206, in __init__
    self._build_turbomind(model_path=model_path,
  File "/opt/lmdeploy/lmdeploy/serve/async_engine.py", line 254, in _build_turbomind
    self.engine = tm.TurboMind.from_pretrained(
  File "/opt/lmdeploy/lmdeploy/turbomind/turbomind.py", line 396, in from_pretrained
    return cls(model_path=pretrained_model_name_or_path,
  File "/opt/lmdeploy/lmdeploy/turbomind/turbomind.py", line 170, in __init__
    self.model_comm = self._from_hf(model_source=model_source,
  File "/opt/lmdeploy/lmdeploy/turbomind/turbomind.py", line 305, in _from_hf
    output_model.export()
  File "/opt/lmdeploy/lmdeploy/turbomind/deploy/target_model/base.py", line 273, in export
    self.export_transformer_block(bin, i)
  File "/opt/lmdeploy/lmdeploy/turbomind/deploy/target_model/w4.py", line 156, in export_transformer_block
    self.save_split(w2_sz, f'layers.{i}.feed_forward.w2.scales_zeros', 0)
  File "/opt/lmdeploy/lmdeploy/turbomind/deploy/target_model/base.py", line 246, in save_split
    assert tensor.shape[split_dim] % tp == 0
AssertionError
```

Reproduction

```
lmdeploy serve api_server ../pretrained-models/qwen1.5-14b-chat-w4-lmdeploy/ --backend turbomind --model-format awq --log-level INFO --tp 2 --quant-policy 8
```
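
For reference, the same engine settings can presumably be exercised through the Python pipeline API as well. This is a minimal sketch assuming the `TurbomindEngineConfig` values printed in the log above; it is not a verified reproduction beyond the CLI path:

```python
# Sketch of a Python-API equivalent of the failing CLI invocation,
# assuming the TurbomindEngineConfig values shown in the log above.
from lmdeploy import pipeline, TurbomindEngineConfig

engine_config = TurbomindEngineConfig(
    model_format='awq',    # AWQ weights produced by `lmdeploy lite auto_awq`
    tp=2,                  # tensor parallelism over 2 GPUs -> triggers the assert
    session_len=16384,
    max_batch_size=4,
    cache_max_entry_count=0.8,
    quant_policy=8,
)

# Building the pipeline converts the HF checkpoint to turbomind format,
# which is the step where the AssertionError in save_split() is raised.
pipe = pipeline('../pretrained-models/qwen1.5-14b-chat-w4-lmdeploy/',
                backend_config=engine_config)
```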

Environment

Using the 0.4.0 docker image.

Error traceback

No response

Vincent131499 commented 3 months ago

Deploying the server with tp=1 works fine.

Vincent131499 commented 3 months ago

please help!

lvhan028 commented 3 months ago

assert tensor.shape[split_dim] % tp == 0

The tensor shape of the quantization parameters is not divisible by tp, so the weights cannot be split for tensor parallelism.
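
Concretely, a back-of-the-envelope check with the values from the model_config dump above (illustrative numbers only, not an official diagnostic):

```python
# Values taken from the [llama] model_config dump in the issue.
inter_size = 13696   # FFN intermediate size of qwen1.5-14b
group_size = 128     # AWQ quantization group size
tp = 2               # requested tensor-parallel degree

# w2.scales_zeros holds one row per quantization group along split dim 0,
# and save_split() requires that row count to be divisible by tp.
groups = inter_size // group_size
print(groups)             # 107
print(groups % tp == 0)   # False -> `assert tensor.shape[split_dim] % tp == 0` fails
print(groups % 1 == 0)    # True  -> consistent with tp=1 working fine
```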

shikimoon commented 2 months ago

assert tensor.shape[split_dim] % tp == 0

The tensor shape of the quantization parameters is not divisible by tp, so the weights cannot be split for tensor parallelism.

I ran into the same problem. How can it be solved?

lvhan028 commented 2 months ago

It hasn't been resolved yet.