QwenLM / Qwen2.5

Qwen2.5 is the large language model series developed by Qwen team, Alibaba Cloud.

AttributeError: 'MergedColumnParallelLinear' object has no attribute 'weight'. Did you mean: 'qweight' #506

Closed · jcxcer closed this issue 3 months ago

jcxcer commented 5 months ago

When running the Qwen2-57B-A14B-Instruct-GPTQ-Int4 model with vllm 0.4.3, it fails immediately on startup. I am not sure whether this is a vllm problem or a Qwen2 problem. Is it because quantized MoE models are not supported?

Command:

```shell
python -m vllm.entrypoints.openai.api_server --model /data/models/Qwen2-57B-A14B-Instruct-GPTQ-Int4 --max-model-len 8192 --gpu-memory-utilization 0.9
```

The resulting error:

```
INFO 06-07 17:21:01 gptq_marlin.py:133] The model is convertible to gptq_marlin during runtime. Using gptq_marlin kernel.
INFO 06-07 17:21:01 llm_engine.py:161] Initializing an LLM engine (v0.4.3) with config: model='/data/models/Qwen2-57B-A14B-Instruct-GPTQ-Int4', speculative_config=None, tokenizer='/data/models/Qwen2-57B-A14B-Instruct-GPTQ-Int4', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, rope_scaling=None, tokenizer_revision=None, trust_remote_code=False, dtype=torch.float16, max_seq_len=8192, download_dir=None, load_format=LoadFormat.AUTO, tensor_parallel_size=1, disable_custom_all_reduce=False, quantization=gptq_marlin, enforce_eager=False, kv_cache_dtype=auto, quantization_param_path=None, device_config=cuda, decoding_config=DecodingConfig(guided_decoding_backend='outlines'), seed=0)
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
[rank0]: Traceback (most recent call last):
[rank0]:   File "python3.10/runpy.py", line 196, in _run_module_as_main
[rank0]:     return _run_code(code, main_globals, None,
[rank0]:   File "python3.10/runpy.py", line 86, in _run_code
[rank0]:     exec(code, run_globals)
[rank0]:   File "python3.10/site-packages/vllm/entrypoints/openai/api_server.py", line 186, in <module>
[rank0]:     engine = AsyncLLMEngine.from_engine_args(
[rank0]:   File "python3.10/site-packages/vllm/engine/async_llm_engine.py", line 386, in from_engine_args
[rank0]:     engine = cls(
[rank0]:   File "python3.10/site-packages/vllm/engine/async_llm_engine.py", line 340, in __init__
[rank0]:     self.engine = self._init_engine(*args, **kwargs)
[rank0]:   File "python3.10/site-packages/vllm/engine/async_llm_engine.py", line 462, in _init_engine
[rank0]:     return engine_class(*args, **kwargs)
[rank0]:   File "python3.10/site-packages/vllm/engine/llm_engine.py", line 222, in __init__
[rank0]:     self.model_executor = executor_class(
[rank0]:   File "python3.10/site-packages/vllm/executor/executor_base.py", line 41, in __init__
[rank0]:   File "python3.10/site-packages/vllm/executor/gpu_executor.py", line 24, in _init_executor
[rank0]:   File "python3.10/site-packages/vllm/worker/worker.py", line 121, in load_model
[rank0]:   File "python3.10/site-packages/vllm/worker/model_runner.py", line 134, in load_model
[rank0]:     self.model = get_model(
[rank0]:   File "python3.10/site-packages/vllm/model_executor/model_loader/__init__.py", line 21, in get_model
[rank0]:     return loader.load_model(model_config=model_config,
[rank0]:   File "python3.10/site-packages/vllm/model_executor/model_loader/loader.py", line 240, in load_model
[rank0]:     model = _initialize_model(model_config, self.load_config,
[rank0]:   File "python3.10/site-packages/vllm/model_executor/model_loader/loader.py", line 91, in _initialize_model
[rank0]:     return model_class(config=model_config.hf_config,
[rank0]:   File "python3.10/site-packages/vllm/model_executor/models/qwen2_moe.py", line 389, in __init__
[rank0]:     self.model = Qwen2MoeModel(config, cache_config, quant_config)
[rank0]:   File "python3.10/site-packages/vllm/model_executor/models/qwen2_moe.py", line 349, in __init__
[rank0]:     self.layers = nn.ModuleList(
[rank0]:   File "python3.10/site-packages/vllm/model_executor/models/qwen2_moe.py", line 350, in <listcomp>
[rank0]:   File "python3.10/site-packages/vllm/model_executor/models/qwen2_moe.py", line 290, in __init__
[rank0]:     self.mlp = Qwen2MoeSparseMoeBlock(config=config,
[rank0]:   File "python3.10/site-packages/vllm/model_executor/models/qwen2_moe.py", line 114, in __init__
[rank0]:   File "python3.10/site-packages/vllm/model_executor/models/qwen2_moe.py", line 138, in pack_params
[rank0]:   File "python3.10/site-packages/torch/nn/modules/module.py", line 1709, in __getattr__
[rank0]:     raise AttributeError(f"'{type(self).__name__}' object has no attribute '{name}'")
[rank0]: AttributeError: 'MergedColumnParallelLinear' object has no attribute 'weight'. Did you mean: 'qweight'?
```

Only quantized MoE models have this problem.

Environment

jklj077 commented 5 months ago

Hi,

It is not possible to run quantized Qwen2MoE models with vllm right now (due to efficiency issues; we are working on it), so the error is expected.

  • Qwen2MoE fp16 or bf16: transformers and vllm both okay
  • Qwen2MoE GPTQ: transformers only
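Since transformers is described in this thread as the only working path for the GPTQ checkpoint, a minimal sketch of loading it that way may help; it assumes the standard `AutoModelForCausalLM` API with transformers and its GPTQ dependencies installed, and wraps the heavyweight load in a function since the 57B model needs substantial GPU memory:

```python
def load_with_transformers(model_path: str):
    """Sketch: load the GPTQ-quantized Qwen2MoE checkpoint with transformers.
    Assumes transformers plus its GPTQ backend (optimum / auto-gptq) are
    installed and a GPU with enough memory is available."""
    from transformers import AutoModelForCausalLM, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(model_path)
    model = AutoModelForCausalLM.from_pretrained(
        model_path,
        device_map="auto",   # spread the checkpoint across available GPUs
        torch_dtype="auto",  # use the dtype recorded in the model config
    )
    return tokenizer, model

# Path taken from the report above; the call is left commented out because
# it downloads/loads a very large checkpoint.
MODEL_PATH = "/data/models/Qwen2-57B-A14B-Instruct-GPTQ-Int4"
# tokenizer, model = load_with_transformers(MODEL_PATH)
```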

cywuuuu commented 5 months ago

> Hi,
>
> It is not possible to run quantized Qwen2MoE models with vllm right now (due to efficiency issues; we are working on it), so the error is expected.
>
> • Qwen2MoE fp16 or bf16: transformers and vllm both okay
> • Qwen2MoE GPTQ: transformers only

Hello, I am also facing this issue. Would the situation be different if I quantized the model myself with AWQ, instead of using the Hugging Face published version Qwen2-57B-A14B-Instruct-GPTQ-Int4?

jklj077 commented 5 months ago

I don't think so, in the sense that AWQ does not support quantizing Qwen2MoE models (IIRC).

github-actions[bot] commented 4 months ago

This issue has been automatically marked as inactive due to lack of recent activity. Should you believe it remains unresolved and warrants attention, kindly leave a comment on this thread.