The run fails with this error:
INFO 06-07 17:21:01 gptq_marlin.py:133] The model is convertible to gptq_marlin during runtime. Using gptq_marlin kernel.
INFO 06-07 17:21:01 llm_engine.py:161] Initializing an LLM engine (v0.4.3) with config: model='/data/models/Qwen2-57B-A14B-Instruct-GPTQ-Int4', speculative_config=None, tokenizer='/data/models/Qwen2-57B-A14B-Instruct-GPTQ-Int4', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, rope_scaling=None, tokenizer_revision=None, trust_remote_code=False, dtype=torch.float16, max_seq_len=8192, download_dir=None, load_format=LoadFormat.AUTO, tensor_parallel_size=1, disable_custom_all_reduce=False, quantization=gptq_marlin, enforce_eager=False, kv_cache_dtype=auto, quantization_param_path=None, device_config=cuda, decoding_config=DecodingConfig(guided_decoding_backend='outlines'), seed=0)
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
[rank0]: Traceback (most recent call last):
[rank0]:   File "python3.10/runpy.py", line 196, in _run_module_as_main
[rank0]:     return _run_code(code, main_globals, None,
[rank0]:   File "python3.10/runpy.py", line 86, in _run_code
[rank0]:     exec(code, run_globals)
[rank0]:   File "python3.10/site-packages/vllm/entrypoints/openai/api_server.py", line 186, in <module>
[rank0]:     engine = AsyncLLMEngine.from_engine_args(
[rank0]:   File "python3.10/site-packages/vllm/engine/async_llm_engine.py", line 386, in from_engine_args
[rank0]:     engine = cls(
[rank0]:   File "python3.10/site-packages/vllm/engine/async_llm_engine.py", line 340, in __init__
[rank0]:     self.engine = self._init_engine(*args, **kwargs)
[rank0]:   File "python3.10/site-packages/vllm/engine/async_llm_engine.py", line 462, in _init_engine
[rank0]:     return engine_class(*args, **kwargs)
[rank0]:   File "python3.10/site-packages/vllm/engine/llm_engine.py", line 222, in __init__
[rank0]:     self.model_executor = executor_class(
[rank0]:   File "python3.10/site-packages/vllm/executor/executor_base.py", line 41, in __init__
[rank0]:   File "python3.10/site-packages/vllm/executor/gpu_executor.py", line 24, in _init_executor
[rank0]:   File "python3.10/site-packages/vllm/worker/worker.py", line 121, in load_model
[rank0]:   File "python3.10/site-packages/vllm/worker/model_runner.py", line 134, in load_model
[rank0]:     self.model = get_model(
[rank0]:   File "python3.10/site-packages/vllm/model_executor/model_loader/__init__.py", line 21, in get_model
[rank0]:     return loader.load_model(model_config=model_config,
[rank0]:   File "python3.10/site-packages/vllm/model_executor/model_loader/loader.py", line 240, in load_model
[rank0]:     model = _initialize_model(model_config, self.load_config,
[rank0]:   File "python3.10/site-packages/vllm/model_executor/model_loader/loader.py", line 91, in _initialize_model
[rank0]:     return model_class(config=model_config.hf_config,
[rank0]:   File "python3.10/site-packages/vllm/model_executor/models/qwen2_moe.py", line 389, in __init__
[rank0]:     self.model = Qwen2MoeModel(config, cache_config, quant_config)
[rank0]:   File "python3.10/site-packages/vllm/model_executor/models/qwen2_moe.py", line 349, in __init__
[rank0]:     self.layers = nn.ModuleList([
[rank0]:   File "python3.10/site-packages/vllm/model_executor/models/qwen2_moe.py", line 350, in <listcomp>
[rank0]:   File "python3.10/site-packages/vllm/model_executor/models/qwen2_moe.py", line 290, in __init__
[rank0]:     self.mlp = Qwen2MoeSparseMoeBlock(config=config,
[rank0]:   File "python3.10/site-packages/vllm/model_executor/models/qwen2_moe.py", line 114, in __init__
[rank0]:   File "python3.10/site-packages/vllm/model_executor/models/qwen2_moe.py", line 138, in pack_params
[rank0]:   File "python3.10/site-packages/torch/nn/modules/module.py", line 1709, in __getattr__
[rank0]:     raise AttributeError(f"'{type(self).__name__}' object has no attribute '{name}'")
[rank0]: AttributeError: 'MergedColumnParallelLinear' object has no attribute 'weight'. Did you mean: 'qweight'?
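For context on the final frame: the failure happens because `pack_params` in `qwen2_moe.py` reads each expert's dense `.weight`, while a GPTQ-quantized linear layer stores only packed integer weights under `qweight`. A minimal, torch-free sketch of that pattern (class and attribute names here mirror the traceback, but the bodies are illustrative, not vLLM's actual code):

```python
# Stand-in for a GPTQ-quantized expert layer: the packed int weights
# live in `qweight`; there is no dense fp16 `weight` attribute at all.
class MergedColumnParallelLinear:
    def __init__(self):
        self.qweight = [[0] * 8]  # stand-in for the packed int32 tensor

def pack_params(experts):
    # Mirrors the failing access: assumes every expert has a dense
    # `.weight`, which only holds for unquantized models.
    return [e.weight for e in experts]

experts = [MergedColumnParallelLinear() for _ in range(4)]
try:
    pack_params(experts)
except AttributeError as e:
    # prints: 'MergedColumnParallelLinear' object has no attribute 'weight'
    print(e)
```

This suggests the Qwen2 MoE path in vLLM 0.4.3 was written for unquantized weights only, which matches the reporter's suspicion below.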
是否已有关于该错误的issue或讨论? | Is there an existing issue / discussion for this?
该问题是否在FAQ中有解答? | Is there an existing answer for this in FAQ?
当前行为 | Current Behavior
When running the Qwen2-57B-A14B-Instruct-GPTQ-Int4 model with vLLM 0.4.3, startup fails immediately with the error below. I'm not sure whether this is a vLLM problem or a Qwen2 problem — is it because quantized MoE models are not supported?
Command: python -m vllm.entrypoints.openai.api_server --model /data/models/Qwen2-57B-A14B-Instruct-GPTQ-Int4 --max-model-len 8192 --gpu-memory-utilization 0.9
期望行为 | Expected Behavior
The model should load and serve normally; only the quantized MoE model hits this problem (the unquantized model works).
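This is not the actual vLLM fix, just a sketch of the kind of guard the MoE packing path could use so that quantized checkpoints fail with an actionable message instead of a bare `AttributeError` (function name and structure are hypothetical):

```python
def get_packable_weight(layer):
    # Prefer the dense fp16 weight used by the unquantized MoE path.
    if hasattr(layer, "weight"):
        return layer.weight
    # Detect GPTQ-style packed weights and fail with a clear message.
    if hasattr(layer, "qweight"):
        raise NotImplementedError(
            "quantized (GPTQ) expert layers cannot be packed as dense weights"
        )
    raise AttributeError("layer has neither 'weight' nor 'qweight'")
```

Until quantized MoE is supported, an early check like this would turn the crash into an explicit "not supported" error at load time.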
复现方法 | Steps To Reproduce
No response
运行环境 | Environment
备注 | Anything else?
No response