The run fails with this error:
INFO 06-07 17:21:01 gptq_marlin.py:133] The model is convertible to gptq_marlin during runtime. Using gptq_marlin kernel.
INFO 06-07 17:21:01 llm_engine.py:161] Initializing an LLM engine (v0.4.3) with config: model='/data/models/Qwen2-57B-A14B-Instruct-GPTQ-Int4', speculative_config=None, tokenizer='/data/models/Qwen2-57B-A14B-Instruct-GPTQ-Int4', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, rope_scaling=None, tokenizer_revision=None, trust_remote_code=False, dtype=torch.float16, max_seq_len=8192, download_dir=None, load_format=LoadFormat.AUTO, tensor_parallel_size=1, disable_custom_all_reduce=False, quantization=gptq_marlin, enforce_eager=False, kv_cache_dtype=auto, quantization_param_path=None, device_config=cuda, decoding_config=DecodingConfig(guided_decoding_backend='outlines'), seed=0)
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
[rank0]: Traceback (most recent call last):
[rank0]:   File "python3.10/runpy.py", line 196, in _run_module_as_main
[rank0]:     return _run_code(code, main_globals, None,
[rank0]:   File "python3.10/runpy.py", line 86, in _run_code
[rank0]:     exec(code, run_globals)
[rank0]:   File "python3.10/site-packages/vllm/entrypoints/openai/api_server.py", line 186, in <module>
[rank0]:     engine = AsyncLLMEngine.from_engine_args(
[rank0]:   File "python3.10/site-packages/vllm/engine/async_llm_engine.py", line 386, in from_engine_args
[rank0]:     engine = cls(
[rank0]:   File "python3.10/site-packages/vllm/engine/async_llm_engine.py", line 340, in __init__
[rank0]:     self.engine = self._init_engine(*args, **kwargs)
[rank0]:   File "python3.10/site-packages/vllm/engine/async_llm_engine.py", line 462, in _init_engine
[rank0]:     return engine_class(*args, **kwargs)
[rank0]:   File "python3.10/site-packages/vllm/engine/llm_engine.py", line 222, in __init__
[rank0]:     self.model_executor = executor_class(
[rank0]:   File "python3.10/site-packages/vllm/executor/executor_base.py", line 41, in __init__
[rank0]:   File "python3.10/site-packages/vllm/executor/gpu_executor.py", line 24, in _init_executor
[rank0]:   File "python3.10/site-packages/vllm/worker/worker.py", line 121, in load_model
[rank0]:   File "python3.10/site-packages/vllm/worker/model_runner.py", line 134, in load_model
[rank0]:     self.model = get_model(
[rank0]:   File "python3.10/site-packages/vllm/model_executor/model_loader/__init__.py", line 21, in get_model
[rank0]:     return loader.load_model(model_config=model_config,
[rank0]:   File "python3.10/site-packages/vllm/model_executor/model_loader/loader.py", line 240, in load_model
[rank0]:     model = _initialize_model(model_config, self.load_config,
[rank0]:   File "python3.10/site-packages/vllm/model_executor/model_loader/loader.py", line 91, in _initialize_model
[rank0]:     return model_class(config=model_config.hf_config,
[rank0]:   File "python3.10/site-packages/vllm/model_executor/models/qwen2_moe.py", line 389, in __init__
[rank0]:     self.model = Qwen2MoeModel(config, cache_config, quant_config)
[rank0]:   File "python3.10/site-packages/vllm/model_executor/models/qwen2_moe.py", line 349, in __init__
[rank0]:     self.layers = nn.ModuleList([
[rank0]:   File "python3.10/site-packages/vllm/model_executor/models/qwen2_moe.py", line 350, in <listcomp>
[rank0]:   File "python3.10/site-packages/vllm/model_executor/models/qwen2_moe.py", line 290, in __init__
[rank0]:     self.mlp = Qwen2MoeSparseMoeBlock(config=config,
[rank0]:   File "python3.10/site-packages/vllm/model_executor/models/qwen2_moe.py", line 114, in __init__
[rank0]:   File "python3.10/site-packages/vllm/model_executor/models/qwen2_moe.py", line 138, in pack_params
[rank0]:   File "python3.10/site-packages/torch/nn/modules/module.py", line 1709, in __getattr__
[rank0]:     raise AttributeError(f"'{type(self).__name__}' object has no attribute '{name}'")
[rank0]: AttributeError: 'MergedColumnParallelLinear' object has no attribute 'weight'. Did you mean: 'qweight'?
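For context on the final frame: the failure happens because `pack_params` in `qwen2_moe.py` reads each expert's dense `.weight`, while a GPTQ-quantized linear layer stores only packed integer weights under `qweight`. A minimal, torch-free sketch of that pattern (class and attribute names here mirror the traceback, but the bodies are illustrative, not vLLM's actual code):

```python
# Stand-in for a GPTQ-quantized expert layer: the packed int weights
# live in `qweight`; there is no dense fp16 `weight` attribute at all.
class MergedColumnParallelLinear:
    def __init__(self):
        self.qweight = [[0] * 8]  # stand-in for the packed int32 tensor

def pack_params(experts):
    # Mirrors the failing access: assumes every expert has a dense
    # `.weight`, which only holds for unquantized models.
    return [e.weight for e in experts]

experts = [MergedColumnParallelLinear() for _ in range(4)]
try:
    pack_params(experts)
except AttributeError as e:
    # prints: 'MergedColumnParallelLinear' object has no attribute 'weight'
    print(e)
```

This suggests the Qwen2 MoE path in vLLM 0.4.3 was written for unquantized weights only, which matches the reporter's suspicion below.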
是否已有关于该错误的issue或讨论? | Is there an existing issue / discussion for this?
该问题是否在FAQ中有解答? | Is there an existing answer for this in FAQ?
当前行为 | Current Behavior
When running the Qwen2-57B-A14B-Instruct-GPTQ-Int4 model with vLLM 0.4.3, startup fails immediately with the error below. I'm not sure whether this is a vLLM problem or a Qwen2 problem — is it because quantized MoE models are not supported?
Command: python -m vllm.entrypoints.openai.api_server --model /data/models/Qwen2-57B-A14B-Instruct-GPTQ-Int4 --max-model-len 8192 --gpu-memory-utilization 0.9
期望行为 | Expected Behavior
The model should load and serve normally; only the quantized MoE model hits this problem (the unquantized model works).
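This is not the actual vLLM fix, just a sketch of the kind of guard the MoE packing path could use so that quantized checkpoints fail with an actionable message instead of a bare `AttributeError` (function name and structure are hypothetical):

```python
def get_packable_weight(layer):
    # Prefer the dense fp16 weight used by the unquantized MoE path.
    if hasattr(layer, "weight"):
        return layer.weight
    # Detect GPTQ-style packed weights and fail with a clear message.
    if hasattr(layer, "qweight"):
        raise NotImplementedError(
            "quantized (GPTQ) expert layers cannot be packed as dense weights"
        )
    raise AttributeError("layer has neither 'weight' nor 'qweight'")
```

Until quantized MoE is supported, an early check like this would turn the crash into an explicit "not supported" error at load time.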
复现方法 | Steps To Reproduce
No response
运行环境 | Environment
备注 | Anything else?
No response