QwenLM / Qwen

The official repo of Qwen (通义千问) chat & pretrained large language model proposed by Alibaba Cloud.

[BUG] Multi-GPU deployment of the quantized Qwen-72B-Chat-Int4 via vLLM fails #1141

Closed · abbydev closed this issue 6 months ago

abbydev commented 6 months ago

Environment

- OS:  Ubuntu 11.4.0-1ubuntu1~22.04
- Python: 3.10 (same error with 3.8)
- Transformers: 4.38.2
- PyTorch: 2.1.2+cu121
- CUDA: 12.1
- xformers: 0.0.23.post1
- vLLM: https://github.com/QwenLM/vllm-gptq
- GPU: 4 × RTX 4090

Error output

python -m fastchat.serve.vllm_worker --model-path /root/qwen/Qwen-72B-Chat-Int4 --trust-remote-code --tensor-parallel-size 4 --dtype float16
WARNING 03-12 14:46:51 config.py:140] gptq quantization is not fully optimized yet. The speed can be slower than non-quantized models.
2024-03-12 14:46:53,043 WARNING utils.py:575 -- Detecting docker specified CPUs. In previous versions of Ray, CPU detection in containers was incorrect. Please ensure that Ray has enough CPUs allocated. As a temporary workaround to revert to the prior behavior, set `RAY_USE_MULTIPROCESSING_CPU_COUNT=1` as an env var before starting Ray. Set the env var: `RAY_DISABLE_DOCKER_CPU_WARNING=1` to mute this warning.
2024-03-12 14:46:53,138 INFO worker.py:1724 -- Started a local Ray instance.
INFO 03-12 14:46:54 llm_engine.py:72] Initializing an LLM engine with config: model='/root/qwen/Qwen-72B-Chat-Int4', tokenizer='/root/qwen/Qwen-72B-Chat-Int4', tokenizer_mode=auto, revision=None, tokenizer_revision=None, trust_remote_code=True, dtype=torch.float16, max_seq_len=32768, download_dir=None, load_format=auto, tensor_parallel_size=4, quantization=gptq, seed=0)
WARNING 03-12 14:46:54 tokenizer.py:66] Using a slow tokenizer. This might cause a significant slowdown. Consider using a fast tokenizer instead.
2024-03-12 14:46:56 | ERROR | stderr | Traceback (most recent call last):
2024-03-12 14:46:56 | ERROR | stderr |   File "/root/lab/venv-qwen/lib/python3.8/runpy.py", line 194, in _run_module_as_main
2024-03-12 14:46:56 | ERROR | stderr |     return _run_code(code, main_globals, None,
2024-03-12 14:46:56 | ERROR | stderr |   File "/root/lab/venv-qwen/lib/python3.8/runpy.py", line 87, in _run_code
2024-03-12 14:46:56 | ERROR | stderr |     exec(code, run_globals)
2024-03-12 14:46:56 | ERROR | stderr |   File "/root/lab/venv-qwen/lib/python3.8/site-packages/fastchat/serve/vllm_worker.py", line 290, in <module>
2024-03-12 14:46:56 | ERROR | stderr |     engine = AsyncLLMEngine.from_engine_args(engine_args)
2024-03-12 14:46:56 | ERROR | stderr |   File "/root/lab/vllm-gptq/vllm/engine/async_llm_engine.py", line 486, in from_engine_args
2024-03-12 14:46:56 | ERROR | stderr |     engine = cls(parallel_config.worker_use_ray,
2024-03-12 14:46:56 | ERROR | stderr |   File "/root/lab/vllm-gptq/vllm/engine/async_llm_engine.py", line 269, in __init__
2024-03-12 14:46:56 | ERROR | stderr |     self.engine = self._init_engine(*args, **kwargs)
2024-03-12 14:46:56 | ERROR | stderr |   File "/root/lab/vllm-gptq/vllm/engine/async_llm_engine.py", line 305, in _init_engine
2024-03-12 14:46:56 | ERROR | stderr |     return engine_class(*args, **kwargs)
2024-03-12 14:46:56 | ERROR | stderr |   File "/root/lab/vllm-gptq/vllm/engine/llm_engine.py", line 108, in __init__
2024-03-12 14:46:56 | ERROR | stderr |     self._init_workers_ray(placement_group)
2024-03-12 14:46:56 | ERROR | stderr |   File "/root/lab/vllm-gptq/vllm/engine/llm_engine.py", line 157, in _init_workers_ray
2024-03-12 14:46:56 | ERROR | stderr |     from vllm.worker.worker import Worker
2024-03-12 14:46:56 | ERROR | stderr |   File "/root/lab/vllm-gptq/vllm/worker/worker.py", line 10, in <module>
2024-03-12 14:46:56 | ERROR | stderr |     from vllm.model_executor import get_model, InputMetadata, set_random_seed
2024-03-12 14:46:56 | ERROR | stderr |   File "/root/lab/vllm-gptq/vllm/model_executor/__init__.py", line 2, in <module>
2024-03-12 14:46:56 | ERROR | stderr |     from vllm.model_executor.model_loader import get_model
2024-03-12 14:46:56 | ERROR | stderr |   File "/root/lab/vllm-gptq/vllm/model_executor/model_loader.py", line 10, in <module>
2024-03-12 14:46:56 | ERROR | stderr |     from vllm.model_executor.models import *
2024-03-12 14:46:56 | ERROR | stderr |   File "/root/lab/vllm-gptq/vllm/model_executor/models/__init__.py", line 1, in <module>
2024-03-12 14:46:56 | ERROR | stderr |     from vllm.model_executor.models.aquila import AquilaForCausalLM
2024-03-12 14:46:56 | ERROR | stderr |   File "/root/lab/vllm-gptq/vllm/model_executor/models/aquila.py", line 34, in <module>
2024-03-12 14:46:56 | ERROR | stderr |     from vllm.model_executor.layers.activation import SiluAndMul
2024-03-12 14:46:56 | ERROR | stderr |   File "/root/lab/vllm-gptq/vllm/model_executor/layers/activation.py", line 8, in <module>
2024-03-12 14:46:56 | ERROR | stderr |     from vllm.model_executor.layers.quantization import QuantizationConfig
2024-03-12 14:46:56 | ERROR | stderr |   File "/root/lab/vllm-gptq/vllm/model_executor/layers/quantization/__init__.py", line 3, in <module>
2024-03-12 14:46:56 | ERROR | stderr |     from vllm.model_executor.layers.quantization.awq import AWQConfig
2024-03-12 14:46:56 | ERROR | stderr |   File "/root/lab/vllm-gptq/vllm/model_executor/layers/quantization/awq.py", line 6, in <module>
2024-03-12 14:46:56 | ERROR | stderr |     from vllm import quantization_ops
2024-03-12 14:46:56 | ERROR | stderr | ImportError: /root/lab/vllm-gptq/vllm/quantization_ops.cpython-38-x86_64-linux-gnu.so: undefined symbol: _ZN2at4_ops19empty_memory_format4callEN3c108ArrayRefINS2_6SymIntEEESt8optionalINS2_10ScalarTypeEES6_INS2_6LayoutEES6_INS2_6DeviceEES6_IbES6_INS2_12MemoryFormatEE

What I tried

# Tried the approach from the issue below, which did not resolve the problem; vllm is still the fork from https://github.com/QwenLM/vllm-gptq
https://github.com/vllm-project/vllm/issues/2797
jklj077 commented 6 months ago

Do not use https://github.com/QwenLM/vllm-gptq.

If you change your environment, especially pytorch, xformers, or cuda, you need to recompile or reinstall vllm, and check carefully that installing vllm has not in turn replaced those dependencies.
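
As a minimal sketch (package names and steps here are assumptions about a typical pip environment, not exact instructions), switching back to upstream vLLM and checking that the compiled ops match the installed PyTorch could look like:

# Remove the source-built fork (it installs under the vllm package name) and install upstream vLLM,
# whose wheels are built against a matching PyTorch/CUDA toolchain.
pip uninstall -y vllm
pip install -U vllm
# Sanity check: this import fails with an undefined-symbol error like the one above if the
# compiled extension and the installed PyTorch were built against different ABIs.
python -c "import torch, vllm; print(torch.__version__, vllm.__version__)"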

jklj077 commented 6 months ago

The official vLLM now supports GPTQ quantization; the relevant section of the README has been updated.
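
For reference, a minimal sketch of serving the checkpoint with upstream vLLM's OpenAI-compatible server (flag names as in recent vLLM releases; the model path follows the one used above):

python -m vllm.entrypoints.openai.api_server \
    --model /root/qwen/Qwen-72B-Chat-Int4 \
    --trust-remote-code \
    --quantization gptq \
    --dtype float16 \
    --tensor-parallel-size 4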

abbydev commented 6 months ago

The official vLLM now supports GPTQ quantization; the relevant section of the README has been updated.

OK, thank you. I'll try again and report back later.

jklj077 commented 6 months ago

vLLM pre-allocates GPU memory for KV cache blocks. That message means the memory across these cards can hold 12656 tokens in total, but the model's configured maximum sequence length alone already exceeds that.

You can lower it as needed to a value below 12656, for example by passing --max-model-len 12288.
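
Putting both suggestions together, a sketch of the original FastChat command with the sequence length capped (assuming FastChat's vllm_worker forwards vLLM engine arguments such as --max-model-len):

python -m fastchat.serve.vllm_worker \
    --model-path /root/qwen/Qwen-72B-Chat-Int4 \
    --trust-remote-code \
    --tensor-parallel-size 4 \
    --dtype float16 \
    --max-model-len 12288  # must stay below the 12656-token KV cache budget reported by vLLM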

abbydev commented 6 months ago

@jklj077 Verified, it works now. Thanks!