QwenLM / Qwen

The official repo of Qwen (通义千问) chat & pretrained large language model proposed by Alibaba Cloud.
Apache License 2.0

[BUG] Qwen/Qwen-72B-Chat-Int8 cannot run multi-GPU parallel computation #1222

Closed. gquanma closed this issue 5 months ago.

gquanma commented 5 months ago

Is there an existing issue / discussion for this?

Is there an existing answer for this in the FAQ?

Current Behavior

Qwen/Qwen-72B-Chat-Int8 does not compute in parallel across 4x V100 32GB GPUs. Memory is allocated on all 4 GPUs, but only one GPU is actually computing at any given time (see attached screenshot).

Expected Behavior

The compute load should be split evenly across the 4 GPUs and run in parallel.

Steps To Reproduce

The issue is reproducible with the following code:

```python
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen-72B-Chat-Int8", trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    "Qwen/Qwen-72B-Chat-Int8", device_map="auto", trust_remote_code=True
).eval()
response, _ = model.chat(tokenizer, "My colleague works diligently", history=None,
                         system="You will write beautiful compliments according to needs")
print(response)
```
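For reference, a quick way to confirm how `device_map="auto"` split the checkpoint, assuming the `model` object from the snippet above:

```python
# Assumes the `model` object loaded above. device_map="auto" assigns whole transformer
# layers to different GPUs (naive model parallelism), which is why only one GPU
# is busy at any given moment during generation.
print(model.hf_device_map)
```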

Environment

- OS: Ubuntu 22.04
- Python: 3.10.6
- Transformers: 4.39.3
- PyTorch: 2.1.0+cu121
- CUDA (`python -c 'import torch; print(torch.version.cuda)'`): 12.1

Anything else?

The code is taken from the official Hugging Face Quickstart: https://huggingface.co/Qwen/Qwen-72B-Chat-Int8

jklj077 commented 5 months ago

As explained in many previous issues, transformers only supports naive model parallelism: `device_map="auto"` places different layers on different GPUs and runs them sequentially, so only one GPU is active at a time. If you want tensor parallelism, try vLLM.
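A minimal sketch of tensor-parallel serving with vLLM (not code from this repo; assumes a vLLM version that supports GPTQ-quantized Qwen checkpoints and a machine with 4 visible GPUs):

```python
from vllm import LLM, SamplingParams

# Shard the weights and the computation across 4 GPUs (tensor parallelism),
# so every GPU works on every token instead of one layer group at a time.
llm = LLM(
    model="Qwen/Qwen-72B-Chat-Int8",
    tensor_parallel_size=4,
    quantization="gptq",          # the Int8 checkpoint is GPTQ-quantized
    trust_remote_code=True,
)

outputs = llm.generate(["My colleague works diligently"], SamplingParams(max_tokens=128))
print(outputs[0].outputs[0].text)
```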