QwenLM / Qwen

The official repo of Qwen (通义千问) chat & pretrained large language model proposed by Alibaba Cloud.
Apache License 2.0

[BUG] Qwen/Qwen-72B-Chat-Int8 cannot run multi-GPU parallel computation #1222

Closed. gquanma closed this issue 5 months ago.

gquanma commented 5 months ago

Is there an existing issue / discussion for this?

Is there an existing answer for this in the FAQ?

Current Behavior

Qwen/Qwen-72B-Chat-Int8 does not compute in parallel across 4x V100 32GB GPUs. Memory is allocated on all 4 GPUs, but only one GPU is actually computing at any given time (see attached screenshot).

Expected Behavior

The compute load should be split evenly across the 4 GPUs and run in parallel.

Steps To Reproduce

The issue is reproducible with the following code:

```python
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen-72B-Chat-Int8", trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    "Qwen/Qwen-72B-Chat-Int8", device_map="auto", trust_remote_code=True
).eval()
response, _ = model.chat(tokenizer, "My colleague works diligently", history=None,
                         system="You will write beautiful compliments according to needs")
print(response)
```
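For reference, a quick way to confirm how `device_map="auto"` split the checkpoint, assuming the `model` object from the snippet above:

```python
# Assumes the `model` object loaded above. device_map="auto" assigns whole transformer
# layers to different GPUs (naive model parallelism), which is why only one GPU
# is busy at any given moment during generation.
print(model.hf_device_map)
```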

Environment

- OS: Ubuntu 22.04
- Python: 3.10.6
- Transformers: 4.39.3
- PyTorch: 2.1.0+cu121
- CUDA (`python -c 'import torch; print(torch.version.cuda)'`): 12.1

Anything else?

The code is taken from the official Hugging Face Quickstart: https://huggingface.co/Qwen/Qwen-72B-Chat-Int8

jklj077 commented 5 months ago

As explained in many previous issues, transformers only supports naive model parallelism: `device_map="auto"` places different layers on different GPUs and runs them sequentially, so only one GPU is active at a time. If you want tensor parallelism, try vLLM.
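A minimal sketch of tensor-parallel serving with vLLM (not code from this repo; assumes a vLLM version that supports GPTQ-quantized Qwen checkpoints and a machine with 4 visible GPUs):

```python
from vllm import LLM, SamplingParams

# Shard the weights and the computation across 4 GPUs (tensor parallelism),
# so every GPU works on every token instead of one layer group at a time.
llm = LLM(
    model="Qwen/Qwen-72B-Chat-Int8",
    tensor_parallel_size=4,
    quantization="gptq",          # the Int8 checkpoint is GPTQ-quantized
    trust_remote_code=True,
)

outputs = llm.generate(["My colleague works diligently"], SamplingParams(max_tokens=128))
print(outputs[0].outputs[0].text)
```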