Is there an existing issue / discussion for this?
Is there an existing answer for this in FAQ?
Current Behavior
Qwen/Qwen-72B-Chat-Int8 does not compute in parallel across 4× V100 32GB GPUs. Memory on all four GPUs is allocated, but only one GPU is actually computing at any given time.
Expected Behavior
Compute is split evenly across the four GPUs and runs in parallel.
Steps To Reproduce
The following code reproduces the issue every time:

```python
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained(
    "Qwen/Qwen-72B-Chat-Int8", trust_remote_code=True
)
model = AutoModelForCausalLM.from_pretrained(
    "Qwen/Qwen-72B-Chat-Int8",
    device_map="auto",
    trust_remote_code=True,
).eval()
response, _ = model.chat(
    tokenizer,
    "My colleague works diligently",
    history=None,
    system="You will write beautiful compliments according to needs",
)
print(response)
```
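For context on the observed behavior: with `device_map="auto"`, transformers/accelerate place whole decoder layers on different GPUs and execute them sequentially during a forward pass (naive model parallelism, not tensor parallelism), which would explain the one-GPU-at-a-time utilization. A minimal sketch of how to check where the layers landed, using a mock device map for illustration (layer names are illustrative; the real mapping is read from `model.hf_device_map` after loading):

```python
from collections import Counter

# Mock of what model.hf_device_map typically looks like after loading with
# device_map="auto" on 4 GPUs (layer names here are illustrative, not exact).
device_map = {
    "transformer.wte": 0,
    "transformer.h.0": 0,
    "transformer.h.1": 1,
    "transformer.h.2": 2,
    "transformer.h.3": 3,
    "transformer.ln_f": 3,
    "lm_head": 3,
}

# Each module lives on exactly one GPU; a forward pass walks the layers in
# order, so only the GPU holding the current layer is busy at any moment.
layers_per_gpu = Counter(device_map.values())
print(layers_per_gpu)  # → Counter({3: 3, 0: 2, 1: 1, 2: 1})
```

If every module maps to a single distinct device like this, the memory usage across all four cards with single-GPU compute is expected behavior for this loading path.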
Environment
Anything else?
The code is taken from the official Hugging Face Quickstart: https://huggingface.co/Qwen/Qwen-72B-Chat-Int8