QwenLM / Qwen

The official repo of Qwen (通义千问) chat & pretrained large language model proposed by Alibaba Cloud.
Apache License 2.0
13.59k stars · 1.11k forks

Quantized model: how to run distributed streaming inference? #896

Closed YYGe01 closed 9 months ago

YYGe01 commented 9 months ago

Hello. Following the tutorial, I have the Qwen-14B-Chat-Int4 quantized model running streaming inference on a single GPU:

from transformers import AutoModelForCausalLM, AutoTokenizer
from transformers.generation import GenerationConfig

tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(model_path,
                                             device_map="auto",
                                             trust_remote_code=True)
# device_map="auto" already places the model, so an extra model.to(device) is unnecessary
model.generation_config = GenerationConfig.from_pretrained(model_path, trust_remote_code=True)
for response in model.chat_stream(tokenizer,
                                  query,
                                  history=history,
                                  generation_config=model.generation_config):
    print(response)

I now have 4 GPUs. How can I run distributed inference across them? Thanks.
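For reference, with `transformers` plus `accelerate` installed, a multi-GPU layer split is usually achieved by keeping `device_map="auto"` and optionally passing a `max_memory` map to cap each card. The helper below and the `"20GiB"` per-GPU budget are illustrative assumptions, not values from this thread; a minimal sketch:

```python
def make_max_memory(num_gpus, per_gpu="20GiB"):
    """Build a max_memory map (GPU index -> budget) for transformers'
    device_map="auto" layer sharding. Values here are assumptions; tune
    per_gpu to your actual cards."""
    return {i: per_gpu for i in range(num_gpus)}


# Hypothetical usage (requires 4 visible GPUs, accelerate, and the Qwen weights):
# from transformers import AutoModelForCausalLM
# model = AutoModelForCausalLM.from_pretrained(
#     "Qwen/Qwen-14B-Chat-Int4",
#     device_map="auto",             # let accelerate spread layers over GPUs
#     max_memory=make_max_memory(4), # cap each of the 4 GPUs
#     trust_remote_code=True,
# )
```

With the model sharded this way, the single-GPU `chat_stream` loop above works unchanged, since generation moves activations between devices automatically.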

YYGe01 commented 9 months ago

Could you provide a ready-to-use script?

jklj077 commented 9 months ago

Please see the deployment section of the README.