QwenLM / Qwen

The official repo of Qwen (通义千问) chat & pretrained large language model proposed by Alibaba Cloud.
Apache License 2.0
13.59k stars · 1.11k forks

Quantized model: how to run distributed streaming inference? #896

Closed YYGe01 closed 9 months ago

YYGe01 commented 9 months ago

Hello. Following the tutorial, I have the Qwen-14B-Chat-Int4 quantized model running streaming inference on a single GPU:

from transformers import AutoModelForCausalLM, AutoTokenizer
from transformers.generation import GenerationConfig

tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(model_path,
                                             device_map="auto",
                                             trust_remote_code=True)
# device_map="auto" already places the model, so an extra model.to(device) is unnecessary
model.generation_config = GenerationConfig.from_pretrained(model_path, trust_remote_code=True)
for response in model.chat_stream(tokenizer,
                                  query,
                                  history=history,
                                  generation_config=model.generation_config):
    print(response)

I now have 4 GPUs. How can I run distributed inference across them? Thanks.
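For reference, with `transformers` plus `accelerate` installed, a multi-GPU layer split is usually achieved by keeping `device_map="auto"` and optionally passing a `max_memory` map to cap each card. The helper below and the `"20GiB"` per-GPU budget are illustrative assumptions, not values from this thread; a minimal sketch:

```python
def make_max_memory(num_gpus, per_gpu="20GiB"):
    """Build a max_memory map (GPU index -> budget) for transformers'
    device_map="auto" layer sharding. Values here are assumptions; tune
    per_gpu to your actual cards."""
    return {i: per_gpu for i in range(num_gpus)}


# Hypothetical usage (requires 4 visible GPUs, accelerate, and the Qwen weights):
# from transformers import AutoModelForCausalLM
# model = AutoModelForCausalLM.from_pretrained(
#     "Qwen/Qwen-14B-Chat-Int4",
#     device_map="auto",             # let accelerate spread layers over GPUs
#     max_memory=make_max_memory(4), # cap each of the 4 GPUs
#     trust_remote_code=True,
# )
```

With the model sharded this way, the single-GPU `chat_stream` loop above works unchanged, since generation moves activations between devices automatically.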

YYGe01 commented 9 months ago

Could you provide a ready-to-use script?

jklj077 commented 9 months ago

Please see the deployment section of the README.