How to convert the 2.0-51B-hf HuggingFace model for pipeline parallelism (pp) and tensor parallelism (tp)

IEIT-Yuan / Yuan-2.0

Yuan 2.0 Large Language Model


How to convert the 2.0-51B-hf HuggingFace model for pipeline parallelism (pp) and tensor parallelism (tp) #113

Open 18842685792 opened 8 months ago

18842685792 commented 8 months ago

The GitHub repo documents how to convert the original 51B model, but not the 2.0-51B-hf version.

zhangzeru04 commented 8 months ago

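> The GitHub repo documents how to convert the original 51B model, but not the 2.0-51B-hf version.

The hf version of the model uses a generic implementation, so we did not provide a dedicated conversion path. (1) You can get model parallelism by defining your own model-splitting strategy and then loading the sharded checkpoint with Accelerate (using a custom device_map); see https://github.com/huggingface/accelerate. (2) If you want tensor-parallel inference, see https://github.com/BlackSamorez/tensor_parallel.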
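For option (1), a custom device_map is just a dictionary from module names to device indices. Below is a minimal sketch of how one might build it. The module names (`model.embed_tokens`, `model.layers.N`, `model.norm`, `lm_head`) follow the common LLaMA-style layout and are an assumption here; check the real names with `model.named_modules()` before using it. The helper `build_device_map` is hypothetical, not part of Accelerate.

```python
def build_device_map(num_layers: int, num_devices: int) -> dict:
    """Assign contiguous blocks of decoder layers to GPUs 0..num_devices-1.

    Module names are assumed (LLaMA-style); verify them against the real model.
    """
    if num_layers < 1 or num_devices < 1:
        raise ValueError("num_layers and num_devices must be positive")

    per_device = -(-num_layers // num_devices)  # ceiling division
    device_map = {"model.embed_tokens": 0}      # embeddings go with the first block
    for i in range(num_layers):
        device_map[f"model.layers.{i}"] = min(i // per_device, num_devices - 1)

    # Final norm and output head live on the same GPU as the last decoder layer.
    last_device = device_map[f"model.layers.{num_layers - 1}"]
    device_map["model.norm"] = last_device
    device_map["lm_head"] = last_device
    return device_map


example_map = build_device_map(num_layers=8, num_devices=2)

# Possible usage (not executed here; the path is a placeholder):
# from transformers import AutoConfig, AutoModelForCausalLM
# config = AutoConfig.from_pretrained("/path/to/2.0-51B-hf", trust_remote_code=True)
# model = AutoModelForCausalLM.from_pretrained(
#     "/path/to/2.0-51B-hf",
#     device_map=build_device_map(config.num_hidden_layers, 2),
#     trust_remote_code=True,
# )
```

Note that this kind of layer-wise placement runs the GPUs one after another, so it mainly helps the model fit in memory; it is not pipelined execution.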

Shawn-IEITSystems commented 7 months ago

@18842685792 For the hf version of the model, would you like us to provide an already-converted model, or a conversion script?

18842685792 commented 7 months ago

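I have never studied model training or conversion, and after reading those two documents I still don't know where to start, so I would like a CPU-based conversion script.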

18842685792 commented 7 months ago

How much does inference speed improve after the tensor-parallel and pipeline-parallel conversion?

Shawn-IEITSystems commented 7 months ago

> How much does inference speed improve after the tensor-parallel and pipeline-parallel conversion?

@zhaoxudong01-ieisystem Please evaluate this.

zhaoxudong01 commented 7 months ago

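> I have never studied model training or conversion, and after reading those two documents I still don't know where to start, so I would like a CPU-based conversion script.

Enabling tensor parallelism for 51B-hf does not require any model conversion.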

```python
import transformers
import tensor_parallel as tp

# Load the tokenizer and model as usual (example model from the tensor_parallel README).
tokenizer = transformers.AutoTokenizer.from_pretrained("facebook/opt-13b")
model = transformers.AutoModelForCausalLM.from_pretrained("facebook/opt-13b")  # use opt-125m for testing

# Shard the model across two GPUs.
model = tp.tensor_parallel(model, ["cuda:0", "cuda:1"])  # <- each GPU has half the weights

# Inference: inputs go to the first device.
inputs = tokenizer("A cat sat", return_tensors="pt")["input_ids"].to("cuda:0")
outputs = model.generate(inputs, num_beams=5)
print(tokenizer.decode(outputs[0]))  # A cat sat on my lap for a few minutes ...

model(input_ids=inputs, labels=inputs).loss.backward()  # training works as usual
```

Just follow the code above.
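To make "each GPU has half the weights" concrete, here is a small pure-Python sketch of one common tensor-parallel scheme: the output rows of a linear layer's weight matrix are split across shards, each shard computes its slice of the output, and the slices are concatenated. This is only an illustration; it is not necessarily the exact partitioning the tensor_parallel library chooses for every layer, and all function names are made up for this example.

```python
def matvec(weight, x):
    """Multiply a weight matrix (list of rows) by a vector x."""
    return [sum(w * v for w, v in zip(row, x)) for row in weight]


def split_rows(weight, num_shards):
    """Split the output dimension (rows) into contiguous, near-equal shards."""
    size = -(-len(weight) // num_shards)  # ceiling division
    return [weight[i:i + size] for i in range(0, len(weight), size)]


def tensor_parallel_matvec(weight, x, num_shards):
    """Compute weight @ x shard by shard, then gather the partial outputs."""
    shards = split_rows(weight, num_shards)
    # On real hardware each shard would sit on its own GPU and run concurrently.
    partial_outputs = [matvec(shard, x) for shard in shards]
    # Concatenating the slices plays the role of the all-gather step.
    return [y for part in partial_outputs for y in part]


W = [[1, 2, 3], [4, 5, 6], [7, 8, 9], [10, 11, 12]]
x = [1, 0, -1]

full = matvec(W, x)
sharded = tensor_parallel_matvec(W, x, num_shards=2)
print(full == sharded)
```

Because every shard sees the full input and owns only part of the weights, no offline conversion of the checkpoint is needed; the split happens when the model is loaded.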

zhaoxudong01 commented 7 months ago

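> How much does inference speed improve after the tensor-parallel and pipeline-parallel conversion?

An issue in the tensor_parallel repo reports two-GPU speedups for llama-7B and OPT, which you can use as a reference: https://github.com/BlackSamorez/tensor_parallel/issues/66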
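Since those numbers are for other models, the most reliable answer for 51B-hf is to measure it on your own hardware. The sketch below is one simple way to do that; `tokens_per_second` is a hypothetical helper, and `generate_fn` stands for whatever call you use to generate a fixed number of new tokens.

```python
import time


def tokens_per_second(generate_fn, new_tokens, repeats=3):
    """Return the best observed throughput of generate_fn over several runs."""
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        generate_fn()
        # For GPU work, call torch.cuda.synchronize() inside generate_fn
        # so the timer covers the whole computation.
        best = min(best, time.perf_counter() - start)
    return new_tokens / max(best, 1e-9)  # guard against a zero-length interval


# Stand-in workload so the sketch runs without a model.
demo_tps = tokens_per_second(lambda: time.sleep(0.01), new_tokens=10, repeats=2)

# Possible usage (not executed here):
# single = tokens_per_second(lambda: model.generate(inputs, max_new_tokens=64), 64)
# sharded = tokens_per_second(lambda: tp_model.generate(inputs, max_new_tokens=64), 64)
# print("speedup:", sharded / single)
```

Run both configurations with the same prompt and the same number of new tokens so the ratio is meaningful.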