alibaba / Pai-Megatron-Patch

The official repo of Pai-Megatron-Patch for LLM & VLM large-scale training, developed by Alibaba Cloud.
Apache License 2.0

Abnormal initial loss when seq len is increased #300

Closed: Jayce1kk closed this issue 1 month ago

Jayce1kk commented 2 months ago

Following the readme workflow of the Megatron-LM-Dense framework (not mcore), I ran the hf2megatron conversion for llama3 with tp=4 and pp=2, then trained across multiple machines (2 nodes of 4090s). When seq_len is set above 1024, the loss is abnormal, around 5~6; but at seq_len 1024 the loss is a normal 2.1. Could someone explain why?
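For context, the parallelism and sequence-length settings described above would appear as flags on a Megatron-LM style launch command. The sketch below is a hypothetical illustration only: the script name, node layout, and omitted arguments are placeholders, not taken from this repo's scripts.

```shell
# Hypothetical launch sketch; script name and trailing args are placeholders.
# Reported setup: tp=4, pp=2 on 2 nodes of 4x 4090 each.
# --seq-length above 1024 reportedly produced an abnormal initial loss (5~6);
# --max-position-embeddings should match the converted HF llama3 config.
torchrun --nnodes=2 --nproc_per_node=4 pretrain_gpt.py \
    --tensor-model-parallel-size 4 \
    --pipeline-model-parallel-size 2 \
    --seq-length 2048 \
    --max-position-embeddings 8192 \
    ...
```

A mismatch between the checkpoint-conversion settings and these runtime flags (for example, positional-embedding parameters that depend on sequence length) is one plausible place such a seq_len-dependent loss anomaly could originate, which may be why the fix landed in the conversion code.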

wuduher commented 2 months ago

Same question.

wuduher commented 2 months ago

Same question.

I worked around it by switching to the mcore framework, though it seems mcore model conversion doesn't support pp parallelism?

jerryli1981 commented 1 month ago

Hello, the Megatron side has been fixed. Please review (CR): https://github.com/alibaba/Pai-Megatron-Patch/pull/317