SparkJiao / llama-pipeline-parallel

A prototype repo for hybrid training with pipeline parallelism and distributed data parallelism, with comments on the core code snippets. Feel free to copy the code and open discussions about any problems you have encountered.

Asking for help #3

Closed newtonysls closed 1 year ago

newtonysls commented 1 year ago

Has anyone seen abnormal loss behavior? After I adapted the model to pipeline parallelism, the loss explodes within a few samples of calling train_batch. Even with the learning rate set to 0 the loss still changes, and strangely the model weights change too, which is very confusing. With eval_batch, however, the loss stays within the normal range.

SparkJiao commented 1 year ago

In cases like this my guess is that the labels are set up incorrectly, for example a sample whose labels are all ignore_index. I would start by ruling out that kind of error.

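As a concrete way to rule this out, here is a minimal sketch that reports samples whose labels are entirely the ignore index. It assumes each batch exposes a `labels` tensor and that `-100` is used as the ignore index (the Hugging Face convention); adjust both to match your own collator.

```python
import torch

IGNORE_INDEX = -100  # assumed ignore index; match whatever your collator actually uses

def find_all_ignored_samples(dataloader, ignore_index=IGNORE_INDEX):
    """Report (step, sample) pairs whose labels are entirely ignore_index.

    Such samples contribute zero valid targets to the token-level cross
    entropy, so their per-sample loss is ill-defined and can destabilize
    training.
    """
    bad = []
    for step, batch in enumerate(dataloader):
        labels = batch["labels"]                      # shape: (batch, seq_len)
        valid = (labels != ignore_index).sum(dim=-1)  # valid target tokens per sample
        for i in torch.nonzero(valid == 0).flatten().tolist():
            bad.append((step, i))
    return bad
```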
newtonysls commented 1 year ago

I'm only asking because I really couldn't find the bug. I have inspected the LM labels of my samples. At some step the loss suddenly changes drastically, and at that same step the model weights also change a lot, even with the learning rate set to 0 😭

SparkJiao commented 1 year ago

A big change in the loss isn't necessarily a serious problem. My understanding is that as long as it isn't NaN, training proceeds normally, and the loss goes back to decreasing after a few steps, it's fine. You can also increase gradient_accumulation_steps a bit to smooth the loss.

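For reference, the relevant DeepSpeed settings look roughly like the sketch below (the values are placeholders, not taken from this repo). With the pipeline engine, gradient_accumulation_steps is also the number of micro-batches processed per train_batch call, so raising it averages the reported loss over more samples per optimizer step.

```python
# Illustrative DeepSpeed config fragment; values are placeholders.
# DeepSpeed enforces:
#   train_batch_size == train_micro_batch_size_per_gpu
#                       * gradient_accumulation_steps
#                       * data_parallel_world_size
ds_config = {
    "train_micro_batch_size_per_gpu": 1,
    "gradient_accumulation_steps": 16,  # increase to smooth the per-step loss
    "train_batch_size": 16,             # assumes a data-parallel world size of 1
}
```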
newtonysls commented 1 year ago

It suddenly jumps from around 1.x to over 10, and that doesn't count as big 🙋? That's absurd.

SparkJiao commented 1 year ago

A loss above 10 is quite common; it may simply be an issue with your data.

newtonysls commented 1 year ago

Got it, thanks. I've given up on pipeline parallelism and will try other approaches to handle longer sequence lengths.

xinpeng-zhang commented 1 year ago

Could we connect and discuss this together? My pipeline-parallel setup also has serious problems.

newtonysls commented 1 year ago

I've given up on pipeline parallelism for LLaMA, haha; I never found the cause. LLaMA's tokenizer is also quite unfriendly to Chinese, so I later switched to a different base model. To save GPU memory you can also use gradient checkpointing; just set it to true directly in the LLaMA source code.

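For what it's worth, with recent Hugging Face Transformers versions you don't have to edit the modeling code to turn this on; a minimal sketch (the checkpoint name is only a placeholder):

```python
from transformers import AutoModelForCausalLM

# Placeholder model name; substitute your own checkpoint.
model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-2-7b-hf")
model.gradient_checkpointing_enable()  # recompute activations in backward to save memory
model.config.use_cache = False         # the KV cache is incompatible with checkpointing during training
```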

xinpeng-zhang commented 1 year ago

I'm working on pipeline parallelism for ChatGLM2. The loss starts around 10 and drops to 0.1 after a few hundred steps. The loss does go down, but inference is broken: it keeps repeating the same token.

newtonysls commented 1 year ago

My training behaves just like yours: the loss is normal for the first few samples and then jumps to 10-something. I also found that the model weights change before the gradient-update step is even reached, so I never found the cause either, and the inference results are also very strange.

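One way to pin down "the weights change before the gradient update" is to fingerprint the parameters around each micro-batch; a rough sketch (the usage lines in the comment are hypothetical and depend on your training loop):

```python
import torch

def param_checksum(model: torch.nn.Module) -> float:
    """Cheap fingerprint of the weights: the sum of all parameter values."""
    with torch.no_grad():
        return float(sum(p.detach().float().sum() for p in model.parameters()))

# Hypothetical usage: the checksum should stay constant across forward/backward
# passes and only move when the optimizer actually steps (and not at all if lr=0).
# before = param_checksum(model)
# loss = model(**batch).loss; loss.backward()
# assert param_checksum(model) == before, "weights changed before optimizer.step()"
```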

realgump commented 3 months ago

I suspect it may be a communication issue: https://github.com/microsoft/DeepSpeed/issues/4726