FranxYao / Long-Context-Data-Engineering

Implementation of the paper "Data Engineering for Scaling Language Models to 128K Context"

First-step loss of continued pretraining on 80K #13

Closed ftgreat closed 5 months ago

ftgreat commented 5 months ago

Could the authors share roughly what scale the loss is at during the first few steps of continued pretraining? Thanks.

FranxYao commented 5 months ago

Per-step summary:

| step | tokens | train/loss | grad_norm | learning_rate | epoch |
|---|---|---|---|---|---|
| 1 | 5M | 5.5021 | 249.4825 | 0.0000 | 0.0000 |
| 50 | 250M | 2.2328 | 12.5982 | 0.0000 | 0.0300 |
| 100 | 500M | 1.8736 | 4.4914 | 0.0000 | 0.0500 |
| 500 | 2500M | 1.6361 | 2.9198 | 0.0000 | 0.2500 |

Per-position loss (`train/per_length_loss/loss@N`):

| @N | step 1 | step 50 | step 100 | step 500 |
|---|---|---|---|---|
| 2 | 4.8841 | 5.6365 | 6.0829 | 5.3172 |
| 4 | 4.2377 | 4.8280 | 4.9588 | 4.6497 |
| 8 | 3.7006 | 3.8832 | 3.9100 | 3.5052 |
| 16 | 2.9392 | 3.1913 | 3.0207 | 2.9466 |
| 32 | 2.4782 | 2.5983 | 2.5083 | 2.4823 |
| 64 | 2.2841 | 2.3225 | 2.2451 | 2.0362 |
| 128 | 2.0425 | 2.1339 | 2.1260 | 1.8864 |
| 256 | 1.8967 | 1.9717 | 1.9022 | 1.8654 |
| 512 | 2.0560 | 1.9957 | 1.9313 | 1.8011 |
| 1024 | 2.4999 | 2.0113 | 1.8898 | 1.7140 |
| 2048 | 2.8781 | 1.9699 | 1.9413 | 1.6129 |
| 4096 | 3.4491 | 2.0088 | 1.8918 | 1.6609 |
| 8192 | 4.2185 | 1.9869 | 1.8172 | 1.6198 |
| 16384 | 4.8923 | 1.9951 | 1.7880 | 1.6394 |
| 32768 | 5.3619 | 2.1541 | 1.8784 | 1.5605 |
| 65536 | 5.8821 | 2.2478 | 1.8538 | 1.6188 |
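For reference, a minimal sketch of how per-position loss buckets like `loss@N` above can be computed from token-level losses; the function name and the bucket convention (bucket @N covering positions (N/2, N]) are illustrative assumptions, not necessarily the exact logging code used in this repo.

```python
import torch
import torch.nn.functional as F

def per_length_loss(logits, labels, max_len=65536):
    """Illustrative sketch: bucket token-level cross-entropy by position,
    assuming bucket @N covers positions (N/2, N]. Not the repo's exact code."""
    # standard causal shift: position t predicts token t+1
    shift_logits = logits[:, :-1, :]
    shift_labels = labels[:, 1:]
    token_loss = F.cross_entropy(
        shift_logits.reshape(-1, shift_logits.size(-1)),
        shift_labels.reshape(-1),
        reduction="none",
    ).view(shift_labels.shape)  # (batch, seq_len - 1)

    buckets, lo, n = {}, 0, 2
    while lo < token_loss.size(1) and n <= max_len:
        hi = min(n, token_loss.size(1))
        buckets[f"loss@{n}"] = token_loss[:, lo:hi].mean().item()
        lo, n = hi, n * 2
    return buckets
```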

ftgreat commented 5 months ago

Thanks for your reply. One question: model A converges at a loss of about 2.0 on 4K samples; we then continue pretraining from it (without loading the optimizer state). On the same data (shuffled) repacked into 32K samples, the initial loss is around 10.0. The losses you posted above show a similar pattern: the loss in the first few steps is well above model A's converged loss.

What causes this? Thanks. Could it be related to the 32K samples packing together different documents?
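(By "packing" I mean concatenating the shuffled documents and slicing the token stream into fixed 32K chunks, roughly as in the sketch below; the separator token and function are placeholders, not the actual preprocessing pipeline.)

```python
def pack_documents(token_lists, chunk_len=32768, sep_id=2):
    """Illustrative sketch of document packing: concatenate tokenized docs
    (separated by sep_id) and cut the stream into fixed-length chunks."""
    stream = []
    for toks in token_lists:
        stream.extend(toks)
        stream.append(sep_id)  # placeholder document-boundary token
    # drop the trailing partial chunk for simplicity
    n_chunks = len(stream) // chunk_len
    return [stream[i * chunk_len:(i + 1) * chunk_len] for i in range(n_chunks)]
```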

FranxYao commented 5 months ago

Mostly because the loss at positions beyond 4K is initially quite large.
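A rough back-of-the-envelope check with the step-1 per-length losses above: if bucket @N covers positions (N/2, N] (an assumption about the logging convention), the buckets beyond 4K contain the vast majority of tokens in a long sequence, so their high losses dominate the sequence-level average.

```python
# step-1 per-length losses copied from the log above
losses = {2: 4.8841, 4: 4.2377, 8: 3.7006, 16: 2.9392, 32: 2.4782,
          64: 2.2841, 128: 2.0425, 256: 1.8967, 512: 2.0560, 1024: 2.4999,
          2048: 2.8781, 4096: 3.4491, 8192: 4.2185, 16384: 4.8923,
          32768: 5.3619, 65536: 5.8821}

# assumption: bucket @N holds the tokens at positions (N/2, N], i.e. N/2 tokens
weights = {n: n // 2 if n > 2 else 2 for n in losses}
avg = sum(weights[n] * losses[n] for n in losses) / sum(weights.values())
print(f"token-weighted average ≈ {avg:.2f}")  # ≈ 5.3, same ballpark as train/loss 5.5021
```

Consistent with this, the step-1 losses at positions up to a few thousand are already close to the ~2.0 level, while the buckets beyond 4K sit at 4-6 and pull the average up.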


ftgreat commented 5 months ago


Understood, thanks for your reply. In the 4K-to-32K case the loss still drops quickly and converges, e.g., within about 50 steps.