alibaba / Pai-Megatron-Patch

The official repo of Pai-Megatron-Patch for LLM & VLM large scale training developed by Alibaba Cloud.
Apache License 2.0

AssertionError: Rank 11: found NaN in local grad norm in backward pass before data-parallel communication collective. Device: 3 #366

Open lanfengmo opened 4 weeks ago

lanfengmo commented 4 weeks ago

AssertionError: Rank 11: found NaN in local grad norm in backward pass before data-parallel communication collective. Device: 3

Training llama3-70B on 4 nodes: the run prints a few iterations and then fails with the error above. The training script is as follows:

cat pretrain_llama3_70B_tp4_pp8.sh
cd /workspace/Pai-Megatron-Patch/examples/llama3
sh run_pretrain_llama_70b.sh \
dsw \
70B \
1 \
1024 \
1e-7 \
1e-8 \
128 \
128 \
bf16 \
4 \
8 \
70B \
sel \
true \
false \
false \
true \
100000 \
/mnt/llama3-datasets/wudao_llama3bpe_content_document \
/mnt/llama3-ckpts/Meta-Llama-3-70B-tp4-pp8 \
10000000 \
1 \
/mnt/output_megatron_llama3_70B

jerryli1981 commented 3 weeks ago

Hello, I ran continued pretraining of qwen2.5 72B on four nodes for a long time and never hit this problem. Could you try llama3.1 or qwen2.5 first?

jerryli1981 commented 3 weeks ago

Also, would it be convenient for you to join our group and add us so we can discuss this in detail?

lanfengmo commented 3 weeks ago

> Hello, I ran continued pretraining of qwen2.5 72B on four nodes for a long time and never hit this problem. Could you try llama3.1 or qwen2.5 first?

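OK, I'll try switching models. For this gradient explosion problem I have already tried lowering the learning rate, clipping gradients with --clip-grad 1.0, and similar methods, but the problem still occurs.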
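
One possible reason clipping does not help here: the error message says the NaN is found in the local grad norm during the backward pass, before the data-parallel collective, so presumably the check fires before clipping is ever applied. Even if clipping did run, norm-based clipping cannot repair a NaN, because the scale factor is computed from the norm itself. The following is a minimal pure-Python sketch of that behaviour, not Megatron's actual code; the helper names (`global_grad_norm`, `clip_by_global_norm`, `first_non_finite`) and the `eps` value are hypothetical and chosen only for illustration.

```python
import math


def global_grad_norm(grads):
    """L2 norm over a flat list of gradient values (illustrative helper)."""
    return math.sqrt(sum(g * g for g in grads))


def clip_by_global_norm(grads, max_norm, eps=1e-6):
    """Scale gradients so their global L2 norm is at most max_norm.

    Returns (possibly scaled gradients, norm measured before scaling).
    """
    norm = global_grad_norm(grads)
    scale = max_norm / (norm + eps)
    # If norm is NaN, scale is NaN and `NaN < 1.0` is False,
    # so NaN gradients pass through untouched.
    if scale < 1.0:
        grads = [g * scale for g in grads]
    return grads, norm


def first_non_finite(named_grads):
    """Return the name of the first entry holding a NaN/Inf gradient, else None."""
    for name, grads in named_grads.items():
        if not all(math.isfinite(g) for g in grads):
            return name
    return None


# Hypothetical gradients: one healthy layer, one that already contains a NaN.
named = {
    "layer0.weight": [3.0, 4.0],
    "layer1.weight": [1.0, float("nan")],
}

healthy, healthy_norm = clip_by_global_norm(named["layer0.weight"], max_norm=1.0)
broken, broken_norm = clip_by_global_norm(named["layer1.weight"], max_norm=1.0)
bad_layer = first_non_finite(named)
```

In this sketch the healthy gradients are scaled down to roughly unit norm, while the gradients containing a NaN come out still containing a NaN. That suggests the more useful next step is locating where the first non-finite value appears (a specific layer, rank, or data batch) rather than tightening the clip threshold.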