PaddlePaddle / PaddleNLP

👑 Easy-to-use and powerful NLP and LLM library with 🤗 Awesome model zoo, supporting a wide range of NLP tasks from research to industrial applications, including 🗂 Text Classification, 🔍 Neural Search, ❓ Question Answering, ℹ️ Information Extraction, 📄 Document Intelligence, 💌 Sentiment Analysis etc.
https://paddlenlp.readthedocs.io

[Bug]: llama-13b amp_master_grad + loss_scale=65535: training loss becomes 0 #7160

Closed: BeingGod closed this issue 9 months ago

BeingGod commented 11 months ago

Software environment

- paddlepaddle: N/A
- paddlepaddle-gpu: 2.5.1
- paddlenlp: 68bb39d

Duplicate issues

Bug description

Training llama-13b on a single node with eight A100 GPUs, mp=8, with amp_master_grad enabled: when loss_scale=65536, the loss drops to 0 at step 59. With amp_master_grad disabled, training is normal.

Guess: with amp_master_grad enabled, `check_nan_or_inf_and_unscale` does not take effect. Because loss_scale is too large, inf appears in the gradients, which in turn makes the loss become 0?
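For reference (not from the original report), this is roughly how dynamic loss scaling is supposed to behave in Paddle dygraph: after `backward()`, the scaler unscales the gradients, checks them for inf/nan, skips the optimizer step if overflow is found, and lowers the loss scale. Below is a minimal sketch with a hypothetical model and optimizer, assuming the public `paddle.amp.GradScaler` / `paddle.amp.decorate` APIs (including the `master_grad` argument) behave as documented in this Paddle release. If that check is bypassed when amp_master_grad is on, an overflowed gradient would reach the weights directly, which would explain the loss collapsing to 0.

```python
import paddle

# Hypothetical stand-ins for the real model / optimizer, just to show the flow.
model = paddle.nn.Linear(1024, 1024)
optimizer = paddle.optimizer.AdamW(learning_rate=1e-5, parameters=model.parameters())

# Mirrors --fp16_opt_level "O2" --amp_master_grad 1 (the master_grad argument is
# assumed to be available in this Paddle version).
model, optimizer = paddle.amp.decorate(
    models=model, optimizers=optimizer, level="O2", master_grad=True)

# Mirrors --scale_loss 65536.
scaler = paddle.amp.GradScaler(init_loss_scaling=65536)

x = paddle.randn([4, 1024])
with paddle.amp.auto_cast(level="O2", custom_black_list=["matmul_v2", "mul"]):
    loss = model(x).mean()

scaler.scale(loss).backward()
# step() is expected to unscale the gradients, detect inf/nan, and skip the
# parameter update on overflow; update() then reduces the loss scale.
scaler.step(optimizer)
scaler.update()
optimizer.clear_grad()
```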

Steps to reliably reproduce & code

Training script:

set -x
SCRIPT_HOME=$(cd $(dirname $0); pwd)

CARDS="0,1,2,3,4,5,6,7"
export NCCL_SHM_DISABLE=1

task_name="llama_hybrid"
rm -rf "$SCRIPT_HOME/output/$task_name/"
rm -rf "$SCRIPT_HOME/output/${task_name}_log"

TP=8
PP=1
SHARDING_STAGE="stage2"

rm -rf $SCRIPT_HOME/profiler/llama

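# fp16 O2 with master gradients and an initial loss scale of 65536 is the
# combination that reproduces the loss dropping to 0 at step 59.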
python -u  -m paddle.distributed.launch \
    --devices=$CARDS \
    --log_dir "output/${task_name}_log" \
    run_pretrain.py \
    --model_type "llama" \
    --model_name_or_path "facebook/llama-13b" \
    --tokenizer_name_or_path "facebook/llama-13b" \
    --input_dir "./data" \
    --output_dir "output/$task_name" \
    --split 949,50,1 \
    --max_seq_length 512 \
    --per_device_train_batch_size 4 \
    --per_device_eval_batch_size 4 \
    --use_flash_attention 0 \
    --use_fused_rms_norm 0 \
    --fp16 \
    --fp16_opt_level "O2" \
    --scale_loss 65536 \
    --amp_master_grad 1 \
    --tensor_parallel_degree $TP \
    --pipeline_parallel_degree $PP \
    --sharding $SHARDING_STAGE \
    --learning_rate 1.0e-5 \
    --min_learning_rate 1.0e-7 \
    --max_steps 10000 \
    --save_steps 10000 \
    --weight_decay 0.01 \
    --adam_beta1 0.9 \
    --adam_beta2 0.95 \
    --warmup_ratio 0.02 \
    --logging_steps 1 \
    --dataloader_num_workers 0 \
    --gradient_accumulation_steps 1 \
    --eval_steps 1000 \
    --report_to "visualdl" \
    --disable_tqdm true \
    --continue_training 0 \
    --recompute 1 \
    --do_train \
    --device "gpu" \
    --overwrite_output_dir True \
    --amp_custom_black_list 'matmul' 'matmul_v2' 'mul'

Log file: workerlog.log
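Not part of the original report, but one way to test the guess above is to check every gradient for inf/nan right after `backward()` and before the optimizer step. A small helper sketch using standard Paddle tensor ops (the function name is made up):

```python
import paddle

def grads_are_finite(model: paddle.nn.Layer) -> bool:
    """Return False and name the first parameter whose gradient contains inf/nan."""
    for name, param in model.named_parameters():
        grad = param.grad
        if grad is None:
            continue
        if not paddle.all(paddle.isfinite(grad.astype("float32"))).item():
            print(f"non-finite gradient in {name}")
            return False
    return True

# Intended usage inside the training loop, after backward() and before
# scaler.step(optimizer):
#     if not grads_are_finite(model):
#         print("overflow detected; this step should be skipped by the scaler")
```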

ZHUI commented 11 months ago

It looks like you are using fp16; I would suggest using bf16. FP16 is indeed prone to numerical issues.
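Added context, not from the original reply: bf16 keeps the same 8-bit exponent range as fp32, so gradients are far less likely to overflow and loss scaling is usually unnecessary. A compact sketch of the difference at the framework level, assuming the `dtype` argument of `paddle.amp.decorate` / `paddle.amp.auto_cast`; in the script above the analogous change would be replacing the fp16 flags with the trainer's bf16 option, if this PaddleNLP version provides one.

```python
import paddle

model = paddle.nn.Linear(1024, 1024)
optimizer = paddle.optimizer.AdamW(learning_rate=1e-5, parameters=model.parameters())

# bf16 O2: the wide exponent range makes a GradScaler / loss scale unnecessary.
model, optimizer = paddle.amp.decorate(
    models=model, optimizers=optimizer, level="O2", dtype="bfloat16")

x = paddle.randn([4, 1024])
with paddle.amp.auto_cast(level="O2", dtype="bfloat16"):
    loss = model(x).mean()

loss.backward()
optimizer.step()
optimizer.clear_grad()
```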

github-actions[bot] commented 9 months ago

This issue is stale because it has been open for 60 days with no activity.

github-actions[bot] commented 9 months ago

This issue was closed because it has been inactive for 14 days since being marked as stale.