PaddlePaddle / Paddle

PArallel Distributed Deep LEarning: Machine Learning Framework from Industrial Practice (the PaddlePaddle core framework: high-performance single-machine and distributed training and cross-platform deployment for deep learning and machine learning)
http://www.paddlepaddle.org/
Apache License 2.0

CUDA: running mp_async_allreduce on the latest develop build reports an error #58847

Closed: zhurou603 closed this issue 10 months ago

zhurou603 commented 10 months ago

Describe the Bug

[screenshot of the reported error]

SCRIPT_HOME=$(cd $(dirname $0); pwd)

CARDS="0,1,2,3,4,5,6,7"
export NCCL_SHM_DISABLE=1
export FLAGS_call_stack_level=2

task_name="llama-13b_mp2pp2dp2_overlap"
rm -rf "$SCRIPT_HOME/output/$task_name/"
rm -rf "$SCRIPT_HOME/output/${task_name}_log"

TP=2
PP=1
SHARDING_STAGE="stage2"

rm -rf $SCRIPT_HOME/profiler/${task_name}

#PROFILER_OPTIONS="batch_range=[0,3]; profile_path=./profiler/${task_name}; record_shapes=False; exited_on_finished=False; timer_only=False; profile_memory=False" \
CUDA_LAUNCH_BLOCKING=0 \
python -u  -m paddle.distributed.launch \
    --devices=$CARDS \
    --log_dir "output/$task_name""_log" \
    run_pretrain.py \
    --model_type "llama" \
    --model_name_or_path "facebook/llama-7b" \
    --tokenizer_name_or_path "facebook/llama-7b" \
    --input_dir "./data" \
    --output_dir "output/$task_name" \
    --split 949,50,1 \
    --max_seq_length 512 \
    --per_device_train_batch_size 1 \
    --per_device_eval_batch_size 1 \
    --use_flash_attention 0 \
    --use_fused_rms_norm 0 \
    --fp16 \
    --fp16_opt_level "O2" \
    --scale_loss 1024 \
    --amp_master_grad 1 \
    --max_grad_norm 1.0 \
    --tensor_parallel_degree $TP \
    --pipeline_parallel_degree $PP \
    --sharding $SHARDING_STAGE \
    --learning_rate 1.0e-5 \
    --min_learning_rate 1.0e-7 \
    --max_steps 10 \
    --save_steps 10000 \
    --weight_decay 0.01 \
    --adam_beta1 0.9 \
    --adam_beta2 0.95 \
    --warmup_ratio 0.02 \
    --logging_steps 1 \
    --dataloader_num_workers 0 \
    --gradient_accumulation_steps 1 \
    --eval_steps 1000 \
    --report_to "visualdl" \
    --disable_tqdm true \
    --continue_training 0 \
    --recompute 1 \
    --do_train \
    --device "gpu" \
    --overwrite_output_dir True \
    --amp_custom_black_list 'c_embedding' 'elementwise_mul' \
    --tensor_parallel_config "enable_mp_async_allreduce"
zhurou603 commented 10 months ago

[screenshot]

zhurou603 commented 10 months ago

[screenshot]

zhurou603 commented 10 months ago

paddle/distributed/fleet/layers/mpu/mp_layers.py:223 [screenshots of the code at this location]
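The failure reduces to a single paddle.matmul between a float32 input and a float16 weight, as in the snippet below: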

import paddle
# Minimal reproduction: a float32 input multiplied against a float16 weight slice.
x = paddle.rand([1, 512, 4096], dtype='float32')
y = paddle.rand([5504, 4096], dtype='float16')
# With mismatched dtypes (float32 vs. float16) this matmul reproduces the reported failure.
z = paddle.matmul(x, y, transpose_y=True)
print(z.shape)
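
For illustration only: a minimal sketch of one way such a dtype mismatch can be avoided at the call site, by casting the input to the weight's dtype before the matmul. The cast-based workaround is an assumption of this sketch, not necessarily what the linked fix does.

import paddle

x = paddle.rand([1, 512, 4096], dtype='float32')
y = paddle.rand([5504, 4096], dtype='float16')

# Hypothetical workaround: align dtypes before multiplying so the kernel
# sees a single dtype. Illustrative only; the actual fix is in the PR
# referenced below.
z = paddle.matmul(x.astype(y.dtype), y, transpose_y=True)
print(z.shape)  # [1, 512, 5504]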
zhurou603 commented 10 months ago

Submitted a PR with a fix: https://github.com/PaddlePaddle/Paddle/pull/58858

Caozhou1995 commented 10 months ago

The PR has been submitted; this issue can be closed.