ZLkanyo009 opened 6 months ago
7_gpu_pp1.log 31_gpu_pp4.log
When running Llama2 70B (with a reduced number of layers), the loss curves for PP=1 and PP=4 show different downward trends. The logs and loss-curve plots are attached above; the launch script is as follows:
```bash
export CUDA_DEVICE_MAX_CONNECTIONS=1

GPUS_PER_NODE=8
# Change for multinode config
MASTER_ADDR=192.167.5.2
MASTER_PORT=29501
NUM_NODES=4
NODE_RANK=0
WORLD_SIZE=$(($GPUS_PER_NODE*$NUM_NODES))

CHECKPOINT_PATH='/data/zhangling21/ckpts/'
TENSORBOARD_LOGS_PATH='/data/zhangling21/tensorboard_logs/'
TOKENIZER_PATH='/data/zhangling21/llama_00_text_document/tokenizer/tokenizer.model'
DATA_PATH='/data/zhangling21/llama_00_text_document/llama_00_text_document'

DISTRIBUTED_ARGS=(
    --nproc_per_node $GPUS_PER_NODE
    --nnodes $NUM_NODES
    --node_rank $NODE_RANK
    --master_addr $MASTER_ADDR
    --master_port $MASTER_PORT
)

# --tokenizer-type LLaMASentencePieceTokenizer
# --rmsnorm-epsilon 1e-5
LLAMA_MODEL_ARGS=(
    --num-layers 8
    --hidden-size 8192
    --ffn-hidden-size 28672
    --num-attention-heads 64
    --seq-length 4096
    --max-position-embeddings 4096
    --group-query-attention
    --num-query-groups 8
    --tokenizer-type Llama2Tokenizer
    --tokenizer-model $TOKENIZER_PATH
    --swiglu
    --normalization RMSNorm
    --use-rotary-position-embeddings
    --no-position-embedding
    --disable-bias-linear
)

# --optimizer adam
# --adam-eps 1e-05
# --no-contiguous-buffers-in-local-ddp
# --recompute-method uniform
# --no-async-tensor-model-parallel-allreduce
# --embedding-dropout 0
# --multi-query-attention
# --multi-query-group-num 8
# --ffn-dim-multiplier 1.3
# --recompute-granularity full
# --distribute-saved-activations
# --recompute-num-layers 1
# --memory-saving
# --fp16
# --optimizer adam
# --adam-eps 1e-05
TRAINING_ARGS=(
    --micro-batch-size 1
    --global-batch-size 44
    --train-samples 24414
    --weight-decay 1e-2
    --optimizer adam
    --clip-grad 1.0
    --lr 0.00015
    --lr-decay-style cosine
    --min-lr 1.0e-5
    --lr-warmup-fraction .01
    --adam-beta1 0.9
    --adam-beta2 0.95
    --attention-dropout 0.0
    --hidden-dropout 0.0
    --untie-embeddings-and-output-weights
    --multiple-of 4096
    --no-gradient-accumulation-fusion
    --recompute-granularity 'full'
    --recompute-num-layers 1
    --recompute-method 'uniform'
    --no-async-tensor-model-parallel-allreduce
)

MODEL_PARALLEL_ARGS=(
    --tensor-model-parallel-size 8
    --pipeline-model-parallel-size 4
)

DATA_ARGS=(
    --data-path $DATA_PATH
    --split 1
)

EVAL_AND_LOGGING_ARGS=(
    --log-interval 1
    --init-method-std 0.02
    --seed 1234
    --eval-iters 0
    --use-cpu-initialization
)

# --load "/data/zhangling21/llama_00_text_document/ckpt0227_8L"
# --no-load-rng
# --save "/data/zhangling21/llama_00_text_document/ckpt0227_8L"
# --save-interval 1

cmd="torchrun ${DISTRIBUTED_ARGS[@]} pretrain_llama.py \
    ${LLAMA_MODEL_ARGS[@]} \
    ${TRAINING_ARGS[@]} \
    ${MODEL_PARALLEL_ARGS[@]} \
    ${DATA_ARGS[@]} \
    ${EVAL_AND_LOGGING_ARGS[@]}"
echo $cmd
eval $cmd
```
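For anyone trying to reproduce the comparison, here is a minimal sketch for overlaying the two loss curves from the attached logs. It assumes the logs contain standard Megatron-style per-iteration lines with an `lm loss: 1.083E+01`-style field; the regex is an assumption and may need adjusting if this fork prints loss differently.

```python
import re
import matplotlib.pyplot as plt

# Assumed Megatron-style loss field, e.g. "lm loss: 1.083E+01".
# Adjust this pattern if the fork logs loss in another format.
LOSS_RE = re.compile(r"lm loss:\s*([0-9.]+E[+-][0-9]+)")

def read_losses(path):
    """Collect one loss value per training-iteration line in the log."""
    losses = []
    with open(path, errors="ignore") as f:
        for line in f:
            m = LOSS_RE.search(line)
            if m:
                losses.append(float(m.group(1)))
    return losses

pp1 = read_losses("7_gpu_pp1.log")   # PP=1 run attached above
pp4 = read_losses("31_gpu_pp4.log")  # PP=4 run attached above

plt.plot(pp1, label="PP=1")
plt.plot(pp4, label="PP=4")
plt.xlabel("iteration")
plt.ylabel("lm loss")
plt.legend()
plt.savefig("pp1_vs_pp4_loss.png")

# Numeric spot check of the first few iterations.
for i, (a, b) in enumerate(zip(pp1, pp4)):
    if i >= 5:
        break
    print(f"iter {i + 1}: PP=1 {a:.6f} | PP=4 {b:.6f} | diff {a - b:+.6f}")
```

Overlaying the curves this way also helps narrow the cause: a gap already present at iteration 1 points at initialization or data-order differences between the runs, while a gap that grows gradually points more toward numerical or reduction-order differences.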
@zhaoyinglia Hi, could you please take a look at this issue? @aoyulong has been quite busy lately.