alibaba / Pai-Megatron-Patch

The official repo of Pai-Megatron-Patch for LLM & VLM large-scale training, developed by Alibaba Cloud.
Apache License 2.0

Saved checkpoints are missing distrib_optim.pt #315

Closed: shizikachen closed this issue 1 month ago

shizikachen commented 2 months ago

I used Pai-Megatron-Patch/examples/llama3/run_pretrain_megatron_llama.sh to train my model, with the following settings:

```bash
MEGATRON_PATCH_PATH=../Pai-Megatron-Patch-main-version2
MEGATRON_PATH=${MEGATRON_PATCH_PATH}/Megatron-LM-231007
MODEL_SIZE=8B
BATCH_SIZE=1
GLOBAL_BATCH_SIZE=8192
LR=2.0e-5
MIN_LR=1.0e-6
SEQ_LEN=8192
PAD_LEN=8192
EXTRA_VOCAB_SIZE=1
PR=bf16
TP=1
PP=8
AC=sel
DO=true
FL=true
SP=true
TE=false

megatron_options=" \
    --save ${SAVED_PRETRAIN_CHECKPOINT_PATH} \
    --split 99,1,0 \
    --train-data-path ${DATASET_PATH} \
    --data-path ${DATASET_PATH} \
    --lr ${LR} \
    --min-lr ${MIN_LR} \
    --lr-decay-style linear \
    --adam-beta1 0.9 \
    --adam-beta2 0.95 \
    --weight-decay 0.1 \
    --clip-grad 1.0 \
    --init-method-std 0.006 \
    --lr-decay-iters ${LR_DECAY_ITERS} \
    --lr-warmup-iters ${LR_WARMUP_ITERS} \
    --train-iters ${TRAIN_ITERS} \
    --micro-batch-size ${BATCH_SIZE} \
    --global-batch-size ${GLOBAL_BATCH_SIZE} \
    --num-layers ${NUM_LAYERS} \
    --hidden-size ${HIDDEN_SIZE} \
    --num-attention-heads ${NUM_ATTN_HEADS} \
    --ffn-hidden-size ${INTERMEDIATE_SIZE} \
    --seq-length ${SEQ_LEN} \
    --max-position-embeddings ${MAX_POSITION_EMBEDDINGS} \
    --max-padding-length ${PAD_LEN} \
    --log-interval 1 \
    --eval-interval 10000 \
    --eval-iters 10 \
    --save-interval ${SAVE_INTERVAL} \
    --tensorboard-queue-size 1 \
    --tensorboard-dir ${TENSORBOARD_DIR} \
    --log-timers-to-tensorboard \
    --log-batch-size-to-tensorboard \
    --log-validation-ppl-to-tensorboard \
    --tensor-model-parallel-size ${TP} \
    --pipeline-model-parallel-size ${PP} \
    --dataset LLama-Pretrain-Idxmap \
    --num-workers 8 \
    --seed 1234 \
    --extra-vocab-size ${EXTRA_VOCAB_SIZE} \
    --vocab-file ${VOCAB_PATH} \
    --merge-file ${MERGE_PATH} \
    --swiglu \
    --normalization RMSNorm \
    --use-rotary-position-embeddings \
    --position-embedding-type rope \
    --untie-embeddings-and-output-weights \
    --rotary-base 500000 \
    --attention-dropout 0.0 \
    --hidden-dropout 0.0 \
    --disable-bias-linear \
    --norm-epsilon 1e-05 \
```
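For context: the `DO` switch above is what these run scripts use to turn on Megatron's distributed optimizer, and `distrib_optim.pt` files are only written when `--use-distributed-optimizer` actually reaches the training command. A minimal sketch of that mapping, assuming the usual pattern in the Pai-Megatron-Patch scripts rather than quoting the exact file:

```bash
# Sketch of how DO is typically translated into a Megatron flag inside
# run_pretrain_megatron_llama.sh (the do_options variable name is assumed).
if [ ${DO} = true ]; then
    do_options=" \
        --use-distributed-optimizer"
elif [ ${DO} = false ]; then
    do_options=""
fi
```

If this branch is never reached (or `do_options` is never appended to the final launch command), training runs with a plain optimizer and no distrib_optim.pt is ever saved.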

Training was unexpectedly interrupted. When I tried to resume, I found that the saved checkpoint contains only model_optim_rng.pt, so it cannot be loaded. How should I resolve this?
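One possible stopgap, assuming the weights inside model_optim_rng.pt are themselves intact: resume from the model weights only and let Megatron reinitialize the optimizer and RNG state. `--load`, `--no-load-optim`, and `--no-load-rng` are standard Megatron-LM arguments; the trade-off is that the Adam moments restart from zero, so the first steps after resuming may be noisier.

```bash
# Stopgap resume sketch (not an official fix): load weights only,
# skipping the missing optimizer state and the saved RNG state.
megatron_options="${megatron_options} \
    --load ${SAVED_PRETRAIN_CHECKPOINT_PATH} \
    --no-load-optim \
    --no-load-rng"
```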