NVIDIA / Megatron-LM

Ongoing research training transformer models at scale
https://docs.nvidia.com/megatron-core/developer-guide/latest/user-guide/index.html#quick-start

[QUESTION] Checkpoint storage format #970

Closed syx11237744 closed 2 months ago

syx11237744 commented 3 months ago

Your question: Could you let me know which version I should revert to if I want to use the previous checkpoint storage format, where checkpoints are stored as `.pt` files? Or is there any other way to save checkpoints as `.pt` files? Thank you!
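For context, the legacy format in question is a single pickled file written with `torch.save` (typically `<CHECKPOINT_PATH>/iter_XXXXXXX/mp_rank_00/model_optim_rng.pt`). A minimal round-trip sketch, assuming PyTorch is installed; the dictionary keys here are illustrative, not the exact Megatron schema:

```python
import os
import tempfile

import torch

# Hypothetical miniature checkpoint dict standing in for Megatron's
# model/optimizer state; real checkpoints hold the full state dicts.
ckpt = {
    "iteration": 1000,
    "model": {"embedding.weight": torch.zeros(4, 2)},
}

# Legacy-style .pt checkpoint: one torch.save'd dict per rank.
path = os.path.join(tempfile.mkdtemp(), "model_optim_rng.pt")
torch.save(ckpt, path)

# Loading it back is a plain torch.load, no distributed machinery needed.
loaded = torch.load(path, map_location="cpu")
print(loaded["iteration"])  # → 1000
```

This single-file layout is what tools expecting a `.pt` checkpoint can consume directly, unlike the newer sharded distributed-checkpoint directories.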

```bash
DATASET_PATH=/share/root/out_file/sum.jsonl
SAVE_PATH=/share/sunyuanxu/out_file/sum
VOCAB_FILE=gpt2/vocab.json
MERGE_FILE=gpt2/merges.txt

export CUDA_DEVICE_MAX_CONNECTIONS=1

GPUS_PER_NODE=1

# Change for multinode config
MASTER_ADDR=localhost
MASTER_PORT=6000
NNODES=1
NODE_RANK=0
WORLD_SIZE=$(($GPUS_PER_NODE*$NNODES))

CHECKPOINT_PATH=/share/root/checkpoint/cp
DATA_PATH=/share/root/out_file/sum_text_document

DISTRIBUTED_ARGS="
    --nproc_per_node $GPUS_PER_NODE \
    --nnodes $NNODES \
    --node_rank $NODE_RANK \
    --master_addr $MASTER_ADDR \
    --master_port $MASTER_PORT
"

GPT_ARGS="
    --num-layers 24 \
    --hidden-size 1024 \
    --num-attention-heads 16 \
    --seq-length 1024 \
    --max-position-embeddings 1024 \
    --micro-batch-size 32 \
    --global-batch-size 256 \
    --lr 0.00015 \
    --train-iters 1000 \
    --lr-decay-iters 320000 \
    --lr-decay-style cosine \
    --min-lr 1.0e-5 \
    --weight-decay 1e-2 \
    --lr-warmup-fraction .01 \
    --clip-grad 1.0 \
    --fp16 \
    --attention-softmax-in-fp32 \
"

DATA_ARGS="
    --data-path $DATA_PATH \
    --vocab-file $VOCAB_FILE \
    --merge-file $MERGE_FILE \
    --split 949,50,1 \
"

OUTPUT_ARGS="
    --log-interval 100 \
    --save-interval 10000 \
    --eval-interval 1000 \
    --eval-iters 10 \
"

torchrun $DISTRIBUTED_ARGS Megatron-LM/pretrain_gpt.py \
    $GPT_ARGS \
    $DATA_ARGS \
    $OUTPUT_ARGS \
    --distributed-backend nccl \
    --save $CHECKPOINT_PATH \
    --load $CHECKPOINT_PATH
```
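If the goal is the legacy single-file `.pt` checkpoints, recent Megatron-LM versions expose a checkpoint-format switch rather than requiring a version rollback. A hedged sketch — flag names vary across releases (`--ckpt-format` with choices like `torch`/`torch_dist` in newer code, `--use-dist-ckpt` in some older ones), so verify against your checkout:

```shell
# Hypothetical addition to the torchrun invocation above.
# --ckpt-format torch selects the legacy torch.save .pt layout, if your
# version supports it; check `python Megatron-LM/pretrain_gpt.py --help`.
torchrun $DISTRIBUTED_ARGS Megatron-LM/pretrain_gpt.py \
    $GPT_ARGS \
    $DATA_ARGS \
    $OUTPUT_ARGS \
    --distributed-backend nccl \
    --ckpt-format torch \
    --save $CHECKPOINT_PATH \
    --load $CHECKPOINT_PATH
```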