epfLLM / Megatron-LLM

distributed trainer for LLMs

Support specifying load_iters for checkpoint #84

Closed xingyaoww closed 8 months ago

xingyaoww commented 9 months ago

Support converting a sharded checkpoint at a specified iteration back to an unsharded version.

For example, set $LOAD_ITER to 52 to load the checkpoint from the 52nd iteration, $LOAD_DIR/iter_0000052. This overrides the iteration number read from the tracker file.

python Megatron-LLM/tools/checkpoint_util.py \
    --target_tensor_parallel_size 1 \
    --target_pipeline_parallel_size 1 \
    --load_dir $LOAD_DIR \
    --load_iters $LOAD_ITER \
    --save_dir $OUTPUT_DIR \
    --model_type llama2 \
    --true_vocab_size $VOCAB_SIZE \
    --bf16
xingyaoww commented 8 months ago

Hi @AleHD, thanks for your feedback! I checked the two additional places:

Both _get_models and _setup_model_and_optimizer call load_checkpoint in megatron/checkpointing.py to load models, which in turn calls _load_base_checkpoint in the same file (_get_models, _setup_model_and_optimizer, load_checkpoint function).

Since my modification changes _load_base_checkpoint directly (code here) and overrides the metadata iteration when load_iters is specified, do we still need to modify these two functions explicitly?
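For reference, a minimal sketch of the override logic being described (the load_iters name follows the PR description; the function body and surrounding details are assumptions for illustration, not the exact Megatron-LLM source):

import os
import sys

def _load_base_checkpoint(load_dir, args, rank0=False):
    """Resolve which iteration to load, preferring --load_iters when given (sketch)."""
    if getattr(args, "load_iters", None) is not None:
        # Explicit override: skip the tracker file and use the requested iteration.
        iteration = args.load_iters
        release = False
    else:
        # Default behavior: read the iteration number from the tracker file.
        tracker_path = os.path.join(load_dir, "latest_checkpointed_iteration.txt")
        with open(tracker_path, "r") as f:
            metastring = f.read().strip()
        release = metastring == "release"
        iteration = 0 if release else int(metastring)

    # Checkpoint directories are named iter_<7-digit iteration>, e.g. iter_0000052.
    directory = "release" if release else f"iter_{iteration:07d}"
    checkpoint_dir = os.path.join(load_dir, directory)
    print(f"loading checkpoint from {checkpoint_dir}", file=sys.stderr)
    # ... loading of the sharded state dicts would follow here ...
    return checkpoint_dir, iteration, release

Because both call sites reach _load_base_checkpoint through load_checkpoint, handling the override at this single point is what the question above is about.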

xingyaoww commented 8 months ago

@AleHD Thanks a lot! I have accepted the two suggestions!