NVIDIA / Megatron-LM

Ongoing research training transformer models at scale
https://docs.nvidia.com/megatron-core/developer-guide/latest/user-guide/index.html#quick-start

[BUG]"Unexpected key(s) in state_dict" while loading Llama-megatron checkpoint. #1132

Open mxjmtxrm opened 1 week ago

mxjmtxrm commented 1 week ago

Hi, I tried to finetune the Llama2-7b-chat model using Megatron. I downloaded the HF checkpoint and converted it to a Megatron GPT checkpoint following [docs/llama_mistral.md](https://github.com/NVIDIA/Megatron-LM/blob/fe1640a3cc4866e015bfdb6449f0d1943d2243cb/docs/llama_mistral.md?plain=1#L73). The command I used is:

python tools/checkpoint/convert.py \
    --model-type GPT \
    --loader llama_mistral \
    --saver megatron \
    --target-tensor-parallel-size 1 \
    --checkpoint-type hf \
    --model-size llama2-7Bf \
    --load-dir Llama-2-7b-chat-hf \
    --save-dir ./Llama-2-7b-chat-pp1 \
    --tokenizer-model Llama-2-7b-chat-hf/tokenizer.model
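
(To sanity-check what the converter wrote, the saved model state dict can be inspected along these lines. This is only a sketch that assumes the usual Megatron checkpoint layout, i.e. one *.pt file per rank under the save directory; the exact subdirectory name may differ.)

# pick up the rank-0 checkpoint file written by the converter
CKPT_FILE=$(find ./Llama-2-7b-chat-pp1 -name '*.pt' | head -n 1)

# print the top-level keys of the saved model state dict:
# a legacy-format checkpoint shows a single "language_model" entry, while an
# mcore-format one shows flat keys such as "embedding.word_embeddings.weight"
python -c "
import sys, torch
sd = torch.load(sys.argv[1], map_location='cpu', weights_only=False)
print(list(sd['model'].keys())[:10])
" "$CKPT_FILE"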

Then I tried to train the llama:

#!/bin/bash

DISTRIBUTED_ARGS=(
    --nproc_per_node $GPUS_PER_NODE 
    --nnodes $NUM_NODES 
    --master_addr $MASTER_ADDR 
    --master_port $MASTER_PORT
)

GPT_MODEL_ARGS=(
    --num-layers ${NUM_LAYERS} 
    --hidden-size ${HIDDEN_SIZE} 
    --num-attention-heads ${NUM_HEAD} 
    --ffn-hidden-size ${FFN_HIDDEN_SIZE} 
    --position-embedding-type rope 
    --max-position-embeddings ${MAX_POSITION_EMBEDDINGS} 
    --seq-length 4096 
)

TRAINING_ARGS=(
    --micro-batch-size 1 
    --global-batch-size 32 
    --train-iters 50 
    --weight-decay 0.1 
    --adam-beta1 0.9 
    --adam-beta2 0.95 
    --init-method-std 0.006 
    --clip-grad 1.0 
    --bf16
    --lr 6.0e-5 
    --lr-decay-style cosine 
    --min-lr 6.0e-6
    --lr-warmup-fraction .001 
    --lr-decay-iters 30 
    --no-load-rng 
    --no-load-optim
    --exit-on-missing-checkpoint
    --use-checkpoint-args 
    --untie-embeddings-and-output-weights 
    --use-rotary-position-embeddings
    --use-flash-attn 
    --no-position-embedding
    --no-masked-softmax-fusion
    --attention-softmax-in-fp32
)

MODEL_PARALLEL_ARGS=(
    --tensor-model-parallel-size 1
    --pipeline-model-parallel-size 1
)

DATA_ARGS=(
    --data-path $DATA_PATH 
    --split 949,50,1
    --tokenizer-model ${TOKENIZER_PATH}
    --data-cache-path ./data_cache 
    --tokenizer-type Llama2Tokenizer
)

EVAL_AND_LOGGING_ARGS=(
    --log-interval 1
    --save-interval 5 
    --eval-interval 5 
    --save $CHECKPOINT_PATH 
    --load $CHECKPOINT_PATH 
    --eval-iters 10
    --tensorboard-dir $TENSORBOARD_LOGS_PATH 
)

torchrun ${DISTRIBUTED_ARGS[@]} pretrain_gpt.py \
    ${GPT_MODEL_ARGS[@]} \
    ${TRAINING_ARGS[@]} \
    ${MODEL_PARALLEL_ARGS[@]} \
    ${DATA_ARGS[@]} \
    ${EVAL_AND_LOGGING_ARGS[@]}
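
(Not shown above are the environment variables the script relies on. For Llama-2-7B they would be along these lines; the model-shape numbers come from the HF config, while the distributed settings and paths are single-node placeholders.)

GPUS_PER_NODE=8
NUM_NODES=1
MASTER_ADDR=localhost
MASTER_PORT=6000

# Llama-2-7B architecture, matching Llama-2-7b-chat-hf/config.json
NUM_LAYERS=32
HIDDEN_SIZE=4096
NUM_HEAD=32
FFN_HIDDEN_SIZE=11008
MAX_POSITION_EMBEDDINGS=4096

# paths (placeholders)
DATA_PATH=/path/to/dataset_text_document     # prefix of the preprocessed .bin/.idx pair
TOKENIZER_PATH=Llama-2-7b-chat-hf/tokenizer.model
CHECKPOINT_PATH=./Llama-2-7b-chat-pp1        # the converter's --save-dir
TENSORBOARD_LOGS_PATH=./tensorboard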

I ran into the following error:

RuntimeError: Error(s) in loading state_dict for GPTModel:
    Missing key(s) in state_dict: "embedding.word_embeddings.weight", "decoder.layers.0.self_attention.linear_proj.weight",...
    Unexpected key(s) in state_dict: "language_model".

How can I solve this problem?

lmcafee-nvidia commented 2 days ago

@mxjmtxrm , our instructions could be clearer in these docs regarding the compatibility between the converter's --saver arg and the training model format. There are two model formats, legacy (a.k.a., 'megatron') and mcore. In the docs and in your command above, --saver megatron saves to the legacy format, but during training, the default format is mcore, unless otherwise specified. There are two options for your issue:
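
(Presumably: either re-convert so the checkpoint is written in the mcore format, which matches the training default, or keep the legacy checkpoint and tell training to build the legacy model. A rough sketch of both, assuming the saver is called mcore in tools/checkpoint and that the --use-legacy-models flag is available in your checkout; check tools/checkpoint and megatron/training/arguments.py for the exact names.)

# Option 1 (assumed): re-run the converter with the mcore saver so the saved
# checkpoint uses the format that training builds by default
python tools/checkpoint/convert.py \
    --model-type GPT \
    --loader llama_mistral \
    --saver mcore \
    --target-tensor-parallel-size 1 \
    --checkpoint-type hf \
    --model-size llama2-7Bf \
    --load-dir Llama-2-7b-chat-hf \
    --save-dir ./Llama-2-7b-chat-mcore-pp1 \
    --tokenizer-model Llama-2-7b-chat-hf/tokenizer.model

# Option 2 (assumed): keep the legacy checkpoint produced by --saver megatron
# and add this flag to TRAINING_ARGS so training builds the matching legacy model
#     --use-legacy-models

With option 1, point --load (and --save) at the new save directory instead of the legacy one.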

Let me know if you have any questions.