alibaba / Pai-Megatron-Patch

The official repo of Pai-Megatron-Patch for LLM & VLM large scale training developed by Alibaba Cloud.
Apache License 2.0

Missing key(s) in state_dict: weights do not match after llama3 mcore conversion #303

Closed · wuduher closed this issue 2 months ago

wuduher commented 2 months ago

Error message: RuntimeError: Error(s) in loading state_dict for GPTModel: Missing key(s) in state_dict: "decoder.layers.0.self_attention.linear_proj._extra_state", "decoder.layers.0.self_attention.linear_qkv._extra_state", "decoder.layers.0.mlp.linear_fc1._extra_state", "decoder.layers.0.mlp.linear_fc2._extra_state", "decoder.layers.1.self_attention.linear_proj._extra_state", "decoder.layers.1.self_attention.linear_qkv._extra_state", "decoder.layers.1.mlp.linear_fc1._extra_state", "decoder.layers.1.mlp.linear_fc2._extra_state", "decoder.layers.2.self_attention.linear_proj._extra_state", "decoder.layers.2.self_attention.linear_qkv._extra_state", "decoder.layers.2.mlp.linear_fc1._extra_state", "decoder.layers.2.mlp.linear_fc2._extra_state", "decoder.layers.3.self_attention.linear_proj._extra_state", "decoder.layers.3.self_attention.linear_qkv._extra_state", "decoder.layers.3.mlp.linear_fc1._extra_state", "decoder.layers.3.mlp.linear_fc2._extra_state", "decoder.layers.4.self_attention.linear_proj._extra_state", "decoder.layers.4.self_attention.linear_qkv._extra_state", "decoder.layers.4.mlp.linear_fc1._extra_state", "decoder.layers.4.mlp.linear_fc2._extra_state", "decoder.layers.5.self_attention.linear_proj._extra_state", "decoder.layers.5.self_attention.linear_qkv._extra_state", "decoder.layers.5.mlp.linear_fc1._extra_state", "decoder.layers.5.mlp.linear_fc2._extra_state", "decoder.layers.6.self_attention.linear_proj._extra_state", "decoder.layers.6.self_attention.linear_qkv._extra_state", "decoder.layers.6.mlp.linear_fc1._extra_state", "decoder.layers.6.mlp.linear_fc2._extra_state", "decoder.layers.7.self_attention.linear_proj._extra_state", "decoder.layers.7.self_attention.linear_qkv._extra_state", "decoder.layers.7.mlp.linear_fc1._extra_state", "decoder.layers.7.mlp.linear_fc2._extra_state", "decoder.layers.8.self_attention.linear_proj._extra_state", "decoder.layers.8.self_attention.linear_qkv._extra_state", "decoder.layers.8.mlp.linear_fc1._extra_state", "decoder.layers.8.mlp.linear_fc2._extra_state", "decoder.layers.9.self_attention.linear_proj._extra_state", "decoder.layers.9.self_attention.linear_qkv._extra_state", "decoder.layers.9.mlp.linear_fc1._extra_state", "decoder.layers.9.mlp.linear_fc2._extra_state", "decoder.layers.10.self_attention.linear_proj._extra_state", "decoder.layers.10.self_attention.linear_qkv._extra_state", "decoder.layers.10.mlp.linear_fc1._extra_state", "decoder.layers.10.mlp.linear_fc2._extra_state", "decoder.layers.11.self_attention.linear_proj._extra_state", "decoder.layers.11.self_attention.linear_qkv._extra_state", "decoder.layers.11.mlp.linear_fc1._extra_state", "decoder.layers.11.mlp.linear_fc2._extra_state", "decoder.layers.12.self_attention.linear_proj._extra_state", "decoder.layers.12.self_attention.linear_qkv._extra_state", "decoder.layers.12.mlp.linear_fc1._extra_state", "decoder.layers.12.mlp.linear_fc2._extra_state", "decoder.layers.13.self_attention.linear_proj._extra_state", "decoder.layers.13.self_attention.linear_qkv._extra_state", "decoder.layers.13.mlp.linear_fc1._extra_state", "decoder.layers.13.mlp.linear_fc2._extra_state", "decoder.layers.14.self_attention.linear_proj._extra_state", "decoder.layers.14.self_attention.linear_qkv._extra_state", "decoder.layers.14.mlp.linear_fc1._extra_state", "decoder.layers.14.mlp.linear_fc2._extra_state", "decoder.layers.15.self_attention.linear_proj._extra_state", "decoder.layers.15.self_attention.linear_qkv._extra_state", "decoder.layers.15.mlp.linear_fc1._extra_state",
"decoder.layers.15.mlp.linear_fc2._extra_state", "decoder.layers.16.self_attention.linear_proj._extra_state", "decoder.layers.16.self_attention.linear_qkv._extra_state", "decoder.layers.16.mlp.linear_fc1._extra_state", "decoder.layers.16.mlp.linear_fc2._extra_state", "decoder.layers.17.self_attention.linear_proj._extra_state", "decoder.layers.17.self_attention.linear_qkv._extra_state", "decoder.layers.17.mlp.linear_fc1._extra_state", "decoder.layers.17.mlp.linear_fc2._extra_state", "decoder.layers.18.self_attention.linear_proj._extra_state", "decoder.layers.18.self_attention.linear_qkv._extra_state", "decoder.layers.18.mlp.linear_fc1._extra_state", "decoder.layers.18.mlp.linear_fc2._extra_state", "decoder.layers.19.self_attention.linear_proj._extra_state", "decoder.layers.19.self_attention.linear_qkv._extra_state", "decoder.layers.19.mlp.linear_fc1._extra_state", "decoder.layers.19.mlp.linear_fc2._extra_state", "decoder.layers.20.self_attention.linear_proj._extra_state", "decoder.layers.20.self_attention.linear_qkv._extra_state", "decoder.layers.20.mlp.linear_fc1._extra_state", "decoder.layers.20.mlp.linear_fc2._extra_state", "decoder.layers.21.self_attention.linear_proj._extra_state", "decoder.layers.21.self_attention.linear_qkv._extra_state", "decoder.layers.21.mlp.linear_fc1._extra_state", "decoder.layers.21.mlp.linear_fc2._extra_state", "decoder.layers.22.self_attention.linear_proj._extra_state", "decoder.layers.22.self_attention.linear_qkv._extra_state", "decoder.layers.22.mlp.linear_fc1._extra_state", "decoder.layers.22.mlp.linear_fc2._extra_state", "decoder.layers.23.self_attention.linear_proj._extra_state", "decoder.layers.23.self_attention.linear_qkv._extra_state", "decoder.layers.23.mlp.linear_fc1._extra_state", "decoder.layers.23.mlp.linear_fc2._extra_state", "decoder.layers.24.self_attention.linear_proj._extra_state", "decoder.layers.24.self_attention.linear_qkv._extra_state", "decoder.layers.24.mlp.linear_fc1._extra_state", "decoder.layers.24.mlp.linear_fc2._extra_state", "decoder.layers.25.self_attention.linear_proj._extra_state", "decoder.layers.25.self_attention.linear_qkv._extra_state", "decoder.layers.25.mlp.linear_fc1._extra_state", "decoder.layers.25.mlp.linear_fc2._extra_state", "decoder.layers.26.self_attention.linear_proj._extra_state", "decoder.layers.26.self_attention.linear_qkv._extra_state", "decoder.layers.26.mlp.linear_fc1._extra_state", "decoder.layers.26.mlp.linear_fc2._extra_state", "decoder.layers.27.self_attention.linear_proj._extra_state", "decoder.layers.27.self_attention.linear_qkv._extra_state", "decoder.layers.27.mlp.linear_fc1._extra_state", "decoder.layers.27.mlp.linear_fc2._extra_state", "decoder.layers.28.self_attention.linear_proj._extra_state", "decoder.layers.28.self_attention.linear_qkv._extra_state", "decoder.layers.28.mlp.linear_fc1._extra_state", "decoder.layers.28.mlp.linear_fc2._extra_state", "decoder.layers.29.self_attention.linear_proj._extra_state", "decoder.layers.29.self_attention.linear_qkv._extra_state", "decoder.layers.29.mlp.linear_fc1._extra_state", "decoder.layers.29.mlp.linear_fc2._extra_state", "decoder.layers.30.self_attention.linear_proj._extra_state", "decoder.layers.30.self_attention.linear_qkv._extra_state", "decoder.layers.30.mlp.linear_fc1._extra_state", "decoder.layers.30.mlp.linear_fc2._extra_state", "decoder.layers.31.self_attention.linear_proj._extra_state", "decoder.layers.31.self_attention.linear_qkv._extra_state", "decoder.layers.31.mlp.linear_fc1._extra_state", "decoder.layers.31.mlp.linear_fc2._extra_state".

Llama3-8B conversion command:

torchrun ${DISTRIBUTED_ARGS} hf2mcore.py \
  --load_path ${SOURCE_CKPT_PATH} \
  --save_path ${TARGET_CKPT_PATH} \
  --load ${HG_CKPT_PATH} \
  --huggingface_model_path ${HG_CKPT_PATH} \
  --megatron-path ${MEGATRON_PATH} \
  --target_tensor_model_parallel_size ${TP} \
  --target_pipeline_model_parallel_size ${PP} \
  --micro-batch-size 1 \
  --bf16 \
  --swiglu \
  --num-layers ${NUM_LAYERS} \
  --hidden-size 4096 \
  --ffn-hidden-size ${INTERMEDIATE_SIZE} \
  --norm-epsilon 1e-5 \
  --num-attention-heads 32 \
  --max-position-embeddings 8192 \
  --seq-length ${SEQ_LEN} \
  --no-async-tensor-model-parallel-allreduce \
  --patch-tokenizer-type LLamaTokenizer \
  --extra-vocab-size ${EXTRA_VOCAB_SIZE} \
  --untie-embeddings-and-output-weights \
  --no-rope-fusion \
  --use-rotary-position-embeddings \
  --transformer-impl transformer_engine \
  --disable-bias-linear \
  --normalization RMSNorm \
  --use-mcore-models \
  --attention-dropout 0.0 \
  --hidden-dropout 0.0 \
  ${expert_options} \
  ${convert_options} \
  ${gqa_options}
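As background, the `_extra_state` entries named in the error are not weights; TransformerEngine modules register them to serialize runtime metadata such as FP8 state. A quick way to check whether any of them survived conversion is sketched below (the path is illustrative; the actual layout under ${TARGET_CKPT_PATH} depends on the iteration directory and the TP/PP ranks):

import torch

# Illustrative path, e.g. something like release/mp_rank_00/model_optim_rng.pt
ckpt = torch.load("model_optim_rng.pt", map_location="cpu", weights_only=False)
state_dict = ckpt["model"] if "model" in ckpt else ckpt

# TransformerEngine layers expect "_extra_state" entries (FP8 metadata etc.).
# If none are present in the converted checkpoint, a strict load reports them
# all as missing, exactly as in the error above.
extra = [k for k in state_dict if k.endswith("_extra_state")]
print(f"found {len(extra)} _extra_state entries")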

divisionblur commented 2 months ago

It seems that commenting out the code in the weight-conversion script that pops the extra_state entries should be enough. The extra_state probably doesn't matter anyway.
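For reference, a minimal sketch of the kind of filtering being discussed (hypothetical, not hf2mcore.py's actual code): if the converter drops these entries, the saved checkpoint no longer matches GPTModel's expected state_dict.

# Hypothetical sketch: dropping the TransformerEngine metadata entries like
# this is what later produces "Missing key(s) in state_dict" when the
# checkpoint is loaded strictly.
def strip_extra_state(state_dict):
    return {k: v for k, v in state_dict.items()
            if not k.endswith("_extra_state")}

# Keeping the entries instead (i.e. not dropping/popping them) leaves
# placeholders that the model can load without complaint.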

wuduher commented 2 months ago

It seems that commenting out the code in the weight-conversion script that pops the extra_state entries should be enough. The extra_state probably doesn't matter anyway.

[screenshot] Is it this line? I tried it, but the extra_state entries still seem to be there.

wuduher commented 2 months ago

It seems that commenting out the code in the weight-conversion script that pops the extra_state entries should be enough. The extra_state probably doesn't matter anyway.

It's probably that this code didn't execute correctly; the _extra_state entries should have been popped. I'll look into what the problem is.

aidenhe commented 2 months ago

Change the strict mode in load_checkpoint to False.
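In plain PyTorch terms, a non-strict load records the mismatched keys instead of raising, and the missing _extra_state buffers are simply left at their freshly initialized values. A generic illustration of that behaviour (a toy example, not the repo's own load_checkpoint):

import torch.nn as nn

model = nn.Linear(4, 4)
ckpt = {"weight": model.state_dict()["weight"]}  # deliberately missing "bias"

# strict=False records the mismatch instead of raising RuntimeError.
result = model.load_state_dict(ckpt, strict=False)
print(result.missing_keys)     # ['bias']
print(result.unexpected_keys)  # []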

wuduher commented 2 months ago

Change the strict mode in load_checkpoint to False.