NVIDIA / Megatron-LM

Ongoing research training transformer models at scale
https://docs.nvidia.com/megatron-core/developer-guide/latest/user-guide/index.html#quick-start

[QUESTION] how to configure llama3 model #845

Closed: ltm920716 closed this issue 4 weeks ago

ltm920716 commented 1 month ago

Hi, I have run the GPT-2 demo successfully with sh examples/pretrain_gpt.sh, and now I want to build the llama3-8b model with Megatron-LM. So I changed the params in examples/pretrain_gpt.sh as below:

GPT_ARGS="
    --num-layers 2 \
    --hidden-size 4096 \
    --num-attention-heads 32 \
    --seq-length 512 \
    --max-position-embeddings 8192 \
    --ffn-hidden-size 14336 \
    --micro-batch-size 1 \
    --global-batch-size 2 \
    --lr 0.00015 \
    --train-iters 5000 \
    --lr-decay-iters 3200 \
    --lr-decay-style cosine \
    --min-lr 1.0e-5 \
    --weight-decay 1e-2 \
    --lr-warmup-fraction .01 \
    --num-query-groups 8 \
    --group-query-attention \
    --fp16 \
    --use-rotary-position-embeddings \
    --normalization RMSNorm \
    --no-position-embedding \
    --attention-softmax-in-fp32
"

I also added a code snippet to pretrain_gpt.py to print the model layers, as follows:

for name, param in model.named_parameters():
    print(f"{name} {param.shape}")

The output is:

language_model.embedding.word_embeddings.weight torch.Size([128000, 4096])
language_model.encoder.layers.0.self_attention.layernorm_qkv.layer_norm_weight torch.Size([4096])
language_model.encoder.layers.0.self_attention.layernorm_qkv.weight torch.Size([12288, 4096])
language_model.encoder.layers.0.self_attention.layernorm_qkv.bias torch.Size([12288])
language_model.encoder.layers.0.self_attention.proj.weight torch.Size([4096, 4096])
language_model.encoder.layers.0.self_attention.proj.bias torch.Size([4096])
language_model.encoder.layers.0.layernorm_mlp.layer_norm_weight torch.Size([4096])
language_model.encoder.layers.0.layernorm_mlp.fc1_weight torch.Size([14336, 4096])
language_model.encoder.layers.0.layernorm_mlp.fc1_bias torch.Size([14336])
language_model.encoder.layers.0.layernorm_mlp.fc2_weight torch.Size([4096, 14336])
language_model.encoder.layers.0.layernorm_mlp.fc2_bias torch.Size([4096])
language_model.encoder.layers.1.self_attention.layernorm_qkv.layer_norm_weight torch.Size([4096])
language_model.encoder.layers.1.self_attention.layernorm_qkv.weight torch.Size([12288, 4096])
language_model.encoder.layers.1.self_attention.layernorm_qkv.bias torch.Size([12288])
language_model.encoder.layers.1.self_attention.proj.weight torch.Size([4096, 4096])
language_model.encoder.layers.1.self_attention.proj.bias torch.Size([4096])
language_model.encoder.layers.1.layernorm_mlp.layer_norm_weight torch.Size([4096])
language_model.encoder.layers.1.layernorm_mlp.fc1_weight torch.Size([14336, 4096])
language_model.encoder.layers.1.layernorm_mlp.fc1_bias torch.Size([14336])
language_model.encoder.layers.1.layernorm_mlp.fc2_weight torch.Size([4096, 14336])
language_model.encoder.layers.1.layernorm_mlp.fc2_bias torch.Size([4096])
language_model.encoder.final_norm.weight torch.Size([4096])

I think the qkv part is not correct, right? The layernorm_qkv.weight shape is [12288, 4096], i.e. 3 * 4096 rows, which looks like plain multi-head attention rather than grouped-query attention.

The params:

--num-query-groups 8
--group-query-attention

seem to have no effect. Please help, thanks!
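
As a sanity check, here is a minimal arithmetic sketch (plain Python, not Megatron code) of the fused QKV projection size I would expect with and without grouped-query attention, for hidden-size 4096, 32 attention heads, and 8 query groups. The 12288 rows printed above match the plain multi-head case, which suggests the GQA flags were not applied:

# Expected rows of the fused QKV projection (assumes head_dim = hidden_size / num_heads)
hidden_size = 4096
num_heads = 32
num_query_groups = 8
head_dim = hidden_size // num_heads                           # 128

qkv_rows_mha = 3 * hidden_size                                # 12288, what the printout shows
qkv_rows_gqa = hidden_size + 2 * num_query_groups * head_dim  # 4096 + 2 * 1024 = 6144

print(qkv_rows_mha, qkv_rows_gqa)                             # 12288 6144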

By the way, I have converted the llama3-8b HF checkpoint to Megatron format; the converted model layers are:

word_embeddings torch.Size([128256, 4096])
layers.0.input_norm.weight torch.Size([4096])
layers.0.self_attention.query_key_value.weight torch.Size([6144, 4096])
layers.0.self_attention.dense.weight torch.Size([4096, 4096])
layers.0.post_attention_norm.weight torch.Size([4096])
layers.0.mlp.dense_h_to_4h.weight torch.Size([28672, 4096])
layers.0.mlp.dense_4h_to_h.weight torch.Size([4096, 14336])
layers.1.input_norm.weight torch.Size([4096])
layers.1.self_attention.query_key_value.weight torch.Size([6144, 4096])
layers.1.self_attention.dense.weight torch.Size([4096, 4096])
layers.1.post_attention_norm.weight torch.Size([4096])
layers.1.mlp.dense_h_to_4h.weight torch.Size([28672, 4096])
layers.1.mlp.dense_4h_to_h.weight torch.Size([4096, 14336])
layers.2.input_norm.weight torch.Size([4096])
......
final_norm.weight torch.Size([4096])
weight torch.Size([128256, 4096])
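
For comparison, a rough shape check against the converted checkpoint above (my reading, assuming a Llama-style gated MLP where the gate and up projections are fused into dense_h_to_4h):

# Shape check against the converted llama3-8b checkpoint above
hidden_size = 4096
ffn_hidden_size = 14336
num_heads = 32
num_query_groups = 8
head_dim = hidden_size // num_heads                     # 128

# query_key_value: Q (4096) plus K and V for 8 groups -> 4096 + 2 * 8 * 128 = 6144 rows
print(hidden_size + 2 * num_query_groups * head_dim)    # 6144

# dense_h_to_4h fuses the gate and up projections: 2 * 14336 = 28672 rows
print(2 * ffn_hidden_size)                              # 28672
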
arktoswb commented 1 month ago

By the way, I have converted the llama3-8b HF checkpoint to Megatron format; the converted model layers are

My understanding is that the megatron model type is deprecated. Consider using the mcore model type and --use-mcore-models when training.

ltm920716 commented 4 weeks ago

My understanding is that the megatron model type is deprecated. Consider using the mcore model type and --use-mcore-models when training.

Hi, thanks. --use-mcore-models is useful, but things like the FFN gate still do not match, so I will look into the NeMo Framework Launcher and check the differences.