alibaba / Pai-Megatron-Patch

The official repo of Pai-Megatron-Patch for LLM & VLM large scale training developed by Alibaba Cloud.
Apache License 2.0

DeepSeek Vocab-size Mismatch #338

Open Jiayi-Pan opened 2 months ago

Jiayi-Pan commented 2 months ago

Thank you for the amazing project! We're currently fine-tuning the DeepSeek model and followed the instructions in your README. However, after converting the weights, we encountered the following error:

RuntimeError: Error(s) in loading state_dict for GPTModel:
        size mismatch for embedding.word_embeddings.weight: copying a param with shape torch.Size([102400, 2048]) from checkpoint, the shape in current model is torch.Size([102416, 2048]).
        size mismatch for output_layer.weight: copying a param with shape torch.Size([102400, 2048]) from checkpoint, the shape in current model is torch.Size([102416, 2048]).
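For context on where the extra 16 rows may come from: Megatron-style models usually pad (and sometimes extend) the tokenizer's vocabulary so the embedding table splits evenly across tensor-parallel ranks, so the built model can expect more rows than the raw 102400-token vocab stored in the converted checkpoint. The sketch below is a hypothetical illustration of that padding logic, not this project's exact code; `extra_vocab_size`, `make_vocab_size_divisible_by`, and `tensor_model_parallel_size` are assumed parameter names.

```python
# Hypothetical sketch of Megatron-style vocab padding (not Pai-Megatron-Patch's exact code).
# It illustrates how a model can end up expecting an embedding table larger than the
# tokenizer's raw vocab size (e.g. 102400 rows in the checkpoint vs. 102416 in the model).

def padded_vocab_size(orig_vocab_size: int,
                      extra_vocab_size: int = 0,
                      make_vocab_size_divisible_by: int = 128,
                      tensor_model_parallel_size: int = 1) -> int:
    """Pad the vocab so the embedding rows split evenly across TP ranks."""
    size = orig_vocab_size + extra_vocab_size               # optional extra/reserved tokens
    multiple = make_vocab_size_divisible_by * tensor_model_parallel_size
    while size % multiple != 0:                             # round up to the next multiple
        size += 1
    return size

if __name__ == "__main__":
    # 102400 is already a multiple of 128, so with no extra tokens nothing changes;
    # a 16-token extension only stays at exactly 102416 for smaller multiples such as 16.
    print(padded_vocab_size(102400))                                               # -> 102400
    print(padded_vocab_size(102400, extra_vocab_size=16,
                            make_vocab_size_divisible_by=16))                      # -> 102416
```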

Command

cd /mnt/task_wrapper/user_output/artifacts/Pai-Megatron-Patch/examples/deepseek_v2
sh run_finetune_deepseek.sh  \
dsw \
A2.4B \
1    \
8    \
1e-5   \
1e-6   \
128  \
128  \
bf16  \
1   \
1  \
4 \
sel \
true \
true \
true \
100  \
/mnt/deepseek-datasets/alpaca_zh-train.json   \
/mnt/deepseek-datasets/alpaca_zh-valid.json   \
/mnt/deepseek-ckpts/DeepSeek-Coder-V2-Lite-Instruct-to-mcore-tp1-pp1-ep4 \
100000   \
10000   \
/mnt/deepseek-ckpts/test_ft

Full error log

INFO:megatron.core.optimizer:Setting up optimizer with OptimizerConfig(optimizer='adam', lr=1e-05, min_lr=1e-06, decoupled_lr=None, decoupled_min_lr=None, weight_decay=0.1, fp16=False, bf16=True, params_dtype=torch.bfloat16, loss_scale=None, initial_loss_scale=4294967296, min_loss_scale=1.0, loss_scale_window=1000, hysteresis=2, adam_beta1=0.9, adam_beta2=0.95, adam_eps=1e-08, sgd_momentum=0.9, use_distributed_optimizer=True, overlap_grad_reduce=False, overlap_param_gather=False, clip_grad=1.0, log_num_zeros_in_grad=False, barrier_with_L1_time=True, timers=<megatron.core.timers.Timers object at 0x7f1ca43bfe80>)
> learning rate decay style: cosine
 loading release checkpoint from /mnt/deepseek-ckpts/DeepSeek-Coder-V2-Lite-Instruct-to-mcore-tp1-pp1-ep4
Traceback (most recent call last):
  File "/mnt/task_wrapper/user_output/artifacts/Pai-Megatron-Patch/examples/deepseek_v2/pretrain_deepseek.py", line 222, in <module>
    pretrain(train_valid_test_datasets_provider,
  File "/mnt/task_wrapper/user_output/artifacts/Pai-Megatron-Patch/Megatron-LM-240405/megatron/training/training.py", line 236, in pretrain
    model, optimizer, opt_param_scheduler = setup_model_and_optimizer(
  File "/mnt/task_wrapper/user_output/artifacts/Pai-Megatron-Patch/Megatron-LM-240405/megatron/training/training.py", line 518, in setup_model_and_optimizer
    args.iteration, args.num_floating_point_operations_so_far = load_checkpoint(
  File "/mnt/task_wrapper/user_output/artifacts/Pai-Megatron-Patch/Megatron-LM-240405/megatron/training/checkpointing.py", line 718, in load_checkpoint
    model[0].load_state_dict(state_dict['model'], strict=strict)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 2152, in load_state_dict
    raise RuntimeError('Error(s) in loading state_dict for {}:\n\t{}'.format(
RuntimeError: Error(s) in loading state_dict for GPTModel:
        size mismatch for embedding.word_embeddings.weight: copying a param with shape torch.Size([102400, 2048]) from checkpoint, the shape in current model is torch.Size([102416, 2048]).
        size mismatch for output_layer.weight: copying a param with shape torch.Size([102400, 2048]) from checkpoint, the shape in current model is torch.Size([102416, 2048]).
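As a quick way to confirm which side carries the 102400-row table, one can inspect the converted checkpoint directly. This is a minimal sketch, assuming the usual Megatron layout of `release/mp_rank_*/model_optim_rng.pt` under the checkpoint directory with the weights stored under a `model` key; the exact subdirectory names depend on the TP/PP/EP layout of the conversion.

```python
# Minimal sketch: print the word-embedding / output-layer shapes stored in a
# converted Megatron/mcore checkpoint. Paths and key names are assumptions
# based on the common Megatron checkpoint layout; adjust them as needed.
import glob
import torch

CKPT_DIR = "/mnt/deepseek-ckpts/DeepSeek-Coder-V2-Lite-Instruct-to-mcore-tp1-pp1-ep4"

for path in sorted(glob.glob(f"{CKPT_DIR}/release/mp_rank_*/model_optim_rng.pt")):
    state = torch.load(path, map_location="cpu")
    model_state = state.get("model", state)
    for name, tensor in model_state.items():
        # Skip non-tensor entries such as '_extra_state'.
        if hasattr(tensor, "shape") and ("word_embeddings" in name or "output_layer" in name):
            print(path, name, tuple(tensor.shape))
```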
jerryli1981 commented 1 month ago

Hello, and thanks. We have just made a key upgrade to DeepSeek-V2; please check whether the problem still exists. If it does, you can re-raise it against the new version, thanks: https://github.com/alibaba/Pai-Megatron-Patch/pull/355