Thank you for the amazing project! We are fine-tuning DeepSeek-Coder-V2-Lite-Instruct and followed the instructions in your README. However, after converting the weights to the mcore format, training fails while loading the converted checkpoint with the following error:
RuntimeError: Error(s) in loading state_dict for GPTModel:
size mismatch for embedding.word_embeddings.weight: copying a param with shape torch.Size([102400, 2048]) from checkpoint, the shape in current model is torch.Size([102416, 2048]).
size mismatch for output_layer.weight: copying a param with shape torch.Size([102400, 2048]) from checkpoint, the shape in current model is torch.Size([102416, 2048]).
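For context, our understanding of how Megatron pads the vocabulary is sketched below; the helper name is ours and the real logic lives inside Megatron-LM's tokenizer setup, so please treat this as an assumption rather than a quote of the code:

```python
def padded_vocab_size(orig_vocab_size: int,
                      make_vocab_size_divisible_by: int = 128,
                      tensor_model_parallel_size: int = 1) -> int:
    """Round the vocabulary up to a multiple of
    make_vocab_size_divisible_by * tensor_model_parallel_size
    (our reading of Megatron-LM's padding rule)."""
    multiple = make_vocab_size_divisible_by * tensor_model_parallel_size
    return ((orig_vocab_size + multiple - 1) // multiple) * multiple

# 102400 is already a multiple of 128, so with tp=1 the padding rule alone
# should leave the vocab at 102400 rather than the 102416 the model expects.
print(padded_vocab_size(102400))  # 102400
```

If that reading is right, the extra 16 rows on the model side would have to come from something else (an extra-vocab setting somewhere in the scripts, perhaps), but we may well be misreading it.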
Full error log:

INFO:megatron.core.optimizer:Setting up optimizer with OptimizerConfig(optimizer='adam', lr=1e-05, min_lr=1e-06, decoupled_lr=None, decoupled_min_lr=None, weight_decay=0.1, fp16=False, bf16=True, params_dtype=torch.bfloat16, loss_scale=None, initial_loss_scale=4294967296, min_loss_scale=1.0, loss_scale_window=1000, hysteresis=2, adam_beta1=0.9, adam_beta2=0.95, adam_eps=1e-08, sgd_momentum=0.9, use_distributed_optimizer=True, overlap_grad_reduce=False, overlap_param_gather=False, clip_grad=1.0, log_num_zeros_in_grad=False, barrier_with_L1_time=True, timers=<megatron.core.timers.Timers object at 0x7f1ca43bfe80>)
> learning rate decay style: cosine
loading release checkpoint from /mnt/deepseek-ckpts/DeepSeek-Coder-V2-Lite-Instruct-to-mcore-tp1-pp1-ep4
Traceback (most recent call last):
  File "/mnt/task_wrapper/user_output/artifacts/Pai-Megatron-Patch/examples/deepseek_v2/pretrain_deepseek.py", line 222, in <module>
    pretrain(train_valid_test_datasets_provider,
  File "/mnt/task_wrapper/user_output/artifacts/Pai-Megatron-Patch/Megatron-LM-240405/megatron/training/training.py", line 236, in pretrain
    model, optimizer, opt_param_scheduler = setup_model_and_optimizer(
  File "/mnt/task_wrapper/user_output/artifacts/Pai-Megatron-Patch/Megatron-LM-240405/megatron/training/training.py", line 518, in setup_model_and_optimizer
    args.iteration, args.num_floating_point_operations_so_far = load_checkpoint(
  File "/mnt/task_wrapper/user_output/artifacts/Pai-Megatron-Patch/Megatron-LM-240405/megatron/training/checkpointing.py", line 718, in load_checkpoint
    model[0].load_state_dict(state_dict['model'], strict=strict)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 2152, in load_state_dict
    raise RuntimeError('Error(s) in loading state_dict for {}:\n\t{}'.format(
RuntimeError: Error(s) in loading state_dict for GPTModel:
    size mismatch for embedding.word_embeddings.weight: copying a param with shape torch.Size([102400, 2048]) from checkpoint, the shape in current model is torch.Size([102416, 2048]).
    size mismatch for output_layer.weight: copying a param with shape torch.Size([102400, 2048]) from checkpoint, the shape in current model is torch.Size([102416, 2048]).
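In case it helps, this is the kind of check we ran against the converted checkpoint; the mp_rank_00 sub-directory and file name below are assumptions based on the usual Megatron layout, so adjust them to whatever the converter actually wrote:

```python
import torch

# Assumed layout: <checkpoint dir>/release/mp_rank_00/model_optim_rng.pt is the
# usual Megatron convention for tp1/pp1; with ep4 the converter may split this
# across several rank directories instead.
ckpt_path = (
    "/mnt/deepseek-ckpts/DeepSeek-Coder-V2-Lite-Instruct-to-mcore-tp1-pp1-ep4"
    "/release/mp_rank_00/model_optim_rng.pt"
)

state = torch.load(ckpt_path, map_location="cpu")
emb = state["model"]["embedding.word_embeddings.weight"]
out = state["model"]["output_layer.weight"]
# The traceback above reports torch.Size([102400, 2048]) for both tensors on the
# checkpoint side, versus [102416, 2048] expected by the freshly built model.
print(emb.shape, out.shape)
```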