epfLLM / Megatron-LLM

distributed trainer for LLMs

llama2-7B AssertionError: padded_vocab_size value from checkpoint (32000) is not equal to the input argument value (32256) #81

Closed. 13416157913 closed this issue 6 months ago.

13416157913 commented 9 months ago

Hello, when I run finetuning of llama2-7B I hit this error:

Traceback (most recent call last):
  File "/home/dengkaibiao/Megatron-LLM/finetune.py", line 261, in <module>
    pretrain(args, data_provider, model_provider, ModelType.encoder_or_decoder,
  File "/home/dengkaibiao/Megatron-LLM/megatron/training.py", line 108, in pretrain
    model, optimizer, opt_param_scheduler = _setup_model_and_optimizer(
  File "/home/dengkaibiao/Megatron-LLM/megatron/training.py", line 371, in _setup_model_and_optimizer
    args.iteration = load_checkpoint(model, optimizer, opt_param_scheduler)
  File "/home/dengkaibiao/Megatron-LLM/megatron/checkpointing.py", line 603, in load_checkpoint
    check_checkpoint_args(checkpoint_args)
  File "/home/dengkaibiao/Megatron-LLM/megatron/checkpointing.py", line 57, in check_checkpoint_args
    _compare('padded_vocab_size')
  File "/home/dengkaibiao/Megatron-LLM/megatron/checkpointing.py", line 49, in _compare
    assert checkpoint_value == args_value, error_message
AssertionError: padded_vocab_size value from checkpoint (32000) is not equal to the input argument value (32256).
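For context on where the two numbers come from: Megatron derives padded_vocab_size from the tokenizer's vocabulary size, rounded up to a multiple of make_vocab_size_divisible_by times the tensor-parallel size. Below is a minimal sketch of that rounding rule (mirroring the logic of Megatron's _vocab_size_with_padding); the concrete input sizes are only illustrative, since the exact vocabulary the tokenizer reports in this run, including any added special tokens, is not shown in the issue.

```python
# Sketch of Megatron's vocab padding rule: the raw tokenizer vocab is rounded up
# until it is divisible by make_vocab_size_divisible_by * tensor_model_parallel_size.
def padded_vocab_size(orig_vocab_size: int,
                      make_vocab_size_divisible_by: int,
                      tensor_model_parallel_size: int) -> int:
    multiple = make_vocab_size_divisible_by * tensor_model_parallel_size
    after = orig_vocab_size
    while after % multiple != 0:
        after += 1
    return after

# Illustrative only (assumed inputs): 32000 stays 32000 with divisor 1 and TP=2,
# while a slightly larger vocab (e.g. with a few extra special tokens) padded to a
# multiple of 128 * 2 lands on 32256. A mismatch means the checkpoint and the
# current run computed this value under different settings.
print(padded_vocab_size(32000, 1, 2))    # 32000
print(padded_vocab_size(32005, 128, 2))  # 32256
```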

This is my script:

export CUDA_DEVICE_MAX_CONNECTIONS=1

LOG_ARGS="--log_interval 1 --save_interval 10 --eval_interval 10"
TRAIN_ARGS="--train_iters 10 --lr_decay_style cosine --lr_warmup_iters 5 --lr 3e-4 --min_lr 1e-6"
DISTRIBUTED_ARGS="--nproc_per_node 2 --nnodes 1 --node_rank 0 --master_addr localhost --master_port 8000"
COMMON_ARGS="--num_layers 32 --num_attention_heads 32 --seq_length 4096 --max_position_embeddings 4096 \
    --ffn_hidden_size 11008 --hidden_dropout 0.0 --position_embedding_type rotary \
    --no_bias_gelu_fusion --no_bias_dropout_fusion --use_checkpoint_args --attention_dropout 0.0 \
    --adam_beta1 0.9 --adam_beta2 0.95 --adam_eps 1e-5 --layernorm_epsilon 1e-6 --weight_decay 0.1 \
    --sequence_parallel --recompute_activations --recompute_granularity selective \
    --log_timers_to_tensorboard --rope_scaling_factor 1.0"

export CUDA_VISIBLE_DEVICES=1,2

torchrun $DISTRIBUTED_ARGS finetune.py \
    --tensor_model_parallel_size 2 \
    --pipeline_model_parallel_size 1 \
    --load /home/dengkaibiao/Megatron-LLM-sharded-weights-7B-TP2 \
    --save /home/dengkaibiao/Megatron-LLM-sharded-weights-7B-TP2 \
    --tensorboard_dir /home/dengkaibiao/Megatron-LLM-sharded-weights-7B-TP2/tensorboard/ \
    --data_path /home/dengkaibiao/Megatron-LLM/corpus_indexed/china_text_document \
    --split 100,0,0 \
    --model_name llama2 \
    --tokenizer_type SentencePieceTokenizer \
    --vocab_file=/home/dengkaibiao/Llama-2-7b-hf/tokenizer.model \
    --make_vocab_size_divisible_by 1 \
    --bf16 \
    --global_batch_size 128 \
    --micro_batch_size 1 \
    --use_flash_attn \
    $COMMON_ARGS $LOG_ARGS $TRAIN_ARGS
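One way to see which value the sharded checkpoint actually stored is to load its saved args directly. A quick check along these lines should work; the subdirectory and file names below follow the usual Megatron checkpoint layout (e.g. release/ or iter_.../ under the --load directory), so adjust the path to whatever actually exists there.

```python
import torch

# Hypothetical path: pick whichever subdirectory exists under the --load directory
# (e.g. release/ or iter_0000010/) and one of its mp_rank_* folders.
ckpt = "/home/dengkaibiao/Megatron-LLM-sharded-weights-7B-TP2/release/mp_rank_00/model_optim_rng.pt"

state = torch.load(ckpt, map_location="cpu")
saved_args = state["args"]  # Megatron checkpoints store the args namespace used at save time
print("padded_vocab_size:", saved_args.padded_vocab_size)
print("make_vocab_size_divisible_by:", getattr(saved_args, "make_vocab_size_divisible_by", None))
print("tensor_model_parallel_size:", saved_args.tensor_model_parallel_size)
```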

AleHD commented 8 months ago

The padded_vocab_size might have been modified when sharding the weights. Did you specify --true_vocab_size? What command did you use to shard the weights?
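For reference, sharding in this repo is normally done with tools/checkpoint_util.py, roughly as in the getting-started guide. The sketch below is not the exact command used here: the load/save paths are placeholders or the reporter's directories, --true_vocab_size 32000 corresponds to the Llama-2 SentencePiece vocabulary, and the flag names should be checked against the checked-out version.

```bash
python tools/checkpoint_util.py \
    --target_tensor_parallel_size 2 \
    --target_pipeline_parallel_size 1 \
    --load_dir /path/to/unsharded/megatron/llama2-7b \
    --save_dir /home/dengkaibiao/Megatron-LLM-sharded-weights-7B-TP2 \
    --model_type llama2 \
    --true_vocab_size 32000 \
    --bf16
```

If the sharding step padded the vocabulary differently from what finetune.py later computes from the tokenizer and its own flags, the comparison in check_checkpoint_args fails exactly as in the traceback above.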