epfLLM / Megatron-LLM

distributed trainer for LLMs

llama2-7B AssertionError: padded_vocab_size value from checkpoint (32000) is not equal to the input argument value (32256) #81 #103

Open yushengsu-thu opened 3 months ago

yushengsu-thu commented 3 months ago

I followed the getting started guide and used the same arguments: https://epfllm.github.io/Megatron-LLM/guide/getting_started.html

When I run training with the following command:

LOG_ARGS="--log_interval 1 --save_interval 100 --eval_interval 50"
TRAIN_ARGS="--train_iters 6500 --lr_decay_style cosine --lr_warmup_iters 650 --lr 2e-5 --min_lr 2e-6"
DISTRIBUTED_ARGS="--nproc_per_node $NUMBER_OF_GPUS_for_EACH_NODE --nnodes $NUMBER_OF_NODES --node_rank $NODE_ID --master_addr localhost --master_port 6000"
torchrun $DISTRIBUTED_ARGS ../finetune.py \
    --tensor_model_parallel_size 2 \
    --pipeline_model_parallel_size 1 \
    --load $LLM_LOAD_DIR \
    --save $LLM_SAVE_DIR \
    --tensorboard_dir $TENSORBOARD_DIR \
    --data_path $DATA_DIR \
    --model_name llama2 \
    --tokenizer_type SentencePieceTokenizer \
    --vocab_file $VOCAB_PREFIX/tokenizer.model \
    --bf16 \
    --use_flash_attn \
    --micro_batch_size 8 \
    --global_batch_size 64 \
    --sequence_parallel \
    --recompute_granularity selective \
    --use_checkpoint_args \
    --data_type instruction \
    --variable_seq_lengths \
    --vocab_extra_ids_list "<|im_start|>,<|im_end|>" \
    $COMMON_ARGS $LOG_ARGS $TRAIN_ARGS $LLAMA_ARGS

I hit the following error (the same one reported in #81):

AssertionError: padded_vocab_size value from checkpoint (32000) is not equal to the input argument value (32256)

When I sharded the model, I used a true_vocab_size of 32000 as described in the tutorial link (I also tried removing it entirely), but I still hit the same error.

VOCAB_SIZE=32000
python3 ../tools/checkpoint_util.py \
    --target_tensor_parallel_size 2 \
    --target_pipeline_parallel_size 1 \
    --load_dir $LLM_LOAD_DIR \
    --save_dir $LLM_SAVE_SHARDED_DIR \
    --model_type llama2 \
    --true_vocab_size $VOCAB_SIZE \
    --bf16
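
For context on where 32256 comes from: Megatron pads the vocabulary up to a multiple of make_vocab_size_divisible_by (default 128) times the tensor-parallel size, and --vocab_extra_ids_list "<|im_start|>,<|im_end|>" adds two tokens on top of llama2's 32000 before that padding. The snippet below only reproduces that arithmetic; the padding rule is assumed to match upstream Megatron-LM and is not quoted from this repo's code.

# Assumed padding rule: round the vocab up to a multiple of
# make_vocab_size_divisible_by * tensor_model_parallel_size (128 * 2 = 256 here).
BASE=32000        # llama2 tokenizer vocab
EXTRA=2           # "<|im_start|>" and "<|im_end|>"
MULTIPLE=$(( 128 * 2 ))
echo $(( ( (BASE + EXTRA + MULTIPLE - 1) / MULTIPLE ) * MULTIPLE ))   # 32256
echo $(( ( (BASE + MULTIPLE - 1) / MULTIPLE ) * MULTIPLE ))           # 32000

Under that assumption, the sharded checkpoint carries a 32000-token vocabulary (already a multiple of 256), while the finetune run adds the two chat tokens and pads to 32256, which is exactly the mismatch the assertion reports.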

Gannn12138 commented 1 month ago

I have the same problem

LinglingGreat commented 6 hours ago

add --no_new_tokens to your args
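
The comment does not say which command takes the flag. One plausible placement (an assumption, not confirmed in the thread) is the finetune.py invocation, so that the training run stops adding special tokens and its padded vocab size stays at 32000, matching the sharded checkpoint:

# Assumed placement of --no_new_tokens (hypothetical; the comment above does not
# specify where): append it to the finetune.py arguments from the original command.
torchrun $DISTRIBUTED_ARGS ../finetune.py \
    ...                                      # all other arguments as in the original command
    --no_new_tokens \
    $COMMON_ARGS $LOG_ARGS $TRAIN_ARGS $LLAMA_ARGS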