BAAI-DCAI / Bunny

A family of lightweight multimodal models.
Apache License 2.0

tokenization mismatch when fine-tuning bunny-phi3 #111

Closed: simplelifetime closed this issue 1 month ago

simplelifetime commented 1 month ago

I'm wondering what causes this error. Do I have to set --version phi3 during the pre-training stage? I use --version plain in the pre-train stage and --version phi3 in the fine-tune stage. Is this the correct setting? If so, what's causing the error?

Isaachhh commented 1 month ago

"I use --version plain in pre-train stage and --version phi3 in fine-tune stage." Correct.

What is your training script?

simplelifetime commented 1 month ago

pretrain stage:

    #!/bin/bash

    export HF_ENDPOINT=https://hf-mirror.com

    MODEL_TYPE=phi-3
    OUTPUT_DIR=bunny-$MODEL_TYPE-pretrain

    mkdir -p data/zkl/checkpoints-pretrain-bunny/$OUTPUT_DIR

    deepspeed --include localhost:0,1,2,3,4,5,6,7 --master_port 13111 bunny/train/train.py \
        --deepspeed ./script/deepspeed/zero2.json \
        --model_name_or_path data/zkl/hf_models/phi-3 \
        --model_type $MODEL_TYPE \
        --version plain \
        --data_path data/zkl/blip_laion_cc_sbu_558k.json \
        --image_folder data/zkl/llava_datasets/cc585k \
        --vision_tower openai/clip-vit-large-patch14-336 \
        --mm_projector_type mlp2x_gelu \
        --tune_mm_mlp_adapter True \
        --image_aspect_ratio square \
        --bf16 True \
        --output_dir data/zkl/checkpoints-pretrain-bunny/$OUTPUT_DIR \
        --num_train_epochs 1 \
        --per_device_train_batch_size 8 \
        --per_device_eval_batch_size 4 \
        --gradient_accumulation_steps 4 \
        --evaluation_strategy "no" \
        --save_strategy "steps" \
        --save_steps 24000 \
        --save_total_limit 1 \
        --learning_rate 5e-4 \
        --weight_decay 0. \
        --warmup_ratio 0.03 \
        --lr_scheduler_type "cosine" \
        --logging_steps 1 \
        --tf32 True \
        --model_max_length 2048 \
        --gradient_checkpointing True \
        --dataloader_num_workers 4 \
        --lazy_preprocess True \
        --report_to none | tee 2>&1 data/zkl/checkpoints-pretrain-bunny/$OUTPUT_DIR/log.txt

Pretraining works fine. For the fine-tune stage, the script is like this:

    #!/bin/bash

    export HF_ENDPOINT=https://hf-mirror.com

    MODEL_TYPE=phi-3

    PRETRAIN_DIR=bunny-$MODEL_TYPE-pretrain
    OUTPUT_DIR=bunny-$MODEL_TYPE

    mkdir -p data/zkl/checkpoints-sft-bunny/$OUTPUT_DIR

    deepspeed --include localhost:0,1,2,3,4,5,6,7 --master_port 13111 bunny/train/train.py \
        --deepspeed ./script/deepspeed/zero3.json \
        --model_name_or_path data/zkl/hf_models/phi-3 \
        --model_type $MODEL_TYPE \
        --version phi3 \
        --data_path data/zkl/mix_1m.json \
        --image_folder data/zkl/llava_datasets \
        --vision_tower openai/clip-vit-large-patch14-336 \
        --pretrain_mm_mlp_adapter data/zkl/checkpoints-pretrain-bunny/$PRETRAIN_DIR/mm_projector.bin \
        --mm_projector_type mlp2x_gelu \
        --image_aspect_ratio pad \
        --group_by_modality_length False \
        --bf16 True \
        --output_dir data/zkl/checkpoints-sft-bunny/$OUTPUT_DIR \
        --num_train_epochs 1 \
        --per_device_train_batch_size 8 \
        --per_device_eval_batch_size 4 \
        --gradient_accumulation_steps 2 \
        --evaluation_strategy "no" \
        --save_strategy "steps" \
        --save_steps 500 \
        --save_total_limit 1 \
        --learning_rate 2e-5 \
        --weight_decay 0. \
        --warmup_ratio 0.03 \
        --lr_scheduler_type "cosine" \
        --logging_steps 1 \
        --tf32 True \
        --model_max_length 2048 \
        --gradient_checkpointing True \
        --dataloader_num_workers 4 \
        --lazy_preprocess True \
        --report_to none | tee 2>&1 data/zkl/checkpoints-sft-bunny/$OUTPUT_DIR/log.txt

It keeps printing tokenization mismatch during fine-tuning. I don't know what causes it.

Isaachhh commented 1 month ago

What does data/zkl/hf_models/phi-3 come from? Phi-3-mini-4k-instruct?

simplelifetime commented 1 month ago

Yes. The tokenizer_config.json looks like this:

    {
      "add_bos_token": false,
      "add_eos_token": false,
      "added_tokens_decoder": {
        "0": { "content": "<unk>", "lstrip": false, "normalized": false, "rstrip": false, "single_word": false, "special": true },
        "1": { "content": "<s>", "lstrip": false, "normalized": false, "rstrip": false, "single_word": false, "special": true },
        "2": { "content": "</s>", "lstrip": false, "normalized": false, "rstrip": true, "single_word": false, "special": false },
        "32000": { "content": "<|endoftext|>", "lstrip": false, "normalized": false, "rstrip": false, "single_word": false, "special": true },
        "32001": { "content": "<|assistant|>", "lstrip": false, "normalized": false, "rstrip": true, "single_word": false, "special": true },
        "32002": { "content": "<|placeholder1|>", "lstrip": false, "normalized": false, "rstrip": true, "single_word": false, "special": true },
        "32003": { "content": "<|placeholder2|>", "lstrip": false, "normalized": false, "rstrip": true, "single_word": false, "special": true },
        "32004": { "content": "<|placeholder3|>", "lstrip": false, "normalized": false, "rstrip": true, "single_word": false, "special": true },
        "32005": { "content": "<|placeholder4|>", "lstrip": false, "normalized": false, "rstrip": true, "single_word": false, "special": true },
        "32006": { "content": "<|system|>", "lstrip": false, "normalized": false, "rstrip": true, "single_word": false, "special": true },
        "32007": { "content": "<|end|>", "lstrip": false, "normalized": false, "rstrip": true, "single_word": false, "special": true },
        "32008": { "content": "<|placeholder5|>", "lstrip": false, "normalized": false, "rstrip": true, "single_word": false, "special": true },
        "32009": { "content": "<|placeholder6|>", "lstrip": false, "normalized": false, "rstrip": true, "single_word": false, "special": true },
        "32010": { "content": "<|user|>", "lstrip": false, "normalized": false, "rstrip": true, "single_word": false, "special": true }
      },
      "bos_token": "<s>",
      "chat_template": "{% for message in messages %}{% if message['role'] == 'system' %}{{'<|system|>\n' + message['content'] + '<|end|>\n'}}{% elif message['role'] == 'user' %}{{'<|user|>\n' + message['content'] + '<|end|>\n'}}{% elif message['role'] == 'assistant' %}{{'<|assistant|>\n' + message['content'] + '<|end|>\n'}}{% endif %}{% endfor %}{% if add_generation_prompt %}{{ '<|assistant|>\n' }}{% else %}{{ eos_token }}{% endif %}",
      "clean_up_tokenization_spaces": false,
      "eos_token": "<|endoftext|>",
      "legacy": false,
      "model_max_length": 4096,
      "pad_token": "<|endoftext|>",
      "padding_side": "left",
      "sp_model_kwargs": {},
      "tokenizer_class": "LlamaTokenizer",
      "unk_token": "<unk>",
      "use_default_system_prompt": false
    }
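Note the `"add_bos_token": false`: with this config the tokenizer no longer prepends `<s>`. A quick sanity check (a minimal sketch, using the local path from my script) confirms what it actually emits:

    from transformers import AutoTokenizer

    # Local checkpoint path from this thread; adjust as needed.
    tok = AutoTokenizer.from_pretrained("data/zkl/hf_models/phi-3")

    ids = tok("hello").input_ids
    # With add_bos_token = false, no <s> (the bos_token) is prepended,
    # so the first id is not the BOS id and this prints False.
    print(ids[0] == tok.bos_token_id)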

Isaachhh commented 1 month ago

Try editing here to:

    if conversation_lib.default_conversation.version in {"bunny", "phi3"}:
        return preprocess_bunny(sources, tokenizer, has_image=has_image)
    elif conversation_lib.default_conversation.version in {"minicpm", "llama"}:
        return preprocess_bunny_with_bos(sources, tokenizer, has_image=has_image)
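
For context on why the routing matters: preprocess_bunny assumes the input_ids start directly at the first prompt token, while preprocess_bunny_with_bos accounts for one extra `<s>` at position 0 when masking labels. Below is a minimal sketch of the LLaVA-style length check behind the warning; the function and variable names are illustrative, not Bunny's exact code:

    # Sketch of the per-sample length check that prints the warning.
    def check_round_lengths(tokenizer, conversation, rounds, expect_bos):
        total_len = len(tokenizer(conversation).input_ids)
        # The with-BOS variant starts the running count at 1 for the
        # leading <s>; the no-BOS variant starts at 0.
        cur_len = 1 if expect_bos else 0
        for text in rounds:
            cur_len += len(tokenizer(text, add_special_tokens=False).input_ids)
        if cur_len != total_len:
            print(f"WARNING: tokenization mismatch: {cur_len} vs. {total_len}")

    # With the upgraded Phi-3 tokenizer (add_bos_token = false),
    # expect_bos=True over-counts by one on every sample, so routing
    # "phi3" to preprocess_bunny (no BOS) removes the off-by-one.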

simplelifetime commented 1 month ago

Thanks! That solves my problem.

Isaachhh commented 1 month ago

@simplelifetime

This is because Microsoft upgraded Phi-3 20 days ago and changed how the tokenizer handles the bos_token (commit).

And our code is based on the original Phi-3.
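
If you prefer not to patch the code, one possible workaround (a sketch only; "PRE_UPGRADE_COMMIT" is a placeholder, substitute the actual commit hash of the pre-upgrade revision from the Hugging Face Hub) is to pin the tokenizer to the older revision:

    from transformers import AutoTokenizer

    # "PRE_UPGRADE_COMMIT" is a placeholder; replace it with the commit
    # hash of the Phi-3-mini-4k-instruct revision from before the
    # tokenizer change.
    tok = AutoTokenizer.from_pretrained(
        "microsoft/Phi-3-mini-4k-instruct",
        revision="PRE_UPGRADE_COMMIT",
    )
    print(tok.add_bos_token)  # expected True before the upgrade, False after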