Victorwz / LLaVA-Llama-3

Reproduction of LLaVA-v1.5 based on Llama-3-8b LLM backbone.
https://huggingface.co/weizhiwang/LLaVA-Llama-3-8B
Apache License 2.0

tokenization mismatch #1

Open dingtine opened 2 months ago

dingtine commented 2 months ago

When I trained llava-llama3 using your code, the log printed the tokenization mismatch warnings below. How can I fix this? Thanks!

WARNING: tokenization mismatch: 55 vs. 54. (ignored)
WARNING: tokenization mismatch: 60 vs. 59. (ignored)
WARNING: tokenization mismatch: 62 vs. 61. (ignored)
WARNING: tokenization mismatch: 64 vs. 63. (ignored)
WARNING: tokenization mismatch: 61 vs. 60. (ignored)
WARNING: tokenization mismatch: 57 vs. 56. (ignored)
WARNING: tokenization mismatch: 58 vs. 57. (ignored)
WARNING: tokenization mismatch: 59 vs. 58. (ignored)
WARNING: tokenization mismatch: 60 vs. 59. (ignored)
WARNING: tokenization mismatch: 58 vs. 57. (ignored)
WARNING: tokenization mismatch: 66 vs. 65. (ignored)
WARNING: tokenization mismatch: 57 vs. 56. (ignored)
WARNING: tokenization mismatch: 52 vs. 51. (ignored)
WARNING: tokenization mismatch: 59 vs. 58. (ignored)
WARNING: tokenization mismatch: 70 vs. 69. (ignored)
WARNING: tokenization mismatch: 67 vs. 66. (ignored)
WARNING: tokenization mismatch: 56 vs. 55. (ignored)

Victorwz commented 2 months ago

Hi, this is caused by incorrect usage of the preprocess function. I have updated the fine-tuning script for LLaVA in scripts/finetune.sh. Please set the parameter --version v3.
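
Roughly, the flag selects the conversation template that the preprocess function uses to build prompts. A minimal sketch (following upstream LLaVA's llava/conversation.py naming; the exact code here may differ slightly):

# Sketch: pick the conversation template that matches --version, so the
# preprocess function builds prompts exactly the way the tokenizer will see them.
from llava import conversation as conversation_lib

def configure_conversation(version: str) -> None:
    if version not in conversation_lib.conv_templates:
        raise ValueError(f"unknown conversation version: {version}")
    conversation_lib.default_conversation = conversation_lib.conv_templates[version]

# With --version v3, the token counts predicted during preprocessing line up
# with what the Llama-3 tokenizer actually produces, so the warnings go away.
configure_conversation("v3")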

dingtine commented 2 months ago

Thanks for the response, it solved my problem.

Here, I have two questions:

  1. During the pretraining process, is it necessary to change the version from PLAIN to v3?
  2. The author mentioned that their tokenizer is LlamaTokenizerFast, but when I switch to LlamaTokenizerFast, the log prints "The tokenizer class you load from this checkpoint is 'PreTrainedTokenizerFast'."

Best thanks!

Victorwz commented 2 months ago
  1. No need to do that. However, you need to make some small modifications to PLAIN, since the LLaMA-3 tokenizer will not add a BOS token as the first token. I have made that change in the PLAIN preprocess function; you can check it there.

  2. You can load the tokenizer via llama3_tokenizer = LlamaTokenizerFast.from_pretrained(...) instead of using the AutoTokenizer class. However, I have no idea whether this will cause bugs; the Llama-3 tokenizer class is quite different from the Hugging Face Llama-2 implementation. That is why I add the BOS and EOS tokens manually (see the sketch below).
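
A minimal sketch of that manual handling, assuming standard transformers behavior (the model path is illustrative, as in the thread below):

# Sketch: load the tokenizer explicitly as LlamaTokenizerFast and check whether
# it prepends a BOS token on its own; if not, add BOS/EOS by hand, as is done
# in the updated PLAIN preprocess function.
from transformers import LlamaTokenizerFast

tokenizer = LlamaTokenizerFast.from_pretrained("meta-llama/Meta-Llama-3-8B-Instruct")

input_ids = tokenizer("describe the image").input_ids
adds_bos = len(input_ids) > 0 and input_ids[0] == tokenizer.bos_token_id

if not adds_bos:
    input_ids = [tokenizer.bos_token_id] + input_ids + [tokenizer.eos_token_id]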

dingtine commented 2 months ago

Thank you, my big brother! :)

chiuwhsin commented 1 month ago

Hello @Victorwz,

I am facing the same "tokenization mismatch" issue, even after setting the --version v3 parameter. Here is the script I used, modified from finetune_lora.sh:

#!/bin/bash

deepspeed llava/train/train_mem.py \
    --lora_enable True --lora_r 128 --lora_alpha 256 --mm_projector_lr 2e-5 \
    --deepspeed ./scripts/zero3.json \
    --model_name_or_path /path/to/my/local/Meta-Llama-3-8B-Instruct  \
    --version v3 \
    --data_path /path/to/my/local/train.json \
    --image_folder /path/to/my/local/image_folder \
    --vision_tower openai/clip-vit-large-patch14-336 \
    --pretrain_mm_mlp_adapter /clone/from/huggingface/provided/by/the/description/llava-v1.5-llama-3-8b-pretrain-clip-large-336px/mm_projector.bin \
    --mm_projector_type mlp2x_gelu \
    --mm_vision_select_layer -2 \
    --mm_use_im_start_end False \
    --mm_use_im_patch_token False \
    --image_aspect_ratio pad \
    --group_by_modality_length True \
    --bf16 True \
    --output_dir /path/to/my/local/llava_llama3_8b_checkpoints \
    --num_train_epochs 1 \
    --per_device_train_batch_size 1 \
    --per_device_eval_batch_size 1 \
    --gradient_accumulation_steps 2 \
    --evaluation_strategy "no" \
    --save_strategy "steps" \
    --save_steps 50000 \
    --save_total_limit 1 \
    --learning_rate 2e-4 \
    --weight_decay 0. \
    --warmup_ratio 0.03 \
    --lr_scheduler_type "cosine" \
    --logging_steps 1 \
    --tf32 True \
    --model_max_length 1024 \
    --gradient_checkpointing True \
    --dataloader_num_workers 16 \
    --lazy_preprocess True \
    --report_to wandb

Note that I am using the LoRA version of the finetuning script on my own custom dataset. Any insights on what might be causing this issue would be greatly appreciated.

Victorwz commented 1 month ago

Hi, when did you git clone the Llama-3 model? The Hugging Face team reconstructed the tokenizer about a week after the original release of Llama-3.
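
If your clone predates that fix, one way to refresh it is a forced re-download (a sketch using huggingface_hub; running git pull inside the existing clone should also work):

# Sketch: force a fresh download so stale tokenizer files are replaced.
from huggingface_hub import snapshot_download

snapshot_download(
    repo_id="meta-llama/Meta-Llama-3-8B-Instruct",
    local_dir="/path/to/my/local/Meta-Llama-3-8B-Instruct",  # path as in the script above
    force_download=True,
)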

chiuwhsin commented 1 month ago

Hello @Victorwz, thanks for the speedy response! It turns out that was the issue. Everything is running smoothly now. Thanks a bunch :)