Open dingtine opened 2 months ago
Hi, this is caused by incorrect use of the preprocess function. I have updated the script for fine-tuning LLaVA in scripts/finetune.sh. Please set the parameter --version v3.
Thanks for the response, it solved my problem.
I also have two questions to ask.
Many thanks!
- No need to do that. However, you need to make a small modification to PLAIN, because the LLaMA3 tokenizer does not add the bos token as the first token. I have made that change in the PLAIN preprocess function; you can check it there.
- You can load the tokenizer via
llama3_tokenizer = LlamaTokenizerFast.from_pretrained
instead of using the AutoTokenizer class. However, I have no idea whether this will cause bugs. The llama3 tokenizer class is quite different from the Hugging Face llama2 implementation, so I add the bos and eos tokens manually.
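As a rough illustration of what adding the bos and eos tokens manually can look like, here is a minimal sketch that operates on plain lists of token IDs. The helper name `add_bos_eos` and the ID values are hypothetical and are not taken from the repository; in practice the IDs would be read from the loaded tokenizer rather than hard-coded.

```python
from typing import List


def add_bos_eos(token_ids: List[int], bos_id: int, eos_id: int) -> List[int]:
    """Return a copy of token_ids that starts with bos_id and ends with eos_id.

    A token that is already present at either end is not duplicated.
    """
    ids = list(token_ids)
    if not ids or ids[0] != bos_id:
        ids.insert(0, bos_id)
    if ids[-1] != eos_id:
        ids.append(eos_id)
    return ids


# Illustrative values only (assumption): read the real IDs from
# tokenizer.bos_token_id and tokenizer.eos_token_id.
BOS_ID, EOS_ID = 128000, 128001
print(add_bos_eos([9906, 1917], BOS_ID, EOS_ID))
```

Checking the ends before inserting keeps the helper safe to call on sequences that already carry the special tokens, which matters if tokenizer behavior differs between versions.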
Thank you, my big brother! :)
Hello @Victorwz,
I am facing the same "tokenization mismatch" issue, even after setting the --version v3 parameter. Here is the script I used, modified from finetune_lora.sh:
#!/bin/bash
deepspeed llava/train/train_mem.py \
--lora_enable True --lora_r 128 --lora_alpha 256 --mm_projector_lr 2e-5 \
--deepspeed ./scripts/zero3.json \
--model_name_or_path /path/to/my/local/Meta-Llama-3-8B-Instruct \
--version v3 \
--data_path /path/to/my/local/train.json \
--image_folder /path/to/my/local/image_folder \
--vision_tower openai/clip-vit-large-patch14-336 \
--pretrain_mm_mlp_adapter /clone/from/huggingface/provided/by/the/description/llava-v1.5-llama-3-8b-pretrain-clip-large-336px/mm_projector.bin \
--mm_projector_type mlp2x_gelu \
--mm_vision_select_layer -2 \
--mm_use_im_start_end False \
--mm_use_im_patch_token False \
--image_aspect_ratio pad \
--group_by_modality_length True \
--bf16 True \
--output_dir /path/to/my/local/llava_llama3_8b_checkpoints \
--num_train_epochs 1 \
--per_device_train_batch_size 1 \
--per_device_eval_batch_size 1 \
--gradient_accumulation_steps 2 \
--evaluation_strategy "no" \
--save_strategy "steps" \
--save_steps 50000 \
--save_total_limit 1 \
--learning_rate 2e-4 \
--weight_decay 0. \
--warmup_ratio 0.03 \
--lr_scheduler_type "cosine" \
--logging_steps 1 \
--tf32 True \
--model_max_length 1024 \
--gradient_checkpointing True \
--dataloader_num_workers 16 \
--lazy_preprocess True \
--report_to wandb
Note that I am using the LoRA version of the finetuning script on my own custom dataset. Any insights on what might be causing this issue would be greatly appreciated.
Hi, when did you git clone the Llama-3 model? The Hugging Face team rebuilt the tokenizer a week after the original release of Llama-3.
Hello @Victorwz, thanks for the speedy response! It turns out that was the issue. Everything's running smoothly now. Thanks a bunch :)
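Since the fix here came down to which tokenizer files were on disk, a quick sanity check is to see whether the loaded tokenizer puts the bos token first. The sketch below is only an illustration: `prepends_bos` is a hypothetical helper, and the two toy encoders stand in for a real tokenizer so the snippet runs without downloading a model.

```python
from typing import Callable, List


def prepends_bos(encode: Callable[[str], List[int]], bos_id: int,
                 sample: str = "hello") -> bool:
    """Return True if encode(sample) starts with bos_id."""
    ids = encode(sample)
    return len(ids) > 0 and ids[0] == bos_id


# Toy encoders (assumption, for illustration only): one prepends ID 1 as bos,
# the other returns the raw character codes without any special token.
def toy_encode_with_bos(text: str) -> List[int]:
    return [1] + [ord(c) for c in text]


def toy_encode_without_bos(text: str) -> List[int]:
    return [ord(c) for c in text]


print(prepends_bos(toy_encode_with_bos, bos_id=1))
print(prepends_bos(toy_encode_without_bos, bos_id=1))
```

With a Hugging Face tokenizer, the same check could be run as `prepends_bos(lambda s: tokenizer(s).input_ids, tokenizer.bos_token_id)`; whether the result is True depends on the tokenizer files and library version in use.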
When I trained llava-llama3 using your code, the log printed tokenization mismatch warnings as below. How can I fix it? Thanks!
WARNING: tokenization mismatch: 55 vs. 54. (ignored)
WARNING: tokenization mismatch: 60 vs. 59. (ignored)
WARNING: tokenization mismatch: 62 vs. 61. (ignored)
WARNING: tokenization mismatch: 64 vs. 63. (ignored)
WARNING: tokenization mismatch: 61 vs. 60. (ignored)
WARNING: tokenization mismatch: 57 vs. 56. (ignored)
WARNING: tokenization mismatch: 58 vs. 57. (ignored)
WARNING: tokenization mismatch: 59 vs. 58. (ignored)
WARNING: tokenization mismatch: 60 vs. 59. (ignored)
WARNING: tokenization mismatch: 58 vs. 57. (ignored)
WARNING: tokenization mismatch: 66 vs. 65. (ignored)
WARNING: tokenization mismatch: 57 vs. 56. (ignored)
WARNING: tokenization mismatch: 52 vs. 51. (ignored)
WARNING: tokenization mismatch: 59 vs. 58. (ignored)
WARNING: tokenization mismatch: 70 vs. 69. (ignored)
WARNING: tokenization mismatch: 67 vs. 66. (ignored)
WARNING: tokenization mismatch: 56 vs. 55. (ignored)
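Every pair in this log differs by exactly one, which would be consistent with a single special token (for example bos) being counted on one side only; that reading is an assumption, not something the log itself proves. A small sketch for summarizing the offsets in a longer log follows. The helper name `mismatch_offsets` is hypothetical, and the pattern simply matches the warning format shown above.

```python
import re
from collections import Counter

# Matches the two counts in lines such as
# "WARNING: tokenization mismatch: 55 vs. 54. (ignored)".
PATTERN = re.compile(r"tokenization mismatch: (\d+) vs\. (\d+)")


def mismatch_offsets(log_text: str) -> Counter:
    """Count how often each (first - second) difference appears in the log."""
    return Counter(int(a) - int(b) for a, b in PATTERN.findall(log_text))


log = (
    "WARNING: tokenization mismatch: 55 vs. 54. (ignored) "
    "WARNING: tokenization mismatch: 60 vs. 59. (ignored) "
    "WARNING: tokenization mismatch: 62 vs. 61. (ignored)"
)
print(mismatch_offsets(log))
```

If the counter shows a single constant offset, the cause is more likely systematic (tokenizer version or special-token handling) than a problem with individual samples.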