TinyLLaVA / TinyLLaVA_Factory

A Framework of Small-scale Large Multimodal Models
https://arxiv.org/abs/2402.14289
Apache License 2.0

Fine-tuning TinyLLaVA-1.5B on custom text-VQA dataset. #31

Closed soham-joshi closed 7 months ago

soham-joshi commented 8 months ago

I want to LoRA fine-tune TinyLLaVA-1.5B on a custom text-VQA dataset. Could you help me with:

  1. Dataset format: what should the data look like (e.g., do the question-answer pairs need to follow a specific format)?
  2. There are currently two finetune.sh scripts in the project: (i) scripts/tiny_llava/finetune/finetune.sh and (ii) scripts/tiny_llava/finetune.sh

Could you please clarify which script should be used for fine-tuning TinyLLaVA-1.5B with LoRA? @baichuanzhou @huangleiBuaa @eltociear @jiajunlong

Thanks!

baichuanzhou commented 8 months ago

Sure, I have pushed the script for LoRA fine-tuning our model to the dev branch here. In the dev branch, you can also find instructions for custom fine-tuning here. I hope this solves your problem.
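
For reference, the custom-finetuning data follows the LLaVA-style conversation JSON. A minimal, illustrative sketch (file name and contents are made up; field names follow the upstream LLaVA convention, and the exact format is documented in the instructions linked above). "image" is a path relative to --image_folder, and "<image>" marks where the image tokens are inserted into the prompt:

# Minimal sketch of one LLaVA-style instruction record (illustrative example,
# not taken from this thread).
cat > data/custom-vqa-example.json <<'EOF'
[
  {
    "id": "example-0001",
    "image": "example-0001.jpg",
    "conversations": [
      {"from": "human", "value": "<image>\nWhat is written on the sign?"},
      {"from": "gpt", "value": "The sign says \"Open 24 hours\"."}
    ]
  }
]
EOF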

soham-joshi commented 8 months ago

Okay thank you for your response. @baichuanzhou

soham-joshi commented 7 months ago

I tried fine-tuning TinyLLaVA-1.5B (without LoRA) on my custom dataset following the instructions in this script. However, when evaluating the fine-tuned model, I see that the output directory does not contain the 'mm_projector.bin' weights, so I am unable to evaluate it. Could you help me here, please?

@baichuanzhou @huangleiBuaa @eltociear @jiajunlong

Thanks!

baichuanzhou commented 7 months ago

Can I see your training script?

soham-joshi commented 7 months ago

Sure,

#!/bin/bash
### Extracted from https://github.com/DLCV-BUAA/TinyLLaVABench/blob/dev/scripts/tiny_llava/finetune/finetune_lora.sh
### Finetuning TinyLLaVA-1.5B (conv mode v1) full end-to-end

DATA_PATH="data/tvqa-instruct-cleaned-12k-sentence-answer.json"
IMAGE_PATH="data/images/"
OUTPUT_DIR="TinyLLaVA_logs/TinyLLaVA-1.5B-full_ft_TextCaps/"

deepspeed tinyllava/train/train.py \
    --deepspeed ./scripts/tiny_llava/zero3.json \
    --model_name_or_path bczhou/TinyLLaVA-1.5B \
    --version v1 \
    --data_path $DATA_PATH \
    --image_folder $IMAGE_PATH \
    --vision_tower bczhou/TinyLLaVA-1.5B-SigLIP \
    --mm_projector_type mlp2x_gelu \
    --mm_vision_select_layer -2 \
    --mm_use_im_start_end False \
    --mm_use_im_patch_token False \
    --image_aspect_ratio pad \
    --group_by_modality_length False \
    --fp16 True \
    --output_dir $OUTPUT_DIR \
    --num_train_epochs 5 \
    --per_device_train_batch_size 8 \
    --per_device_eval_batch_size 4 \
    --gradient_accumulation_steps 2 \
    --evaluation_strategy "no" \
    --save_strategy "steps" \
    --save_steps 5000 \
    --save_total_limit 1 \
    --learning_rate 2e-5 \
    --weight_decay 0. \
    --warmup_ratio 0.03 \
    --lr_scheduler_type "cosine" \
    --logging_steps 1 \
    --tf32 False \
    --model_max_length 3072 \
    --gradient_checkpointing True \
    --dataloader_num_workers 15 \
    --lazy_preprocess True \
    --report_to wandb

Let me know if there's anything I missed while fine-tuning. Thanks!

baichuanzhou commented 7 months ago

Setting tune_mm_mlp_adapter to True lets you tune only the MLP adapter, which is what causes mm_projector.bin to be saved. Otherwise, you tune both the adapter and the LLM.
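
A minimal sketch of that change, reusing the paths from your script above (abridged; the remaining hyperparameters stay as you had them):

# Sketch: pass --tune_mm_mlp_adapter True so only the multimodal projector is
# trained and mm_projector.bin is written to the output directory.
deepspeed tinyllava/train/train.py \
    --deepspeed ./scripts/tiny_llava/zero3.json \
    --model_name_or_path bczhou/TinyLLaVA-1.5B \
    --version v1 \
    --tune_mm_mlp_adapter True \
    --data_path $DATA_PATH \
    --image_folder $IMAGE_PATH \
    --vision_tower bczhou/TinyLLaVA-1.5B-SigLIP \
    --mm_projector_type mlp2x_gelu \
    --fp16 True \
    --output_dir $OUTPUT_DIR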

soham-joshi commented 7 months ago

Okay, understood @baichuanzhou .

For evaluation, I am using the following script, which expects the mm_projector.bin file.

I am getting this error (screenshot attached).

My eval script:


#!/bin/bash

MODEL_PATH="/mnt/nasfolder/imt2018072/LLMs/TinyLLaVA_logs/TinyLLaVA-1.5B-full_ft_TextCaps/"
MODEL_NAME="TinyLLaVA-1.5B-full_ft_TextCaps"
EVAL_DIR="./playground/data/eval"

python -m tinyllava.eval.model_vqa_loader \
    --model-path $MODEL_PATH \
    --question-file $EVAL_DIR/textvqa/llava_textvqa_val_v051_ocr.jsonl \
    --image-folder $EVAL_DIR/textvqa/train_images \
    --answers-file $EVAL_DIR/textvqa/answers/$MODEL_NAME.jsonl \
    --temperature 0 \
    --model-base "bczhou/TinyLLaVA-1.5B" \
    --conv-mode v1

python -m tinyllava.eval.eval_textvqa \
    --annotation-file $EVAL_DIR/textvqa/TextVQA_0.5.1_val.json \
    --result-file $EVAL_DIR/textvqa/answers/$MODEL_NAME.jsonl

soham-joshi commented 7 months ago

I think it is because I am passing the --model-base and --conv-mode arguments.

baichuanzhou commented 7 months ago

If you pass --model-base as an argument and your model name does not contain 'LoRA', the load_pretrained_model function in builder.py will look for mm_projector.bin. Since I assume you did not pass --tune_mm_mlp_adapter during finetuning, you do not need to pass --model-base during evaluation.
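
For example, a sketch of the adjusted first eval command, reusing the variables from your script above and simply dropping --model-base:

# Sketch: load the fully finetuned checkpoint directly, so builder.py does
# not go looking for a separate mm_projector.bin.
python -m tinyllava.eval.model_vqa_loader \
    --model-path $MODEL_PATH \
    --question-file $EVAL_DIR/textvqa/llava_textvqa_val_v051_ocr.jsonl \
    --image-folder $EVAL_DIR/textvqa/train_images \
    --answers-file $EVAL_DIR/textvqa/answers/$MODEL_NAME.jsonl \
    --temperature 0 \
    --conv-mode v1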

soham-joshi commented 7 months ago

Yes, got it, thank you for the clarification and the prompt responses! Closing this issue. @baichuanzhou