Same question. I got MME perception 1019, cognition 243, and MM-Vet 28.5. Here is my setting:
deepspeed llava/train/train_mem.py \
--deepspeed ./scripts/zero3.json \
--model_name_or_path ./ckpts/vicuna-7b-v1.5 \
--version v1 \
--data_path ./data_dir/other_instruction/lvis_instruct4v_220k.json \
--image_folder ./data_dir/ \
--vision_tower openai/clip-vit-large-patch14-336 \
--pretrain_mm_mlp_adapter ./ckpts/llava-v1.5-mlp2x-336px-pretrain-vicuna-7b-v1.5/mm_projector.bin \
--mm_projector_type mlp2x_gelu \
--mm_vision_select_layer -2 \
--mm_use_im_start_end False \
--mm_use_im_patch_token False \
--image_aspect_ratio pad \
--group_by_modality_length True \
--bf16 True \
--output_dir ./ckpts/llava-v1.5-7b_instruct4v_220k \
--run_name llava-v1.5-7b_instruct4v_220k \
--num_train_epochs 1 \
--per_device_train_batch_size 16 \
--per_device_eval_batch_size 4 \
--gradient_accumulation_steps 2 \
--evaluation_strategy "no" \
--save_strategy "steps" \
--save_steps 50000 \
--save_total_limit 1 \
--learning_rate 2e-5 \
--weight_decay 0. \
--warmup_ratio 0.03 \
--lr_scheduler_type "cosine" \
--logging_steps 1 \
--tf32 True \
--model_max_length 2048 \
--gradient_checkpointing True \
--dataloader_num_workers 4 \
--lazy_preprocess True \
--report_to wandb
Hi all, thanks for your interest in our work. To achieve satisfactory results on the QA benchmarks, you should mix our LVIS-INSTRUCT4V with the data from academic tasks (see Tables 1 and 7 of the LLaVA-1.5 paper). Note that the results in our paper are all trained on a mixture of LVIS-INSTRUCT4V and these data, as clarified in Sec. 4.1.
We also provide the mixed data here.
Hi, thank you for your reply! Another question: I mixed your LVIS-INSTRUCT4V-220k dataset with llava_v1_5_mix665k, giving 885k samples in total, but the fine-tuning results are no better than using llava_v1_5_mix665k alone (the baseline) and also fall short of your reported numbers. I need some help.
LVIS-INSTRUCT4V-220k + llava_v1_5_mix665k: MME perception 1458, cognition 314; MMB 0.658; MMB-CN 0.606; GQA 59%; POPE 0.849; SQA 69%; TextVQA 57.7%
LLaVA-1.5-7b reproduction: MME perception 1490, cognition 309; MMB 0.658; MMB-CN 0.606; GQA 62.5%; POPE 0.85; SQA 68.57%; TextVQA 58.82%
Setting: the default finetune.sh setting.
Hi, the results are weird. Could you please share your experiment settings (e.g., PyTorch version, GPUs) and loss curve with us? In addition, I think there are two ways to debug your problem efficiently: (1) first evaluate these benchmarks with our provided checkpoint; (2) train on the data split that we released today (a mixture of LVIS-INSTRUCT4V and academic data). You can also mix in LLaVA-150K; in our experiments, we append LLaVA-150K after our LVIS-INSTRUCT4V and before the academic data.
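For a rough sketch of that mixing step, assuming each annotation file is a top-level JSON array in the LLaVA conversation format (the file names below are placeholders for whichever JSON files you actually downloaded), the concatenation order described above can be reproduced with jq:
# Concatenate the instruction files in the stated order:
# LVIS-INSTRUCT4V, then LLaVA-150K, then the academic-task data.
jq -s 'add' \
  lvis_instruct4v_220k.json \
  llava_instruct_150k.json \
  academic_task_data.json \
  > mixed_instruct_data.json
# Quick sanity check on the merged sample count.
jq 'length' mixed_instruct_data.json
The resulting file can then be passed to --data_path in the fine-tuning command.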
I fine-tuned LLaVA-v1.5-7b with your LVIS-INSTRUCT4V_mix730k and got an MME perception score of 1525, which is consistent with your result, but an MM-Vet score of 30.6, which is much lower than yours (34.6). Here is my setting:
deepspeed llava/train/train_mem.py \
--deepspeed ./scripts/zero3.json \
--model_name_or_path ./ckpts/vicuna-7b-v1.5 \
--version v1 \
--data_path ./data_dir/other_instruction/lvis_instruct4v_mix730k.json \
--image_folder ./data_dir/ \
--vision_tower openai/clip-vit-large-patch14-336 \
--pretrain_mm_mlp_adapter ./ckpts/llava-v1.5-mlp2x-336px-pretrain-vicuna-7b-v1.5/mm_projector.bin \
--mm_projector_type mlp2x_gelu \
--mm_vision_select_layer -2 \
--mm_use_im_start_end False \
--mm_use_im_patch_token False \
--image_aspect_ratio pad \
--group_by_modality_length True \
--bf16 True \
--output_dir ./ckpts/llava-v1.5-7b_instruct4v_mix730k \
--run_name llava-v1.5-7b_instruct4v_mix730k \
--num_train_epochs 1 \
--per_device_train_batch_size 16 \
--per_device_eval_batch_size 4 \
--gradient_accumulation_steps 2 \
--evaluation_strategy "no" \
--save_strategy "steps" \
--save_steps 50000 \
--save_total_limit 1 \
--learning_rate 2e-5 \
--weight_decay 0. \
--warmup_ratio 0.03 \
--lr_scheduler_type "cosine" \
--logging_steps 1 \
--tf32 True \
--model_max_length 2048 \
--gradient_checkpointing True \
--dataloader_num_workers 4 \
--lazy_preprocess True \
--report_to wandb
I use the conda environment that the official LLaVA-v1.5 repo provides, and 4 A100-80G GPUs. By the way, I can reproduce your results with the checkpoint you released. It's really weird :(
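For reference, the global fine-tuning batch size here is num_gpus × per_device_train_batch_size × gradient_accumulation_steps; assuming the commonly used LLaVA-1.5 global batch size of 128, the 4-GPU setting above can be sanity-checked with a quick sketch:
# Global batch size = num_gpus * per_device_train_batch_size * gradient_accumulation_steps.
NUM_GPUS=4
PER_DEVICE_BS=16
GRAD_ACCUM=2
echo $((NUM_GPUS * PER_DEVICE_BS * GRAD_ACCUM))   # 128, matching the usual recipe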
Hi, thank you for your advice; I've tried the ways you suggested. Regarding the evaluation, I used the LVIS-Instruct4v-7b checkpoint, and the results almost matched the ones you provided: MME 1469.3 (yours 1472.9), MMB 67.9 (yours 67.1), GQA 62.6 (yours 62.6), POPE 84.6 (yours 84), SQA 68.3 (yours 70.3), TextVQA 57.5 (yours 57.6). This indicates that my evaluation pipeline is working correctly.
Furthermore, I compared fine-tuning on your dataset against fine-tuning on the llava_mix665k dataset, and the results were nearly equivalent.
I also attempted fine-tuning on lvis_instruct4v_mix730k + llava150k, but the results were not satisfactory: MME 1502.5, MMB 65.8, GQA 58.8, POPE 85.1, SQA 69.3, TextVQA 57.23.
Here are my experiment settings:
GPU: A100, Driver Version: 510.73.08, CUDA Version: 11.8
accelerate==0.24.1
flash-attn==2.3.2
peft==0.5.0
transformers==4.31.0
torch==2.0.1
torchvision==0.15.2
deepspeed==0.10.1
And here is my fine-tuning script:
deepspeed llava/train/train_mem.py \
--deepspeed ./scripts/zero3.json \
--model_name_or_path ./weights/vicuna-7b-v1.5/ \
--version v1 \
--data_path root_dir/lvis_instruct4v_mix730k.json \
--image_folder root_dir/images \
--vision_tower openai/clip-vit-large-patch14-336 \
--pretrain_mm_mlp_adapter weights/llava-v1.5-mlp2x-336px-pretrain-vicuna-7b-v1.5/mm_projector.bin \
--mm_projector_type mlp2x_gelu \
--mm_vision_select_layer -2 \
--mm_use_im_start_end False \
--mm_use_im_patch_token False \
--image_aspect_ratio pad \
--group_by_modality_length True \
--bf16 True \
--output_dir ${output_dir} \
--num_train_epochs 1 \
--per_device_train_batch_size 16 \
--per_device_eval_batch_size 4 \
--gradient_accumulation_steps 1 \
--evaluation_strategy "no" \
--save_strategy "steps" \
--save_steps 50000 \
--save_total_limit 1 \
--learning_rate 2e-5 \
--weight_decay 0. \
--warmup_ratio 0.03 \
--lr_scheduler_type "cosine" \
--logging_steps 1 \
--tf32 True \
--model_max_length 2048 \
--gradient_checkpointing True \
--dataloader_num_workers 4 \
--lazy_preprocess True \
--report_to wandb
Hi, thanks for your patience. We did a thorough investigation last week, and below are our findings:
(1) The official LLaVA repo has a potential bug that affects the "group_by_modality_length" argument (see https://github.com/haotian-liu/LLaVA/issues/857). In our experiments, we used an older version (commit id: e61aa3f88f58f8e871b9c2476d743724e271c776); a sketch of pinning that commit is given after this list. We also tried the newest code and found that some results are lower than those of models trained with the older code.
(2) The "Ours" results in the paper are trained on 619K instructions instead of 730K. For each image, we obtain two sets of instructions from GPT-4V (conversations and detailed descriptions; see the pseudocode for instruction data generation in the paper), so there are 110K conversation instructions and 110K detailed-description instructions in total. The 619K split contains the 110K conversations plus the benchmark data, while the 730K split additionally contains the 110K detailed descriptions. The results are similar on most benchmarks (730K is slightly better); on LLaVA-Bench, however, 619K produces better results. The "Ours-mixLLaVA" results in the paper are trained on 880K samples (730K + 150K from LLaVA-Instruct).
(3) We also fixed mistakes in evaluating SQA (we mistakenly reported overall accuracy instead of image accuracy) and MMB-CN (the LLaVA repo produces mmb/7b.xlsx and mmb-cn/7b.xlsx, which share the same file name, and we found the evaluation server returns wrong numbers, not identical but pretty close, if one file is submitted immediately after another with the same name; a new file name is needed for a correct result. We have reported the bug to the organizer).
(4) About reproduction: we tried multiple times last week with the old code and were able to reproduce the results consistently on 8 A100 GPUs (we did not try 4 A100s, though).
(5) Compared to LLaVA-1.5, we found that LVIS-INSTRUCT4V is more helpful on benchmarks such as MME, MM-Vet, and LLaVA-Bench, which are designed for evaluating multimodal language models. On traditional QA benchmarks, the gains are relatively minor.
We will update the paper accordingly. Finally, we apologize for the mistakes and confusion.
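As referenced in (1), here is a minimal sketch of pinning that older commit, assuming a local clone of the LLaVA repo installed in editable mode:
# Check out the commit mentioned above, then reinstall so the older code is active.
cd LLaVA
git fetch origin
git checkout e61aa3f88f58f8e871b9c2476d743724e271c776
pip install -e .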
Hi, I got poor results when fine-tuning on your dataset with LLaVA-1.5, using the default LLaVA-1.5 training setting. Why is that? Could you please provide your setting?
Only LVIS-INSTRUCT4V: MME perception 915, cognition 244; GQA 0.6% (very low); POPE 0.73; SQA 42.1%; TextVQA 2.84% (very low)
LLaVA-1.5-7b reproduction: MME perception 1490, cognition 309; GQA 62.5%; POPE 0.85; SQA 68.57%; TextVQA 58.82%
Setting: the default finetune.sh setting.