bfshi / scaling_on_scales

When do we not need larger vision models?

How to infer Llava with S^2? #13

Open Gumpest opened 2 weeks ago

Gumpest commented 2 weeks ago

It doesn't seem to work with the following script:

#!/bin/bash
export CUDA_VISIBLE_DEVICES=0

python3 -m llava.eval.model_vqa_loader \
    --model-path bfshi/llava-v1.5-7b-s2-lora \
    --model-base liuhaotian/llava-v1.5-7b \
    --question-file ./playground/data/eval/MME/llava_mme.jsonl \
    --image-folder ./playground/data/eval/MME/MME_Benchmark_release_version \
    --answers-file ./playground/data/eval/MME/answers/llava-v1.5-s2.jsonl \
    --temperature 0 \
    --conv-mode vicuna_v1

cd ./playground/data/eval/MME

python3 convert_answer_to_mme.py --experiment llava-v1.5-s2

cd eval_tool

python3 calculation.py --results_dir answers/llava-v1.5-s2
bfshi commented 2 weeks ago

Can you try using lmsys/vicuna-7b-v1.5 as the base model?
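
As a minimal sketch, assuming everything else in your evaluation script stays the same, the only change would be the --model-base flag:

python3 -m llava.eval.model_vqa_loader \
    --model-path bfshi/llava-v1.5-7b-s2-lora \
    --model-base lmsys/vicuna-7b-v1.5 \
    --question-file ./playground/data/eval/MME/llava_mme.jsonl \
    --image-folder ./playground/data/eval/MME/MME_Benchmark_release_version \
    --answers-file ./playground/data/eval/MME/answers/llava-v1.5-s2.jsonl \
    --temperature 0 \
    --conv-mode vicuna_v1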

Gumpest commented 2 weeks ago

Thanks a lot! I have trained LLaVA with S^2 w/o LoRA, and its accuracy on TextVQA is higher (43.99 --> 45.72). However, inference takes about 4x longer (08:38 --> 37:57). I'm wondering why that is. @bfshi

bfshi commented 2 weeks ago

Interesting. When you say accuracy is higher and inference is slower, are you comparing against LLaVA w/ S^2 and w/ LoRA, or against LLaVA w/o S^2? If it's the former, then using LoRA or not shouldn't affect inference speed by itself. One possible reason is that the model trained w/o LoRA tends to output longer responses than the model w/ LoRA, which would lengthen the time to answer each TextVQA question and increase the overall inference time. It may be worth checking whether that's the reason.
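
A rough way to check is to compare the average answer length between the two runs. Here is a minimal sketch, assuming the answers are JSONL files with the generated text stored under a "text" key and that jq is installed; the file names below are placeholders, so point them at the actual TextVQA answer files from the w/ LoRA and w/o LoRA runs:

# Rough check: average words per generated answer (assumes one JSON object per line
# and single-line answers; multi-line answers would make the estimate slightly off).
for f in answers/llava-v1.5-s2.jsonl answers/llava-v1.5-s2-lora.jsonl; do
    printf '%s: ' "$f"
    jq -r '.text' "$f" | awk '{ words += NF } END { printf "%.1f words/answer on average\n", words / NR }'
done

If the w/o LoRA model's answers are substantially longer on average, that would explain most of the extra inference time.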