42Shawn / LLaVA-PruMerge

LLaVA-PruMerge: Adaptive Token Reduction for Efficient Large Multimodal Models

ScienceQA metrics are far too low #7

Open zhangbaijin opened 4 months ago

zhangbaijin commented 4 months ago

Thank you very much for your open-source work, but I ran into a bug when downloading your pretrained model, so I downloaded the llava-7b-lora model instead (the same problem as in issue 5). The command line I used to test ScienceQA is as follows:

python -m llava.eval.model_vqa_science \
    --model-base /mnt/xiaofeng.zxf/models/vicuna-7b-v1.5 \
    --model-path /mnt/xiaofeng.zxf/code/LLaVA-PruMerge/llava-v1.5-7b-lora \
    --question-file /mnt/xiaofeng.zxf/llava_test_CQM-A_image.json \
    --image-folder /mnt/xiaofeng.zxf/ScienceQA_DATA/test \
    --answers-file ./llava-v1.5-7b.jsonl \
    --temperature 0 \
    --conv-mode vicuna_v1

However, the result is only IMG-Accuracy: 9.27%. Is your pretrained model correct?
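One quick way to tell whether the eval pipeline or the checkpoint is at fault is to inspect the raw predictions before they are scored: accuracy at random-guess level usually means the scorer cannot map the generated text to an option letter at all. A minimal sketch, assuming the answers file follows the format written by llava.eval.model_vqa_science (one JSON record per line with question_id and text fields):

import json

# Spot-check the first few predictions in the answers file produced above.
# If the text looks like garbage or never contains an option letter,
# the checkpoint was likely not loaded correctly rather than the eval being set up wrong.
with open("./llava-v1.5-7b.jsonl") as f:
    for _, line in zip(range(5), f):
        rec = json.loads(line)
        print(rec.get("question_id"), repr(rec.get("text")))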

42Shawn commented 4 months ago

IMG-Accuracy: 9.27% is far too low; random guessing gets about that. You must have set up the eval incorrectly.

zhangbaijin commented 4 months ago

I initially downloaded the pretrained model you provided, but testing it raised errors. After seeing the same problem in issue 5, I switched to the original llava-1.5-lora pretrained model and got this result. The test also takes over 40 minutes, which is strange. So I want to confirm: is the model you provided complete?

mu-cai commented 3 months ago

Hi, we updated the script. Are you able to reproduce results now?

liuxiaozhu01 commented 2 months ago

Hi, @42Shawn @mu-cai I just encountered a similar issue. I tried to evaluate on TextVQA without fine-tuning. The script I ran is scripts/v1_5/eval/textvqa.sh:

CUDA_VISIBLE_DEVICES=0 python -m llava.eval.model_vqa_loader \
    --model-base /root/home/workspace/LLM/vicuna/lmsys/vicuna-7b-v1.5 \
    --model-path /root/home/workspace/LLM/llava/llava-v1.5-7b \
    --question-file ./playground/data/eval/textvqa/llava_textvqa_val_v051_ocr.jsonl \
    --image-folder ./playground/data/eval/textvqa/train_images \
    --answers-file ./playground/data/eval/textvqa/answers/llava-v1.5-7b.jsonl \
    --temperature 0 \
    --conv-mode vicuna_v1

python -m llava.eval.eval_textvqa \
    --annotation-file ./playground/data/eval/textvqa/TextVQA_0.5.1_val.json \
    --result-file ./playground/data/eval/textvqa/answers/llava-v1.5-7b.jsonl

--model-base /root/home/workspace/LLM/vicuna/lmsys/vicuna-7b-v1.5 is the vicuna-7b-v1.5 model downloaded from Hugging Face, and --model-path /root/home/workspace/LLM/llava/llava-v1.5-7b is the original LLaVA checkpoint downloaded from here. The output is shown below:

Loading LLaVA from base model...
Loading checkpoint shards: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 2/2 [00:53<00:00, 26.54s/it]
Some weights of LlavaLlamaForCausalLM were not initialized from the model checkpoint at /root/home/workspace/LLM/vicuna/lmsys/vicuna-7b-v1.5 and are newly initialized: ['model.mm_projector.2.weight', 'model.mm_projector.2.bias', 'model.mm_projector.0.bias', 'model.mm_projector.0.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
/root/miniconda3/envs/llava/lib/python3.10/site-packages/transformers/generation/configuration_utils.py:362: UserWarning: `do_sample` is set to `False`. However, `temperature` is set to `0.9` -- this flag is only used in sample-based generation modes. You should set `do_sample=True` or unset `temperature`. This was detected when initializing the generation config instance, which means the corresponding file may hold incorrect parameterization and should be fixed.
  warnings.warn(
/root/miniconda3/envs/llava/lib/python3.10/site-packages/transformers/generation/configuration_utils.py:367: UserWarning: `do_sample` is set to `False`. However, `top_p` is set to `0.6` -- this flag is only used in sample-based generation modes. You should set `do_sample=True` or unset `top_p`. This was detected when initializing the generation config instance, which means the corresponding file may hold incorrect parameterization and should be fixed.
  warnings.warn(
You are resizing the embedding layer without providing a `pad_to_multiple_of` parameter. This means that the new embeding dimension will be 32000. This might induce some performance reduction as *Tensor Cores* will not be available. For more details  about this, or help on choosing the correct value for resizing, refer to this guide: https://docs.nvidia.com/deeplearning/performance/dl-performance-matrix-multiplication/index.html#requirements-tc
  0%|                                                                                                                                                               | 0/5000 [00:00<?, ?it/s]/root/miniconda3/envs/llava/lib/python3.10/site-packages/transformers/generation/configuration_utils.py:362: UserWarning: `do_sample` is set to `False`. However, `temperature` is set to `0.0` -- this flag is only used in sample-based generation modes. You should set `do_sample=True` or unset `temperature`.
  warnings.warn(
/root/miniconda3/envs/llava/lib/python3.10/site-packages/transformers/generation/configuration_utils.py:367: UserWarning: `do_sample` is set to `False`. However, `top_p` is set to `None` -- this flag is only used in sample-based generation modes. You should set `do_sample=True` or unset `top_p`.
  warnings.warn(
100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 5000/5000 [1:03:37<00:00,  1.31it/s]
llava-v1.5-7b
100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 5000/5000 [00:02<00:00, 2149.71it/s]
Samples: 5000
Accuracy: 5.29%

Could anyone help? Thanks in advance.
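The warning in this log is the key clue: "Some weights ... were not initialized from the model checkpoint ... newly initialized: ['model.mm_projector...']" means the multimodal projector was not found in the base vicuna weights. A minimal sketch for checking whether the merged checkpoint at --model-path actually contains projector weights, assuming it is sharded as pytorch_model-*.bin with a pytorch_model.bin.index.json (adjust the filename for safetensors checkpoints):

import json
import os

# Hypothetical sanity check, not part of the repo: list the mm_projector tensors
# recorded in the checkpoint's shard index. If they exist only inside the sharded
# LLM checkpoint (rather than in a standalone mm_projector.bin), a loader that takes
# the LLM from --model-base may never read them, leaving the projector randomly
# initialized, which matches the warning above.
ckpt_dir = "/root/home/workspace/LLM/llava/llava-v1.5-7b"  # the --model-path used above
index_path = os.path.join(ckpt_dir, "pytorch_model.bin.index.json")

with open(index_path) as f:
    weight_map = json.load(f)["weight_map"]

projector_keys = [k for k in weight_map if "mm_projector" in k]
print(f"{len(projector_keys)} projector tensors found:")
for k in projector_keys:
    print(" ", k)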

liuxiaozhu01 commented 2 months ago

Well, when --model-base is omitted, the results become normal. Once --model-base is given, the LLM parameters are loaded from model_base. It seems that using the original LLM parameters (from lmsys/vicuna-7b-v1.5 here) hurts LLaVA's performance and also slows down inference. WHY?
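A minimal sketch of one plausible explanation, assuming a LLaVA-style loader along the lines of llava/model/builder.py::load_pretrained_model (the sketch below is illustrative, not the repo's actual code): when model_base is given together with a non-LoRA model_path, the LLM weights are taken from model_base and model_path is expected to contain only the projector, so pointing it at a fully merged checkpoint throws away that checkpoint's fine-tuned LLM and can leave the projector freshly initialized, which matches the warning and the near-random accuracy.

def plan_loading(model_path: str, model_base: str | None) -> str:
    """Describe which weights a LLaVA-style loader would end up using (illustrative only)."""
    if model_base is None:
        # Fully merged checkpoint: everything comes from model_path.
        return f"LLM + projector from {model_path}"
    if "lora" in model_path.lower():
        # LoRA checkpoint: base LLM plus merged adapters and projector from model_path.
        return f"LLM from {model_base}, LoRA adapters + projector merged from {model_path}"
    # Non-LoRA path with a base model: model_path is treated as a projector-only delta,
    # so a merged checkpoint's fine-tuned LLM weights are never used.
    return f"LLM from {model_base}; only projector weights expected in {model_path}"

# The failing TextVQA run above falls into the last case:
print(plan_loading("/root/home/workspace/LLM/llava/llava-v1.5-7b",
                   "/root/home/workspace/LLM/vicuna/lmsys/vicuna-7b-v1.5"))
# Dropping --model-base moves it to the first case, which is why the results recover.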