EvolvingLMMs-Lab / lmms-eval

Accelerating the development of large multimodal models (LMMs) with lmms-eval
https://lmms-lab.github.io/

qwenvl-7b evaluate refcoco|+|g: CIDEr and IoU are all None #98

Open AderonHuang opened 3 weeks ago

AderonHuang commented 3 weeks ago

accelerate launch --main_process_port=29501 --num_processes=8 -m lmms_eval --model qwen_vl --model_args pretrained=/Qwen-VL/ --tasks refcoco,refcoco+,refcocog,refcoco_bbox_rec,refcoco+_bbox_rec,refcocog_bbox_rec --batch_size 1 --log_samples --log_samples_suffix qwenvl --output_path ./logs/

Qwen-VL-7B evaluation on refcoco|+|g: CIDEr and IoU are all None.

[screenshot: evaluation output with CIDEr and IoU reported as None]

Can you help to solve this problem?

Thank you for your contributions!

AderonHuang commented 3 weeks ago

@Luodian

AderonHuang commented 3 weeks ago

Please check, thanks~ @kcz358

Luodian commented 3 weeks ago

Can you check the logs? You can see the output samples there, and I guess it's because the model response doesn't match the ground truth.

You could use `--log_samples --log_samples_suffix=debug --output_path=./logs/ --verbosity=DEBUG` to trigger the log mode.
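To make that debug step concrete, here is a minimal sketch of inspecting the file that `--log_samples` writes. The path below is only a placeholder (the real filename depends on `--output_path`, the task name, and `--log_samples_suffix`), and the top-level layout of the JSON can differ between versions, so the snippet just normalizes to a list of records and prints a few of them so the model responses can be eyeballed against the ground truth:

```python
import json
from pathlib import Path

# Illustrative path only: the real filename depends on --output_path,
# the task name, and --log_samples_suffix.
log_file = Path("./logs/refcoco_bbox_rec_debug_samples.json")

data = json.loads(log_file.read_text())

# The per-sample log may be a bare list or wrapped in a dict, depending on
# the lmms-eval version; normalize to a list of records either way.
if isinstance(data, list):
    records = data
else:
    lists = [v for v in data.values() if isinstance(v, list)]
    records = lists[0] if lists else []

# Print the first few records so the raw model responses can be compared
# against the targets by eye (e.g. to spot an unparseable box format).
for record in records[:5]:
    print(json.dumps(record, indent=2, default=str)[:1500])
    print("-" * 80)
```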

AderonHuang commented 3 weeks ago

> Can you check the logs? You can see the output samples there, and I guess it's because the model response doesn't match the ground truth.
>
> You could use `--log_samples --log_samples_suffix=debug --output_path=./logs/ --verbosity=DEBUG` to trigger the log mode.

LLaVA-1.5 evaluates to the right result, similar to its paper, so I think it may not be a problem with the ground truth. Have you evaluated the Qwen-VL-7B model on the refcoco|+|g REC tasks or the CIDEr task? Is it consistent with the Qwen-VL paper? [screenshot: official LLaVA-1.5 refcoco results]

Luodian commented 3 weeks ago

We didn't match it with the original paper on the refcoco task. But we did match it on other tasks like AI2D, so I guess the model implementation is correct, but the prompting strategy is a little different.
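A plausible concrete reading of that mismatch: Qwen-VL emits grounding results in its own markup, `<box>(x1,y1),(x2,y2)</box>` with coordinates on a 0-1000 grid, so a scorer that expects plain pixel-coordinate boxes finds nothing to parse and the IoU comes out as None. A minimal sketch of converting that markup into pixel boxes, assuming the 0-1000 convention from the Qwen-VL report and a hypothetical helper name (verify against your actual model outputs), could look like this:

```python
import re

def parse_qwen_vl_box(response: str, img_w: int, img_h: int):
    """Extract the first <box>(x1,y1),(x2,y2)</box> from a Qwen-VL response
    and rescale it from the assumed 0-1000 grid to pixel coordinates.

    Returns [x1, y1, x2, y2] in pixels, or None if no box is found.
    """
    m = re.search(r"<box>\((\d+),(\d+)\),\((\d+),(\d+)\)</box>", response)
    if m is None:
        return None
    x1, y1, x2, y2 = (int(v) for v in m.groups())
    # Assumption: Qwen-VL boxes are normalized to a 0-1000 grid (per the
    # Qwen-VL report), so rescale them to the actual image size.
    return [x1 / 1000 * img_w, y1 / 1000 * img_h,
            x2 / 1000 * img_w, y2 / 1000 * img_h]

# Example: a typical grounding response for a 640x480 image.
resp = "<ref>the dog</ref><box>(102,215),(540,890)</box>"
print(parse_qwen_vl_box(resp, img_w=640, img_h=480))
```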

AderonHuang commented 3 weeks ago

> We didn't match it with the original paper on the refcoco task. But we did match it on other tasks like AI2D, so I guess the model implementation is correct, but the prompting strategy is a little different.

Thanks for your reply. 1) If I want to match Qwen-VL with the original paper on the refcoco task, where and how should I change the prompting strategy to get results similar to the Qwen-VL paper? 2) Are there other models, like Fuyu etc. (besides the LLaVA model), that have been tested on the refcoco task?

Luodian commented 3 weeks ago

1) I am not quite sure about the prompting strategy of Qwen-VL on this task. 2) We also did not test Fuyu because of the prompting issue. Our most-tested models are LLaVA, InternVL, and the commercial models (GPT-4V, Gemini, Claude, etc.), to show their best performance so that our evaluation toolkit can be used as a standard for comparing open models with commercial ones.
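If someone does want to experiment with matching Qwen-VL's grounding prompt, the place to change the prompt in lmms-eval is the task's `doc_to_text`, which lives under `lmms_eval/tasks/refcoco/` and is wired up through the task YAML. The sketch below is only a hypothetical override, not the repo's actual code: the dataset field name and the exact prompt wording Qwen-VL was trained with are both assumptions.

```python
# Hypothetical doc_to_text override for a refcoco REC task in lmms-eval.
# "answer" as the field holding the referring expression is an assumption;
# check the actual dataset schema before relying on it.
def refcoco_bbox_rec_doc_to_text(doc) -> str:
    expression = doc["answer"]
    # Qwen-VL's grounding convention wraps the expression in <ref>...</ref>
    # and expects a <box>(x1,y1),(x2,y2)</box> answer in return.
    return f"<ref>{expression}</ref>"

# Quick sanity check with a fake document.
print(refcoco_bbox_rec_doc_to_text({"answer": "the dog on the left"}))
```

Whether this recovers the paper's numbers depends on reproducing the exact prompt format used during Qwen-VL's grounding training, which this thread does not pin down.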

AderonHuang commented 3 weeks ago
> 1. I am not quite sure about the prompting strategy of Qwen-VL on this task.
> 2. We also did not test Fuyu because of the prompting issue. Our most-tested models are LLaVA, InternVL, and the commercial models (GPT-4V, Gemini, Claude, etc.), to show their best performance so that our evaluation toolkit can be used as a standard for comparing open models with commercial ones.

Can you share your LLaVA-1.5 REC task results on refcoco|+|g? This is our official LLaVA-1.5 evaluation result: [screenshot: official LLaVA-1.5 refcoco results]

kcz358 commented 2 weeks ago

Hi @AderonHuang, here are the logs from our refcoco+ run with LLaVA-1.5-7B, but note that we evaluated this quite a while ago.

refcoco+_bbox_testA.json

You can see the full score here

results (1).json

AderonHuang commented 1 week ago

> Hi @AderonHuang, here are the logs from our refcoco+ run with LLaVA-1.5-7B, but note that we evaluated this quite a while ago.
>
> refcoco+_bbox_testA.json
>
> You can see the full score here
>
> results (1).json

Thanks for your brilliant work. I have solved my problem.

AderonHuang commented 1 week ago

> Hi @AderonHuang, here are the logs from our refcoco+ run with LLaVA-1.5-7B, but note that we evaluated this quite a while ago.
>
> refcoco+_bbox_testA.json
>
> You can see the full score here
>
> results (1).json

Hi, I also have another question: why can't the refcoco|+|g evaluation run inference with batch size > 1? I launch it like this every time: `accelerate launch --main_process_port=29501 --num_processes=8 -m lmms_eval --model llava --model_args pretrained=/path,use_flash_attention_2=False,device_map="" --tasks refcoco,refcoco_bbox_rec --batch_size 1 --log_samples`. If I want to use batch size > 1, what should I change?

kcz358 commented 6 days ago

Hi @AderonHuang, batch size > 1 is not supported for the original llava model. If you want to evaluate LLaVA with batch size > 1, you might want to take a look at llava_sglang.

AderonHuang commented 6 days ago

> Hi @AderonHuang, batch size > 1 is not supported for the original llava model. If you want to evaluate LLaVA with batch size > 1, you might want to take a look at llava_sglang.

However, llava_sglang does not seem to support multiple processes yet. When I run `accelerate launch --main_process_port=29501 --num_processes=8 -m lmms_eval --model llava_sglang --model_args pretrained=/path,use_flash_attention_2=False,device_map="" --tasks refcoco,refcoco_bbox_rec --batch_size 1 --log_samples`, it fails with `assert accelerator.num_processes == 1, "Llava-sglang does not support multi-processes yet"`. Is that expected?

kcz358 commented 6 days ago

Yes, multi-process is not supported by sglang, but you can use tensor parallelism to shard the model across multiple GPUs. This allows you to run batch sizes larger than the number of GPUs and accelerates the evaluation pipeline. You might want to take a look at #54 for more details.