AderonHuang opened 3 weeks ago
@Luodian
@kcz358 please check, thanks~
Can you check the logs? From the output samples, I'd guess the model response doesn't match the ground truth.
You could use --log_samples --log_samples_suffix=debug --output_path=./logs/ --verbosity=DEBUG
to trigger the log mode.
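For example, a full invocation with sample logging enabled might look like the sketch below (a non-runnable command fragment: substitute your own model, tasks, and paths; the flag values here are only illustrative):

```shell
# Sketch: run lmms_eval with per-sample logging so mismatched
# responses vs. ground truth show up in ./logs/ (adjust to your setup)
accelerate launch --num_processes=8 -m lmms_eval \
    --model qwen_vl \
    --model_args pretrained=/Qwen-VL/ \
    --tasks refcoco_bbox_rec \
    --batch_size 1 \
    --log_samples --log_samples_suffix=debug \
    --output_path=./logs/ --verbosity=DEBUG
```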
The LLaVA-1.5 model evaluates to the right result, similar to the paper, so I think it may not be a problem with the ground truth. Have you evaluated the Qwen-VL-7B model on the refcoco|+|g REC tasks or the CIDEr task? Is it consistent with the Qwen-VL paper?
We didn't match the original paper on the refcoco task. But we did match other tasks like AI2D, so I guess the model implementation is correct but the prompting strategy is a little different.
Thanks for your reply. 1) If I want to match Qwen-VL with the original paper on the refcoco task, where and how should I change the prompting strategy to achieve results similar to the Qwen-VL paper? 2) Are there other models like Fuyu etc. (besides the LLaVA models) that have been tested on the refcoco task?
1) I am not sure about the prompting strategy of Qwen-VL on this task.
2) We also did not test Fuyu, because of the same prompting issue. Our most-tested models are LLaVA, InternVL, and commercial models (GPT-4V, Gemini, Claude, etc.), to show their best performance so that our evaluation toolkit can be used as a standard to compare open models with commercial models.
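For what it's worth, one likely source of the mismatch on REC tasks: Qwen-VL emits grounding boxes in its own markup, e.g. `<box>(x1,y1),(x2,y2)</box>` with coordinates normalized to a 0-999 grid, so a scorer expecting plain pixel boxes will see garbage. A minimal parsing sketch (`parse_qwen_box` is a hypothetical helper, not part of lmms_eval):

```python
import re

def parse_qwen_box(text, img_w, img_h):
    """Parse a Qwen-VL style box such as '<box>(123,456),(789,900)</box>'.

    Qwen-VL normalizes coordinates to a 0-999 grid, so we rescale them
    to pixel space before comparing against a ground-truth box.
    """
    m = re.search(r"\((\d+),(\d+)\),\((\d+),(\d+)\)", text)
    if m is None:
        return None  # no box in the response -> scored as a miss
    x1, y1, x2, y2 = (int(v) for v in m.groups())
    return (x1 / 999 * img_w, y1 / 999 * img_h,
            x2 / 999 * img_w, y2 / 999 * img_h)

# Example on a 1000x500 image: the full-grid box maps to the full image
print(parse_qwen_box("<box>(0,0),(999,999)</box>", 1000, 500))
# → (0.0, 0.0, 1000.0, 500.0)
```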
Can you share your LLaVA-1.5 REC task results on refcoco|+|g? This is our official LLaVA-1.5 evaluated result:
Hi @AderonHuang, here are the logs from our refcoco+ run with LLaVA-1.5-7B, though we evaluated this quite a while ago.
You can see the full score here.
Thanks for your brilliant work. I have solved my problem.
Hi, I also have another question: why can't refcoco|+|g evaluation use batch size > 1 for inference? I run it like this every time:

accelerate launch --main_process_port=29501 --num_processes=8 -m lmms_eval --model llava --model_args pretrained=/path,use_flash_attention_2=False,device_map="" --tasks refcoco,refcoco_bbox_rec --batch_size 1 --log_samples

If I want to achieve batch size > 1, what should I change?
Hi @AderonHuang, batch size > 1 is not supported for the original llava model. If you want to evaluate LLaVA with batch size > 1, you might want to take a look at llava_sglang.
However, llava_sglang seems not to support multi-process yet; I get a log like the one below. Is that expected?

accelerate launch --main_process_port=29501 --num_processes=8 -m lmms_eval --model llava_sglang --model_args pretrained=/path,use_flash_attention_2=False,device_map="" --tasks refcoco,refcoco_bbox_rec --batch_size 1 --log_samples

assert accelerator.num_processes == 1, "Llava-sglang does not support multi-processes yet"
Yes, multi-process is not supported by sglang, but you can use tensor parallelism to shard the model across multiple GPUs. This allows you to run batch sizes larger than the number of GPUs and accelerates the evaluation pipeline. You might want to take a look at #54 for more details.
When I run:

accelerate launch --main_process_port=29501 --num_processes=8 -m lmms_eval --model qwen_vl --model_args pretrained=/Qwen-VL/ --tasks refcoco,refcoco+,refcocog,refcoco_bbox_rec,refcoco+_bbox_rec,refcocog_bbox_rec --batch_size 1 --log_samples --log_samples_suffix qwenvl --output_path ./logs/

the Qwen-VL-7B refcoco|+|g CIDEr and IoU scores are all None. Can you help solve this problem?
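If the scores come back as None, one quick sanity check is to score a logged sample by hand: REC accuracy is conventionally IoU >= 0.5 between the predicted and ground-truth [x1, y1, x2, y2] boxes. A minimal sketch (`box_iou` is a hypothetical helper, not lmms_eval API):

```python
def box_iou(a, b):
    """Intersection-over-union of two [x1, y1, x2, y2] boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

# REC counts a prediction as correct when IoU >= 0.5
print(box_iou([0, 0, 10, 10], [5, 5, 15, 15]))  # 25 / 175 ≈ 0.1429
```

If every parsed prediction fails to produce a box at all (e.g. the response is free text rather than coordinates), the aggregate metric can degenerate, which matches the prompting-mismatch explanation above.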
Thank you for your contributions!