EvolvingLMMs-Lab / LongVA

Long Context Transfer from Language to Vision

Failing to reproduce the paper result on videomme #33

Open joslefaure opened 2 weeks ago

joslefaure commented 2 weeks ago

I used the same script as in the README:

```bash
accelerate launch --num_processes 8 --main_process_port 12345 -m lmms_eval \
    --model longva \
    --model_args pretrained=lmms-lab/LongVA-7B,conv_template=qwen_1_5,video_decode_backend=decord,max_frames_num=32,model_name=llava_qwen \
    --tasks videomme \
    --batch_size 1 \
    --log_samples \
    --log_samples_suffix videomme_longva \
    --output_path ./logs/
```

with the latest commit of lmms_eval (main branch): bcbdc493.

I get the following results:

| Tasks    | Version | Filter | n-shot | Metric                     | Value   | Stderr |
|----------|---------|--------|--------|----------------------------|---------|--------|
| videomme | Yaml    | none   | 0      | videomme_perception_score  | 23.5185 | ± N/A  |

Could you please advise on what I am doing wrong? Thanks

jzhang38 commented 1 week ago

We recently reran the evaluation, and the score is actually a bit higher than the one reported in the paper, because some bugs in the videomme data were fixed in lmms-eval.

Can you check the logs output by lmms-eval and see if there is anything unusual?
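
For example, something like the following can flag degenerate samples in the `--log_samples` output. This is a minimal sketch only: the log file names and the `logs`/`resps` layout vary across lmms-eval versions, so treat the paths and keys below as assumptions to adapt.

```python
# Minimal sketch: scan lmms-eval --log_samples output for degenerate generations.
# Assumptions (check against your lmms-eval version): samples are stored as
# JSON/JSONL files under ./logs/, and each record has a "resps" field shaped
# like [["<generated text>"]].
import json
from pathlib import Path

def iter_records(path: Path):
    """Yield sample records from a .json or .jsonl log file."""
    text = path.read_text()
    if path.suffix == ".jsonl":
        for line in text.splitlines():
            if line.strip():
                yield json.loads(line)
    else:
        data = json.loads(text)
        # Some versions nest samples under a "logs" key; fall back to the root.
        records = data.get("logs", []) if isinstance(data, dict) else data
        yield from records

degenerate = total = 0
for path in Path("./logs").rglob("*videomme*.json*"):
    for rec in iter_records(path):
        resp = rec.get("resps", [[""]])[0][0]
        total += 1
        # A response made of one repeated character (e.g. "!!!!") usually
        # points at broken logits rather than a bad prompt.
        if resp and len(set(resp.strip())) == 1:
            degenerate += 1

print(f"{degenerate}/{total} responses are a single repeated character")
```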

joslefaure commented 4 days ago

Thanks for your reply. Upon inspecting the results, I found an alarming number of generated responses that consist only of repeated exclamation marks under the "resps" key, e.g. "resps": [["!!!!!!!!!!!!!!!!"]]. Do you have an idea what the issue might be?

I first installed lmms-eval and then installed LongVA, following the official installation instructions for both projects.
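
To rule out the eval harness, my next step is to load the checkpoint directly and check for non-finite weights, since all-"!" generations often indicate non-finite logits (fp16 overflow, wrong dtype, or a corrupted download). This is a rough sketch: the `load_pretrained_model` call is adapted from my reading of the LongVA README quick-start, so the import path and signature are assumptions that may need adjusting.

```python
# Rough sketch: load LongVA directly (bypassing lmms-eval) and inspect the
# weights. The load_pretrained_model call below is assumed from the LongVA
# README quick-start; only standard PyTorch calls follow it.
import torch
from longva.model.builder import load_pretrained_model

tokenizer, model, image_processor, _ = load_pretrained_model(
    "lmms-lab/LongVA-7B", None, "llava_qwen", device_map="cuda:0"
)

# Any NaN/Inf parameter tensor would explain degenerate generations.
bad = [n for n, p in model.named_parameters() if not torch.isfinite(p).all()]
print("non-finite parameter tensors:", bad if bad else "none")
print("parameter dtype:", next(model.parameters()).dtype)
```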