EvolvingLMMs-Lab / LongVA

Long Context Transfer from Language to Vision

Failing to reproduce the paper result on videomme #33

Open joslefaure opened 2 weeks ago

joslefaure commented 2 weeks ago

I used the same script as in the README:

```bash
accelerate launch --num_processes 8 --main_process_port 12345 -m lmms_eval \
    --model longva \
    --model_args pretrained=lmms-lab/LongVA-7B,conv_template=qwen_1_5,video_decode_backend=decord,max_frames_num=32,model_name=llava_qwen \
    --tasks videomme \
    --batch_size 1 \
    --log_samples \
    --log_samples_suffix videomme_longva \
    --output_path ./logs/
```

with the latest commit of lmms_eval (main branch): bcbdc493.

I get the following results:

| Tasks    | Version | Filter | n-shot | Metric                     | Value   | Stderr |
|----------|---------|--------|--------|----------------------------|---------|--------|
| videomme | Yaml    | none   | 0      | videomme_perception_score  | 23.5185 | ± N/A  |

Could you please advise on what I am doing wrong? Thanks

jzhang38 commented 1 week ago

We recently reran the evaluation, and the score is actually a bit higher than the one reported in the paper, because some bugs in the videomme data were fixed in lmms-eval.

Can you check the logs output by lmms-eval and see if there is anything unusual?
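
For example, something like the following can flag degenerate samples in the `--log_samples` output. This is a minimal sketch only: the log file names and the `logs`/`resps` layout vary across lmms-eval versions, so treat the paths and keys below as assumptions to adapt.

```python
# Minimal sketch: scan lmms-eval --log_samples output for degenerate generations.
# Assumptions (check against your lmms-eval version): samples are stored as
# JSON/JSONL files under ./logs/, and each record has a "resps" field shaped
# like [["<generated text>"]].
import json
from pathlib import Path

def iter_records(path: Path):
    """Yield sample records from a .json or .jsonl log file."""
    text = path.read_text()
    if path.suffix == ".jsonl":
        for line in text.splitlines():
            if line.strip():
                yield json.loads(line)
    else:
        data = json.loads(text)
        # Some versions nest samples under a "logs" key; fall back to the root.
        records = data.get("logs", []) if isinstance(data, dict) else data
        yield from records

degenerate = total = 0
for path in Path("./logs").rglob("*videomme*.json*"):
    for rec in iter_records(path):
        resp = rec.get("resps", [[""]])[0][0]
        total += 1
        # A response made of one repeated character (e.g. "!!!!") usually
        # points at broken logits rather than a bad prompt.
        if resp and len(set(resp.strip())) == 1:
            degenerate += 1

print(f"{degenerate}/{total} responses are a single repeated character")
```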

joslefaure commented 4 days ago

Thanks for your reply. Upon inspecting the results, I found an alarming number of generated responses that consist only of repeated exclamation marks under the "resps" key, e.g. "resps": [["!!!!!!!!!!!!!!!!"]]. Do you have an idea what the issue might be?

I first installed lmms-eval and then installed LongVA, following the official installation instructions for both projects.
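
To rule out the eval harness, my next step is to load the checkpoint directly and check for non-finite weights, since all-"!" generations often indicate non-finite logits (fp16 overflow, wrong dtype, or a corrupted download). This is a rough sketch: the `load_pretrained_model` call is adapted from my reading of the LongVA README quick-start, so the import path and signature are assumptions that may need adjusting.

```python
# Rough sketch: load LongVA directly (bypassing lmms-eval) and inspect the
# weights. The load_pretrained_model call below is assumed from the LongVA
# README quick-start; only standard PyTorch calls follow it.
import torch
from longva.model.builder import load_pretrained_model

tokenizer, model, image_processor, _ = load_pretrained_model(
    "lmms-lab/LongVA-7B", None, "llava_qwen", device_map="cuda:0"
)

# Any NaN/Inf parameter tensor would explain degenerate generations.
bad = [n for n, p in model.named_parameters() if not torch.isfinite(p).all()]
print("non-finite parameter tensors:", bad if bad else "none")
print("parameter dtype:", next(model.parameters()).dtype)
```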