EvolvingLMMs-Lab / lmms-eval

Accelerating the development of large multimodal models (LMMs) with lmms-eval
https://lmms-lab.github.io/

The score is zero for `seedbench/seed_video` [Bug] #132

Open jungle-gym-ac opened 1 week ago

jungle-gym-ac commented 1 week ago

I evaluated the LLaVA-1.5 model (both the official checkpoint and a checkpoint I trained myself) on SEED-Bench. Here are the results:

The seed_image score is consistent with the LLaVA paper and the Google Sheet you provided, but the seed_video score is zero, which drags down the seed_all score as well. How can I get the expected results on seed_video?

kcz358 commented 6 days ago

Hi, I think this is because our current llava implementation appends multiple image tokens (one per image) instead of just one. In v0.1.0, we appended only a single image token to the question for seedbench.

If you change `batched_visuals` to `flattened_visuals` here, you should get the same result as v0.1.0 on seedbench:

https://github.com/EvolvingLMMs-Lab/lmms-eval/blob/7c208b76640c986cfe94233dce735c3ca4ad4319/lmms_eval/models/llava.py#L338-L348
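For concreteness, here is a minimal, self-contained sketch of the v0.1.0 vs. v0.2.0 difference described above. It is illustrative only; the real prompt construction lives at the permalink, and `DEFAULT_IMAGE_TOKEN` mirrors the `"<image>"` placeholder from `llava.constants`:

```python
# Illustrative sketch only -- the actual logic is at the permalink above.
DEFAULT_IMAGE_TOKEN = "<image>"  # llava's image placeholder token

def build_prompt_v020(frames, question):
    # v0.2.0 behavior: one image token per frame, so a seed_video sample
    # (8 frames) contributes 8 image tokens to the prompt.
    image_tokens = " ".join([DEFAULT_IMAGE_TOKEN] * len(frames))
    return f"{image_tokens}\n{question}"

def build_prompt_v010(frames, question):
    # v0.1.0 behavior: a single image token, regardless of frame count.
    return f"{DEFAULT_IMAGE_TOKEN}\n{question}"

frames = [f"frame_{i}" for i in range(8)]  # stand-ins for decoded video frames
print(build_prompt_v020(frames, "What is happening in the video?"))
print(build_prompt_v010(frames, "What is happening in the video?"))
```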

Here are my results after the change:

[screenshot: results after the change]
jungle-gym-ac commented 5 days ago

Thanks! I read the code you provided. So in v0.2.0, all 8 images from a video in seed-bench-video are fed as input. LLaVA-NeXT, which is trained with the anyres technique, can handle this multi-image input, but LLaVA-1.5, which doesn't use anyres training, cannot handle a multi-image sequence. Am I understanding this correctly?

I think we can add an if statement on the model version here and then apply the appropriate preprocessing (single- vs. multi-image), which would keep compatibility across LLaVA versions. Maybe like this:

[screenshot: proposed code change]
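The screenshot is not preserved here, but a version check in that spirit might look like the sketch below. The substring check on `model_name` is a hypothetical heuristic for illustration, not lmms-eval's actual API:

```python
def select_visuals(model_name: str, batched_visuals: list, flattened_visuals: list) -> list:
    """Pick the preprocessing path by model version (hypothetical heuristic)."""
    if "1.5" in model_name:  # e.g. "llava-v1.5-7b"; detection method is an assumption
        # LLaVA-1.5: no anyres training, so fall back to the single-image path
        return flattened_visuals
    # LLaVA-NeXT (v1.6+): trained with anyres, can take multi-image input
    return batched_visuals
```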
kcz358 commented 4 days ago

> Thanks! I read the code you provided. So in v0.2.0, all 8 images from a video in seed-bench-video are fed as input. LLaVA-NeXT, which is trained with the anyres technique, can handle this multi-image input, but LLaVA-1.5, which doesn't use anyres training, cannot handle a multi-image sequence. Am I understanding this correctly?

Yes, I think this is why LLaVA-1.5 cannot handle multiple images correctly.

> I think we can add an if statement on the model version here and then apply the appropriate preprocessing (single- vs. multi-image)

I feel the current implementation is the most correct way to reflect the model's ability on video evaluation. In the early version, only the first image was actually passed to the model because there was only one image token. I believe that in the official repo the author concatenates all frames of a video into one image; here we opt not to do so and instead test the model in a multi-image way.
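For reference, the frame-concatenation approach attributed to the official repo might look like this PIL sketch. The official layout is not specified in this thread; a horizontal strip is just one possibility:

```python
from PIL import Image

def concat_frames(frames: list) -> Image.Image:
    """Tile video frames into one horizontal strip (illustrative only)."""
    width = sum(f.width for f in frames)
    height = max(f.height for f in frames)
    canvas = Image.new("RGB", (width, height))
    x = 0
    for frame in frames:
        canvas.paste(frame, (x, 0))
        x += frame.width
    return canvas
```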