jungle-gym-ac opened this issue 1 week ago
Hi, I think this is because our current implementation appends multiple image tokens for LLaVA instead of just one image token. In v0.1.0, we appended only one image token to the question for seedbench.
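For illustration, the difference between the two prompt styles might be sketched like this. The `<image>` placeholder follows LLaVA conventions, but `build_prompt` and its parameters are assumptions for this sketch, not the actual lmms-eval code:

```python
def build_prompt(question: str, num_images: int, multi_image: bool = True) -> str:
    """Prepend one image token per frame (v0.2.0-style), or a single
    image token regardless of frame count (v0.1.0-style)."""
    n = num_images if multi_image else 1
    return "<image>\n" * n + question

# v0.2.0-style: 8 image tokens for an 8-frame video
print(build_prompt("What happens in the video?", 8).count("<image>"))  # 8
# v0.1.0-style: always a single image token
print(build_prompt("What happens in the video?", 8, multi_image=False).count("<image>"))  # 1
```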
If you change `batched_visuals` to `flattened_visuals` here, you will get the same result as seedbench in v0.1.0.
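As a rough sketch of what that flattening does (the variable layout here is an assumption, not the actual lmms-eval structure): `batched_visuals` keeps one list of images per question, while the flattened form collapses everything into a single flat list.

```python
def flatten_visuals(batched_visuals):
    """Flatten a nested visuals list, e.g. [[img, img], [img]] -> [img, img, img].
    This mirrors the batched -> flattened change discussed above, but the
    names and structure are illustrative assumptions."""
    return [img for batch in batched_visuals for img in batch]

batched = [["frame0", "frame1"], ["frame2"]]
print(flatten_visuals(batched))  # ['frame0', 'frame1', 'frame2']
```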
Here are my results after the change:
Thanks! I read the code you provided. So in v0.2.0, all 8 images from a video in seed-bench-video are fed as input. LLaVA-NeXT, which is trained with the `anyres` technique, can handle this multi-image input. But LLaVA-1.5, which doesn't adopt `anyres` training, cannot handle a multi-image sequence. Am I understanding this correctly?
I think we can add an `if` statement here on the model version and then adopt the appropriate preprocessing (single- or multi-image), which would be compatible with different LLaVA versions.
Maybe like this:
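A minimal sketch of that version-aware branch, assuming the model name is available as a string (the function and parameter names are hypothetical, not real lmms-eval identifiers):

```python
def preprocess_visuals(batched_visuals, model_name: str):
    """Choose single- vs multi-image preprocessing by model version.
    Hypothetical sketch: LLaVA-1.5 (no anyres training) gets a flat
    list of images; anyres-trained models keep the batched structure."""
    if "llava-1.5" in model_name.lower():
        # Collapse to one flat list, matching v0.1.0-style single-image handling.
        return [img for batch in batched_visuals for img in batch]
    # LLaVA-NeXT and other anyres-trained models: keep multi-image batches.
    return batched_visuals

print(preprocess_visuals([["f0", "f1"], ["f2"]], "llava-1.5-7b"))   # ['f0', 'f1', 'f2']
print(preprocess_visuals([["f0", "f1"]], "llava-next-vicuna-7b"))   # [['f0', 'f1']]
```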
> Thanks! I read the code you provided. So in v0.2.0, all 8 images from a video in seed-bench-video are fed as input. LLaVA-NeXT, which is trained with the `anyres` technique, can handle this multi-image input. But LLaVA-1.5, which doesn't adopt `anyres` training, cannot handle a multi-image sequence. Am I understanding this correctly?
Yes, I think this is why LLaVA-1.5 cannot handle multi-image inputs correctly.
> I think we can add an `if` statement here about the model version and then adopt the appropriate preprocessing (single/multi-image).
I feel the current implementation is the most correct way to reflect the model's ability on video evaluation. In the early version, only the first image was passed to the model because there was only one image token. I think in the official repo, the authors concatenate all frames of a video into one image. Here we opt not to do so and choose to test the model in a multi-image way.
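For reference, the frame-concatenation strategy mentioned above might look like the sketch below. The horizontal-strip layout is an assumption; the official repo may tile frames differently.

```python
import numpy as np

def concat_frames(frames):
    """Stack equally sized HxWxC frames side by side into one
    Hx(W*n)xC image, so a multi-frame video becomes a single image.
    Illustrative sketch only, not the official implementation."""
    return np.concatenate(frames, axis=1)

# 8 dummy 32x32 RGB frames -> one 32x256 strip
frames = [np.zeros((32, 32, 3), dtype=np.uint8) for _ in range(8)]
print(concat_frames(frames).shape)  # (32, 256, 3)
```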
I evaluated the LLaVA-1.5 model (both the official checkpoint and a checkpoint I trained myself) on SeedBench. Here are the results:
The `seed_image` score is consistent with the LLaVA paper and the Google Sheet you provided, but the `seed_video` score is ZERO, resulting in a low `seed_all` score as well. How can I get the expected results on `seed_video`?