I downloaded the "lmms-lab/LLaVA-NeXT-Interleave-Bench" dataset and the "llava-onevision-qwen2-7b-ov" checkpoint from Hugging Face to reproduce the results reported in the paper, but several benchmark scores differ substantially from the published numbers (e.g. IEI, Q-Bench, 3D-Chat, MathVerse, SciVerse). What could explain this discrepancy?