EvolvingLMMs-Lab / lmms-eval

Accelerating the development of large multimodal models (LMMs) with lmms-eval
https://lmms-lab.github.io/

Unreasonably high scores in llava_in_the_wild benchmark #64

Open wusize opened 2 months ago

wusize commented 2 months ago

[image: screenshot of the reported llava_in_the_wild scores]

I tested Llava-1.5-hf on the llava_in_the_wild benchmark and got an unreasonably high score (96.1) compared to the one reported in the paper. Any idea what might be causing this?

kcz358 commented 2 months ago

We noticed that you got 100 for both detail and complex. This most likely means your GPT-4 queries returned errors, so the results were auto-filled with [-1, -1]. Otherwise it is very hard to explain why the model only gets 74.9 in conv if it is really that good.
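For context, the relative score in this benchmark is roughly the ratio of the model's GPT-judged score to the reference answer's score, so an all-[-1, -1] fallback collapses to exactly 100. Below is a minimal sketch of that effect, assuming a hypothetical helper and hand-picked numbers rather than the repo's actual code:

```python
def relative_score(score_pairs):
    """score_pairs: list of (reference_score, model_score) pairs from the GPT judge."""
    ref_total = sum(ref for ref, _ in score_pairs)
    model_total = sum(model for _, model in score_pairs)
    # Category score is reported as the model/reference ratio, scaled to 100.
    return 100.0 * model_total / ref_total

# Normal case: the judge rates the model slightly below the reference answers.
print(relative_score([(9, 7), (8, 6)]))      # ~76.5

# Failure case: every API call errored and was filled with [-1, -1],
# so model_total == ref_total and the category score becomes exactly 100.
print(relative_score([(-1, -1), (-1, -1)]))  # 100.0
```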

Also note that it is now almost impossible to replicate the exact scores from the report, since gpt-4-0314 is deprecated and different versions have different tastes. If you want your results to closely match the original implementation rather than the hf one, we recommend using the original implementation, since the two differ in some small implementation details.

wusize commented 2 months ago

Thanks for the quick reply! I believe it was indeed caused by an error when calling the OpenAI API.