InternLM / InternLM-XComposer

InternLM-XComposer-2.5: A Versatile Large Vision Language Model Supporting Long-Contextual Input and Output

Reproducibility for MME evaluation #169

Closed ByungKwanLee closed 7 months ago

ByungKwanLee commented 7 months ago

With the 4-bit quantized model on Hugging Face, I could not reproduce the MME performance in my environment.

What is the most important factor for reproducing the performance? My setup differs from the paper in the following ways (a rough sketch of these settings follows the list):

  1. Image padding (I simply resized images to 490×490, but the paper pads them on the left, right, top, and bottom.)
  2. Prompt (I used "Answer the question using a single word or phrase", but the paper uses "Answer the question briefly".)
  3. Generation hyperparameters (I used plain greedy search with num_beams=1, temperature=1, do_sample=False, but the paper uses beam search with num_beams=5, temperature=1.0.)
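
For reference, here is an illustrative Python sketch of the three settings above. The `pad_to_square` helper, the 490-pixel target taken from my setup, and the generation kwargs are assumptions for illustration only, not the repository's evaluation code:

```python
# Sketch of the three settings in question (padding, prompt, decoding).
# Not the official eval code; names and defaults are assumptions.
from PIL import Image


def pad_to_square(img: Image.Image, target: int = 490,
                  fill=(255, 255, 255)) -> Image.Image:
    """Resize the longer side to `target`, then pad the shorter side
    symmetrically (left/right or top/bottom) instead of distorting the
    aspect ratio with a plain square resize."""
    w, h = img.size
    scale = target / max(w, h)
    img = img.resize((round(w * scale), round(h * scale)))
    canvas = Image.new("RGB", (target, target), fill)
    # Center the resized image so the padding is split between both sides.
    canvas.paste(img, ((target - img.width) // 2, (target - img.height) // 2))
    return canvas


# Prompt wording used in the paper (vs. "using a single word or phrase").
prompt = "Answer the question briefly."

# Beam search rather than greedy decoding.
gen_kwargs = dict(num_beams=5, temperature=1.0, do_sample=False)
```

Padding preserves the aspect ratio, whereas a plain resize to 490×490 distorts it, which is one plausible source of a score gap.
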
myownskyW7 commented 7 months ago

The original fp16 model is necessary to reproduce the benchmark results. Please use the evaluation code here: https://github.com/InternLM/InternLM-XComposer/blob/main/evaluation/mme/eval.py
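
For context, a minimal sketch of loading the original fp16 weights instead of the 4-bit checkpoint, assuming the Hugging Face repo id `internlm/internlm-xcomposer2d5-7b` (adjust to the checkpoint you are evaluating); the benchmark itself should be run with the linked eval.py:

```python
# Minimal sketch: load the original fp16 weights, no 4-bit quantization.
# The repo id below is an assumption; substitute the model you evaluate.
import torch
from transformers import AutoModel, AutoTokenizer

model_id = "internlm/internlm-xcomposer2d5-7b"  # assumed repo id
model = AutoModel.from_pretrained(
    model_id,
    torch_dtype=torch.float16,  # full fp16 weights
    trust_remote_code=True,
).cuda().eval()
tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
```
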