With the 4-bit quantized model from Hugging Face, I could not reproduce the reported MME performance in my environment.
Which of the following matters most for reproducing the performance?
Image padding (I simply resized the image to 490x490, but the paper pads the image on the left, right, top, and bottom)
Prompt (I used "Answer the question using a single word or a phrase", but the paper used "Answer the question briefly")
Generation hyperparameters (I used plain greedy search (num_beams=1, temperature=1, do_sample=False), but the paper used beam search with num_beams=5, temperature=1.0)
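
To make the comparison concrete, here is a minimal sketch of the two preprocessing variants and the two decoding setups listed above. It only computes geometry, so it runs without any image library. The function names are hypothetical, and I am assuming that "padding to left, right, top, bottom" means centring the image on a square canvas before resizing to 490x490; the paper's actual preprocessing may differ.

```python
def pad_to_square_box(width, height):
    """Return (side, left, top) for centring a width x height image
    on a square canvas (assumed meaning of the paper's padding)."""
    side = max(width, height)
    left = (side - width) // 2   # padding added on the left
    top = (side - height) // 2   # padding added on the top
    return side, left, top


def resized_content_size(width, height, target=490, pad=True):
    """Size of the original image content inside the final
    target x target model input."""
    if not pad:
        # Plain resize: content is stretched to fill the square,
        # so the aspect ratio is not preserved.
        return target, target
    side, _, _ = pad_to_square_box(width, height)
    scale = target / side
    # Padding first keeps the aspect ratio of the content.
    return round(width * scale), round(height * scale)


# Decoding setups from the list above, written as keyword arguments
# in the style of Hugging Face `generate`.
GREEDY = {"num_beams": 1, "do_sample": False, "temperature": 1.0}  # my setup
BEAM = {"num_beams": 5, "temperature": 1.0}                        # paper's setup

if __name__ == "__main__":
    print(pad_to_square_box(640, 480))
    print(resized_content_size(980, 490, pad=True))
    print(resized_content_size(980, 490, pad=False))
```

With plain resizing, a non-square image is distorted to fill the 490x490 input, whereas padding first keeps its aspect ratio, so the two variants can give the model visibly different inputs for the same image.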