TIGER-AI-Lab / Mantis

Official code for the paper "Mantis: Multi-Image Instruction Tuning"
https://tiger-ai-lab.github.io/Mantis/
Apache License 2.0

Question about mantis-eval matching criteria #7

Closed azshue closed 3 months ago

azshue commented 5 months ago

Hi,

Thank you for open-sourcing this great work. I appreciate the team's efforts in putting this together.

I have a question about the evaluation criteria in mantis-eval, specifically for "short-answer" questions. It looks like the correctness of a "short-answer" response is judged by an exact match between the model's output and the reference answer, ~without further parsing~ (see the edit below). But the prompt template for this type of question also instructs the model to output both an analysis and a final answer.

In this case, I noticed that a model would give the correct answer (for example, "Yes") followed by some reasoning, but such an answer wouldn't be counted as correct because of how the exact match works.

Could you help me understand why it's written like this? Does it make sense to improve the matching rule? Thanks.

Edit: I just saw that there is parsing on the model's output that only keeps the text after "Final Answer: ". That makes much more sense. However, I noticed that sometimes a model answers correctly but with more than one word. Do you think it makes sense to loosen the matching criteria? Alternatively, it might also help to make the instruction in the prompt template clearer, for example by adding a sentence like "Answer the question in a single word or phrase."
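
For concreteness, here is a rough sketch of the looser matching I have in mind. Function names are illustrative only, not the actual Mantis-Eval parsing script:

```python
# Sketch of a looser matching rule for short-answer questions.
# Names are hypothetical; this is not the repo's evaluation code.
import re


def extract_final_answer(output: str) -> str:
    """Keep only the text after the last 'Final Answer:' marker, if present."""
    parts = re.split(r"final answer\s*:\s*", output, flags=re.IGNORECASE)
    return parts[-1].strip()


def normalize(text: str) -> str:
    """Lowercase, strip punctuation, and collapse whitespace."""
    text = re.sub(r"[^\w\s%.-]", " ", text.lower())
    return re.sub(r"\s+", " ", text).strip()


def loose_match(prediction: str, reference: str) -> bool:
    pred = normalize(extract_final_answer(prediction))
    ref = normalize(reference)
    if pred == ref:
        return True
    # Accept a short answer embedded in a longer phrase, e.g. prediction
    # "Yes, the two images match." against reference "Yes".
    return re.search(rf"\b{re.escape(ref)}\b", pred) is not None
```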

wenhuchen commented 4 months ago

Thanks for the suggestion. This makes a lot of sense. I think the Boolean and numerical ones can be matched more flexibly. It should boost the final score a bit.
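
Something along these lines could work for the Boolean and numerical cases (just a sketch of the direction, not the current scoring code):

```python
# Hypothetical helpers for more flexible Boolean/numerical matching.
import re


def match_boolean(pred: str, ref: str) -> bool:
    truthy, falsy = {"yes", "true"}, {"no", "false"}
    # Let the first token decide, so "Yes, because ..." still counts as "yes".
    first = pred.lower().strip().split(",")[0].split()
    p = first[0].strip(".") if first else ""
    r = ref.lower().strip(" .")
    return (p in truthy and r in truthy) or (p in falsy and r in falsy)


def match_number(pred: str, ref: str, tol: float = 1e-6) -> bool:
    # Accept the answer if any number in the prediction matches the reference.
    nums = re.findall(r"-?\d+(?:\.\d+)?", pred)
    try:
        return any(abs(float(n) - float(ref)) <= tol for n in nums)
    except ValueError:
        return False
```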

jdf-prog commented 4 months ago

@azshue Thanks for raising the concern. Yeah, our exact-matching rule still has some room for improvement. We encourage you to modify the parsing script a bit to make it better and more reasonable.

Besides, it's also worth noting that short-answer questions make up only a small portion of Mantis-Eval (7.8%, per the Hugging Face dataset viewer statistics), so the score boost from looser matching will be somewhat limited. However, better exact-matching rules are still welcome.
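
As a back-of-envelope bound on what that share implies (assuming overall accuracy is simply the fraction of questions answered correctly):

```python
# Rough bound using the 7.8% short-answer share quoted above; the flip rates
# below are hypothetical, just to illustrate the scale of the possible gain.
short_answer_share = 0.078


def max_gain(flip_rate: float) -> float:
    """Accuracy gain (in points) if `flip_rate` of the short-answer questions
    go from wrongly rejected to correct under a looser matching rule."""
    return 100 * short_answer_share * flip_rate


print(max_gain(1.0))  # 7.8 points at most, even if every short answer flips
print(max_gain(0.5))  # 3.9 points for a (hypothetical) 50% flip rate
```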