zjwu0522 closed this issue 3 weeks ago
Great observation, we will fix it soon.
Thanks for the catch! This likely comes from an update I made a few days ago: I slightly changed the parsing for better consistency (i.e., all parsing outputs are strings), but didn't rerun the sanity check. Will fix it soon.
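To illustrate the failure mode, here is a minimal sketch (not the repository's actual code) of how a stringified parse output can no longer match structured ground truth, and how it could be coerced back; the field values below are made up:

```python
import ast


def coerce_parsed_value(value):
    """Turn a stringified Python literal, e.g. "[[0.1, 0.2, 0.5, 0.6]]",
    back into the structured object the scorer expects."""
    if isinstance(value, str):
        try:
            return ast.literal_eval(value)
        except (ValueError, SyntaxError):
            return value  # leave free-form text untouched
    return value


# After the parsing change, predictions may arrive as strings even when
# the ground truth is structured.
pred = "[[0.1, 0.2, 0.5, 0.6]]"
gold = [[0.1, 0.2, 0.5, 0.6]]

print(pred == gold)                       # False: string vs. list scores 0
print(coerce_parsed_value(pred) == gold)  # True after coercion
```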
Thanks for the quick response! Appreciate your efforts on building this high quality benchmark. Looking forward to seeing the update! 🚀🚀
Thank you for your interest and nice words! It's fixed now in 1ae20ee
I can now obtain a perfect score for the sanity check. Everything looks great, so I'll close this issue. Thanks!
Description:
The evaluation benchmark fails to achieve a perfect score (1) on a sanity-check run, where the predictions should match the ground-truth answers exactly. The observed scores are slightly below perfect, indicating a potential issue in parsing or score calculation.
overall_macro_mean_score: 0.9977272727272727
overall_micro_mean_score: 0.9972439136426274
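For reference, here is a hedged sketch of how macro and micro mean scores are typically aggregated; the benchmark's actual aggregation may differ, and the task names and per-example scores below are invented:

```python
# Hypothetical illustration: `results` maps task name -> per-example scores.
from statistics import mean

results = {
    "waldo": [1.0, 0.0, 0.0, 1.0],   # made-up per-example scores
    "other_task": [1.0, 1.0, 1.0],
}

# Micro mean: average over every example, regardless of task.
micro = mean(s for scores in results.values() for s in scores)

# Macro mean: average the per-task means, so each task weighs equally.
macro = mean(mean(scores) for scores in results.values())

print(f"overall_micro_mean_score: {micro:.4f}")
print(f"overall_macro_mean_score: {macro:.4f}")
```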
Error Details
The benchmark fails specifically in the task 'waldo' under the bounding_boxes field. Multiple examples result in a score of 0, despite having access to the ground truth. Below is the log output indicating these issues.

Full Log Output
Expected Outcome
The sanity check evaluation should yield overall_macro_mean_score and overall_micro_mean_score equal to 1, indicating complete accuracy against the ground-truth answers.

Impact
This issue may affect evaluation reliability: even minor parsing or scoring discrepancies can undermine the benchmark's validity for accurately scoring tasks.
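As a purely hypothetical illustration of the 'waldo' bounding_boxes failures described under Error Details, the sketch below shows one common way such a field can be scored (IoU against a ground-truth box) and how a prediction that arrives as a string can fall through a type check and score 0 even when it matches the ground truth. This is not the benchmark's actual scorer.

```python
def iou(box_a, box_b):
    """IoU of two [x1, y1, x2, y2] boxes."""
    x1 = max(box_a[0], box_b[0])
    y1 = max(box_a[1], box_b[1])
    x2 = min(box_a[2], box_b[2])
    y2 = min(box_a[3], box_b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union else 0.0


def score_bounding_boxes(pred, gold, threshold=0.5):
    # A scorer that expects a list of boxes returns 0 for a raw string.
    if not isinstance(pred, (list, tuple)):
        return 0.0
    return float(any(iou(p, gold) >= threshold for p in pred))


gold_box = [10, 10, 50, 50]
print(score_bounding_boxes("[[10, 10, 50, 50]]", gold_box))  # 0.0 (string)
print(score_bounding_boxes([[10, 10, 50, 50]], gold_box))    # 1.0
```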
Steps to Reproduce