TIGER-AI-Lab / MEGA-Bench

This repo contains the code and data for "MEGA-Bench: Scaling Multimodal Evaluation to over 500 Real-World Tasks".
https://tiger-ai-lab.github.io/MEGA-Bench/
Apache License 2.0

Imperfect Score in Sanity Check #1

Closed: zjwu0522 closed this issue 3 weeks ago

zjwu0522 commented 3 weeks ago

Description:

The evaluation benchmark fails to achieve a perfect score (1.0) on a sanity check run, in which the responses are the ground-truth answers themselves and should therefore match exactly. The observed scores are slightly below 1.0, indicating a potential issue in answer parsing or score calculation.

Error Details

The failure occurs in the task 'waldo' under the bounding_boxes field: multiple examples receive a score of 0 even though the responses are the ground-truth answers. The log output below shows the failing examples.

Full Log Output

Answer:{'waldo': '(0.861, 0.036, 0.880, 0.052)', 'whitebeard': '(0.966, 0.940, 0.986, 0.958)'}
Example did not get a score of 1: task_name='waldo', field='bounding_boxes', query['task_idx']=13, score=0
Task:waldo, cannot parse I have access to the ground truth answer, so I'll just return that.

Answer:{'waldo': '(0.283, 0.486, 0.306, 0.513)', 'whitebeard': '(0.431, 0.543, 0.457, 0.573)', 'odlaw': '(0.474, 0.313, 0.501, 0.349)', 'wenda': '(0.673, 0.326, 0.704, 0.366)'}
Example did not get a score of 1: task_name='waldo', field='bounding_boxes', query['task_idx']=14, score=0
Task:waldo, cannot parse I have access to the ground truth answer, so I'll just return that.

Answer:{'whitebeard': '(0.018, 0.759, 0.026, 0.782)', 'odlaw': '(0.413, 0.341, 0.420, 0.365)', 'waldo': '(0.881, 0.227, 0.887, 0.244)', 'wenda': '(0.577, 0.866, 0.585, 0.880)'}
Example did not get a score of 1: task_name='waldo', field='bounding_boxes', query['task_idx']=15, score=0
Task:waldo, cannot parse I have access to the ground truth answer, so I'll just return that.

Answer:{'waldo': '(0.389, 0.156, 0.396, 0.170)', 'wenda': '(0.358, 0.503, 0.366, 0.531)'}
Example did not get a score of 1: task_name='waldo', field='bounding_boxes', query['task_idx']=16, score=0
Task:waldo, cannot parse I have access to the ground truth answer, so I'll just return that.

Answer:{'waldo': '(0.810, 0.453, 0.827, 0.473)', 'odlaw': '(0.052, 0.776, 0.072, 0.807)', 'wenda': '(0.899, 0.554, 0.914, 0.595)'}
Example did not get a score of 1: task_name='waldo', field='bounding_boxes', query['task_idx']=17, score=0

### Sanity Check Eval: Imperfect Score Tasks ###
Failed task: waldo, score: 0.0

Expected Outcome

The sanity check evaluation should yield a perfect score of 1.0 on every task, since the responses are the ground-truth answers themselves.

Impact

This issue may affect evaluation reliability: even small parsing discrepancies undermine the benchmark's validity, since correct answers can be scored as wrong.

Steps to Reproduce

cd megabench

python main.py \
   --model_type GROUND_TRUTH_ORACLE_SANITY_CHECK \
   --output_file results/Ground_truth_oracle_sanity_check/all_query_responses.json \
   --force_regenerate \
   --multiprocess --processes 64 \
   --dataset_name TIGER-Lab/MEGA-Bench \
   --dataset_subset_name core
wenhuchen commented 3 weeks ago

Great observation, we will fix it soon.

woodfrog commented 3 weeks ago

Thanks for the catch! This should come from an update I made a few days ago. I slightly changed the parsing for better consistency (i.e., all parsing outputs are strings), but didn't rerun the sanity check. Will fix it soon.
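The string-typed parsing outputs mentioned above would explain the zero scores: if a bounding box is returned as a string like '(0.861, 0.036, 0.880, 0.052)' while the scorer expects a numeric tuple, the comparison fails even for ground-truth answers. A minimal sketch of the kind of normalization step involved, using hypothetical helper names rather than the repo's actual parser:

```python
import ast


def parse_bbox(value):
    """Normalize a bounding box to a tuple of floats.

    Hypothetical helper for illustration: accepts either a string
    like '(0.861, 0.036, 0.880, 0.052)' or an already-parsed
    sequence of numbers.
    """
    if isinstance(value, str):
        # Safely evaluate the literal tuple syntax in the string.
        value = ast.literal_eval(value)
    return tuple(float(v) for v in value)


# A string-typed answer compares unequal to the numeric ground truth
# unless both sides are normalized first.
answer = '(0.861, 0.036, 0.880, 0.052)'
truth = (0.861, 0.036, 0.880, 0.052)

assert answer != truth                            # raw comparison fails, score 0
assert parse_bbox(answer) == parse_bbox(truth)    # normalized comparison passes
```

This is only a sketch of the failure mode under the assumption that the scorer compares parsed values; the actual fix landed in the repo's own parsing code.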

zjwu0522 commented 3 weeks ago

Thanks for the quick response! Appreciate your efforts on building this high-quality benchmark. Looking forward to seeing the update! 🚀🚀

woodfrog commented 3 weeks ago

> Thanks for the quick response! Appreciate your efforts on building this high-quality benchmark. Looking forward to seeing the update! 🚀🚀

Thank you for your interest and nice words! It's fixed now in 1ae20ee

zjwu0522 commented 3 weeks ago

> Thanks for the quick response! Appreciate your efforts on building this high-quality benchmark. Looking forward to seeing the update! 🚀🚀
>
> Thank you for your interest and nice words! It's fixed now in 1ae20ee

I can now obtain a perfect score for the sanity check. Everything looks great, so I'll close this issue. Thanks!