Closed liyucheng09 closed 10 months ago
dataset version mode yi-6b-hf/accuracy-clean yi-6b-hf/accuracy-input-contaminated yi-6b-hf/accuracy-input-and-label-contaminated qwen-7b-hf/accuracy-clean qwen-7b-hf/accuracy-input-contaminated qwen-7b-hf/accuracy-input-and-label-contaminated llama-2-7b-hf/accuracy-clean llama-2-7b-hf/accuracy-input-contaminated llama-2-7b-hf/accuracy-input-and-label-contaminated
ceval-computer_network 9b9417 ppl 54.55 50.00 83.33 54.55 50.00 50.00 18.18 0.00 66.67
ceval-operating_system b2b8cf ppl 57.14 nan 80.00 42.86 nan 60.00 21.43 nan 40.00
ceval-computer_architecture 1bd275 ppl 85.71 100.00 100.00 85.71 100.00 58.33 14.29 0.00 41.67
ceval-college_programming 2d0833 ppl 63.64 0.00 85.71 45.45 0.00 71.43 31.82 0.00 14.29
ceval-college_physics fb7e04 ppl 50.00 25.00 66.67 16.67 0.00 11.11 33.33 75.00 22.22
ceval-college_chemistry 916b7d ppl 47.62 100.00 50.00 57.14 100.00 50.00 19.05 0.00 50.00
ceval-advanced_mathematics 5cad2a ppl 42.11 nan nan 21.05 nan nan 21.05 nan nan
ceval-probability_and_statistics a6b30e ppl 27.78 nan nan 33.33 nan nan 27.78 nan nan
ceval-discrete_mathematics 68be68 ppl 42.86 0.00 0.00 28.57 0.00 0.00 21.43 100.00 0.00
ceval-electrical_engineer 056c2e ppl 44.44 75.00 80.00 22.22 50.00 46.67 33.33 25.00 26.67
ceval-metrology_engineer 4a757a ppl 75.00 50.00 78.57 50.00 100.00 78.57 25.00 0.00 50.00
ceval-high_school_mathematics a8ed21 ppl 33.33 nan nan 16.67 nan nan 33.33 nan nan
ceval-high_school_physics e1fc86 ppl 66.67 50.00 80.00 75.00 50.00 60.00 8.33 50.00 20.00
ceval-high_school_chemistry 9021c6 ppl 56.25 nan 100.00 75.00 nan 33.33 12.50 nan 66.67
ceval-high_school_biology c7f5a1 ppl 44.44 nan 90.00 55.56 nan 90.00 11.11 nan 30.00
ceval-middle_school_mathematics 213989 ppl 33.33 100.00 100.00 26.67 100.00 100.00 6.67 100.00 33.33
ceval-middle_school_biology ce0420 ppl 100.00 nan 90.91 100.00 nan 90.91 40.00 nan 54.55
ceval-middle_school_physics 78f3af ppl 57.14 100.00 90.91 71.43 100.00 90.91 42.86 0.00 54.55
ceval-middle_school_chemistry d071d2 ppl 91.67 nan 100.00 91.67 nan 100.00 8.33 nan 25.00
ceval-veterinary_medicine cd3a07 ppl 61.54 nan 90.00 46.15 nan 70.00 46.15 nan 30.00
ceval-college_economics a35346 ppl 57.89 75.00 62.50 52.63 50.00 43.75 5.26 25.00 40.62
ceval-business_administration 69dd6a ppl 61.54 100.00 77.78 53.85 0.00 66.67 30.77 50.00 50.00
ceval-marxism 283ce0 ppl 100.00 100.00 87.50 80.00 100.00 75.00 40.00 0.00 50.00
ceval-mao_zedong_thought f38cd1 ppl 100.00 nan 94.44 66.67 nan 72.22 33.33 nan 55.56
ceval-education_science fbd65c ppl 81.82 100.00 82.35 72.73 100.00 76.47 27.27 100.00 52.94
ceval-teacher_qualification c77f1f ppl 94.44 100.00 95.65 83.33 50.00 95.65 61.11 0.00 21.74
ceval-high_school_politics bbac37 ppl 92.86 50.00 100.00 100.00 100.00 100.00 50.00 50.00 33.33
ceval-high_school_geography 730a30 ppl 63.64 nan 100.00 81.82 nan 87.50 36.36 nan 37.50
ceval-middle_school_politics 15b2d7 ppl 90.00 nan 0.00 90.00 nan 100.00 50.00 nan 0.00
ceval-middle_school_geography b00167 ppl 100.00 100.00 100.00 66.67 100.00 100.00 0.00 0.00 25.00
ceval-modern_chinese_history 5a04cd ppl 75.00 nan 86.67 75.00 nan 86.67 62.50 nan 13.33
ceval-ideological_and_moral_cultivation 0829ff ppl 80.00 nan 100.00 100.00 nan 78.57 80.00 nan 42.86
ceval-logic c9c394 ppl 86.67 nan 57.14 60.00 nan 57.14 33.33 nan 14.29
ceval-law cbd3c5 ppl 53.33 66.67 83.33 26.67 0.00 50.00 33.33 33.33 50.00
ceval-chinese_language_and_literature 716ab3 ppl 46.15 100.00 66.67 38.46 0.00 55.56 30.77 0.00 33.33
ceval-art_studies 476114 ppl 71.43 nan 68.42 64.29 nan 63.16 28.57 nan 52.63
ceval-professional_tour_guide 70f30f ppl 100.00 100.00 88.24 90.00 50.00 64.71 50.00 50.00 29.41
ceval-legal_professional f19cf5 ppl 71.43 50.00 71.43 57.14 50.00 42.86 35.71 50.00 28.57
ceval-high_school_chinese 931614 ppl 83.33 nan 75.00 91.67 nan 75.00 33.33 nan 25.00
ceval-high_school_history 4d6364 ppl 75.00 66.67 100.00 83.33 100.00 100.00 50.00 66.67 40.00
ceval-middle_school_history 7f6356 ppl 90.91 100.00 100.00 100.00 100.00 100.00 18.18 0.00 33.33
ceval-civil_servant a5dcb8 ppl 68.42 80.00 58.82 73.68 80.00 41.18 26.32 20.00 17.65
ceval-sports_science 192553 ppl 100.00 100.00 55.56 62.50 50.00 44.44 62.50 50.00 22.22
ceval-plant_protection f7ff86 ppl 66.67 100.00 88.89 66.67 0.00 77.78 41.67 0.00 33.33
ceval-basic_medicine a95a09 ppl 77.78 nan 90.00 77.78 nan 60.00 11.11 nan 30.00
ceval-clinical_medicine 664b54 ppl 78.57 0.00 100.00 71.43 0.00 42.86 14.29 0.00 42.86
ceval-urban_and_rural_planner fdae6f ppl 78.57 100.00 53.33 64.29 0.00 73.33 42.86 0.00 20.00
ceval-accountant d810a1 ppl 70.59 71.43 84.00 52.94 42.86 68.00 35.29 28.57 12.00
ceval-fire_engineer bb924d ppl 41.67 0.00 83.33 41.67 0.00 72.22 25.00 0.00 38.89
ceval-environmental_impact_assessment_engineer d59200 ppl 76.19 50.00 75.00 57.14 0.00 75.00 33.33 0.00 25.00
ceval-tax_accountant 9e16f2 ppl 64.52 nan 77.78 35.48 nan 61.11 22.58 nan 22.22
ceval-physician 0e90d5 ppl 79.17 100.00 70.83 62.50 100.00 58.33 29.17 100.00 29.17
ceval-humanities - ppl 74.42 75.00 82.14 67.44 50.00 70.54 37.98 41.67 33.93
ceval-stem - ppl 53.70 57.14 85.61 47.41 52.38 67.63 23.70 33.33 36.69
ceval-social-science - ppl 81.60 84.62 83.09 76.00 61.54 72.79 36.80 30.77 41.18
ceval-other - ppl 72.31 73.91 75.00 58.46 39.13 61.88 30.77 21.74 25.00
ceval-hard - ppl 44.35 37.50 70.00 41.13 25.00 30.00 21.77 62.50 30.00
ceval - ppl 67.32 71.01 81.17 58.97 49.28 67.82 30.46 30.43 33.82
This is an example of what the results look like with contamination analysis support added.
We can compare performance on the clean and dirty sets to get a direct impression of how far models rely on memorisation rather than exhibiting true generalisation capability.
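The comparison can be made concrete with a minimal sketch. It uses only the overall `ceval` accuracies from the last row of the table above. The helper name `contamination_gaps` is hypothetical and not part of the PR or the harness.

```python
# Overall `ceval` accuracies copied from the last row of the table above,
# in the order (clean, input contaminated, input-and-label contaminated).
overall = {
    "yi-6b-hf": (67.32, 71.01, 81.17),
    "qwen-7b-hf": (58.97, 49.28, 67.82),
    "llama-2-7b-hf": (30.46, 30.43, 33.82),
}


def contamination_gaps(scores):
    """Return (input gap, input-and-label gap) relative to the clean subset.

    Hypothetical helper: a positive gap means the model scores higher on the
    contaminated subset than on the clean one.
    """
    clean, input_only, input_and_label = scores
    return round(input_only - clean, 2), round(input_and_label - clean, 2)


gaps = {model: contamination_gaps(s) for model, s in overall.items()}
for model, (gap_input, gap_both) in gaps.items():
    print(f"{model}: input {gap_input:+.2f}, input-and-label {gap_both:+.2f}")
```

On these numbers, yi-6b-hf scores 13.85 points higher on the input-and-label contaminated subset than on the clean subset, while llama-2-7b-hf gains 3.36 points.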
Closing because, as discussed on Discord, the best way to integrate with this tool would be for it to upload a cleaned/contaminated copy of the dataset to the HF Hub, so that one can then run the task pointing to that copy.
With our --use_cache feature, this should not increase inference runtime at all.
Data contamination is an urgent crisis in the evaluation community. People in the media and reviewers on OpenReview have raised many concerns about contamination in LLM evaluation.
I have developed a tool to purify existing benchmarks and assess the true capability of LLMs.
More specifically, I categorise test samples into three subsets using a search engine: clean, input contaminated, and input-and-label contaminated.
By testing LLMs on the clean set, we can avoid the impact of memorisation on the evaluation. And by comparing performance on the clean and contaminated sets, we can get an impression of how seriously existing evaluations have been affected by data contamination.
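A rough sketch of the three-way split, using the subset names from the results table's column headers, might look like the following. This is an illustration under stated assumptions, not the tool's actual method: `categorise` and `retrieved_pages` are hypothetical names, the search step is left out, and exact substring matching stands in for whatever matching the tool really uses.

```python
def categorise(question: str, answer: str, retrieved_pages: list[str]) -> str:
    """Assign a test sample to one of the three subsets.

    Assumption: a sample counts as contaminated when its question text appears
    verbatim (after case and whitespace normalisation) in a page returned by a
    search engine. This is a simplification for illustration only.
    """
    q = " ".join(question.lower().split())
    a = " ".join(answer.lower().split())
    label = "clean"
    for page in retrieved_pages:
        text = " ".join(page.lower().split())
        if q in text:
            if a in text:
                # Both the question and its answer were found on the same page.
                return "input-and-label contaminated"
            # Only the question was found so far; keep checking other pages.
            label = "input contaminated"
    return label


pages = ["Quiz: What is the capital of France? Answer: Paris."]
print(categorise("What is the capital of France?", "Paris", pages))
print(categorise("Who wrote Hamlet?", "Shakespeare", pages))
```

Substring matching misses paraphrased or reformatted copies of a sample, so a real pipeline would probably need a fuzzier overlap measure.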
Check the performance of Llama-2 70B as an example:
We saw that large models like Llama-2 70B are very good at exploiting contaminated test samples to achieve higher scores.
For more info, check this repo and this pre-print.
Let me know if you have any plans for addressing the data contamination issue in the big-refactor branch.