EleutherAI / lm-evaluation-harness

A framework for few-shot evaluation of language models.
https://www.eleuther.ai
MIT License
6.91k stars 1.85k forks

[New Feature] Addressing Data Contamination in Evaluation Benchmarks #1056

Closed liyucheng09 closed 10 months ago

liyucheng09 commented 11 months ago

Data contamination is an urgent problem for the evaluation community. Concerns about contamination in LLM evaluations have been raised both in the media and by reviewers on OpenReview.

I have developed a tool to purify existing benchmarks and assess the true capability of LLMs.

More specifically, I use a search engine to categorise test samples into three subsets:

  1. clean
  2. input contaminated: the exact same question appears online.
  3. input-and-label contaminated: both question and answer appear online.

By testing LLMs on the clean set, we should be able to avoid the impact of memorisation on the evaluation. And by comparing performance on the clean and contaminated sets, we can get an impression of how seriously existing evaluations were affected by data contamination.
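The three categories above can be sketched roughly as follows. This is a minimal illustration, not the tool's actual implementation: the `classify` and `normalise` helpers and the `Contamination` enum are hypothetical names, the page texts are assumed to come from whatever search step is used, and the rule that the answer must appear on the same page as the question is an assumption.

```python
from enum import Enum
from typing import Iterable


class Contamination(Enum):
    CLEAN = "clean"
    INPUT = "input contaminated"
    INPUT_AND_LABEL = "input-and-label contaminated"


def normalise(text: str) -> str:
    """Lower-case and collapse whitespace so formatting differences do not hide a match."""
    return " ".join(text.lower().split())


def classify(question: str, answer: str, pages: Iterable[str]) -> Contamination:
    """Label one test sample from the text of retrieved pages (hypothetical input)."""
    q, a = normalise(question), normalise(answer)
    found_question = False
    for page in pages:
        body = normalise(page)
        if q in body:
            # Assumption: the answer counts only if it is on the same page as the question.
            if a in body:
                return Contamination.INPUT_AND_LABEL
            found_question = True
    return Contamination.INPUT if found_question else Contamination.CLEAN
```

Plain substring matching is the simplest possible choice here; very short answers (a single letter or digit) would match almost any page, so a real pipeline would likely need a stricter test.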

Check the performance of Llama-2 70B as an example:

Dataset    Condition                     Llama-2 70B
---------  ----------------------------  -----------
MMLU       Clean                         .6763
MMLU       Input-only contaminated       .6667 ↓
MMLU       Input-and-label contaminated  .7093 ↑
Hellaswag  Clean                         .7726
Hellaswag  Input-only contaminated       .8348 ↑
Hellaswag  Input-and-label contaminated  .8455 ↑
ARC        Clean                         .4555
ARC        Input-only contaminated       .5632 ↑
ARC        Input-and-label contaminated  .5667 ↑
Average    Clean                         .6348
Average    Input-only contaminated       .6882 ↑
Average    Input-and-label contaminated  .7072 ↑

These results suggest that a large model like Llama-2 70B is very good at exploiting contaminated test samples to achieve higher scores.
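As a quick check on the numbers above, the "Average" rows can be reproduced as an unweighted mean over the three datasets, and the gap to the clean set follows directly. The dictionary keys and the `macro_average` helper are illustrative names only; the accuracies are copied from the table.

```python
# Accuracies copied from the Llama-2 70B table above.
scores = {
    "MMLU": {"clean": 0.6763, "input_only": 0.6667, "input_and_label": 0.7093},
    "Hellaswag": {"clean": 0.7726, "input_only": 0.8348, "input_and_label": 0.8455},
    "ARC": {"clean": 0.4555, "input_only": 0.5632, "input_and_label": 0.5667},
}
conditions = ("clean", "input_only", "input_and_label")


def macro_average(condition: str) -> float:
    """Unweighted mean over datasets, ignoring how many samples each subset has."""
    return sum(per_dataset[condition] for per_dataset in scores.values()) / len(scores)


averages = {c: round(macro_average(c), 4) for c in conditions}
# Gain of each contaminated condition over the clean average.
gaps = {c: round(averages[c] - averages["clean"], 4) for c in conditions[1:]}

print(averages)
print(gaps)
```

The recomputed averages match the table (.6348, .6882, .7072), which puts the average gain over the clean set at roughly 5.3 points for input-only and 7.2 points for input-and-label contamination.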

For more info, check this repo and this pre-print.

Let me know if you have any plans for addressing the data contamination issue in the big-refactor branch.

liyucheng09 commented 11 months ago
dataset                                         version    mode    yi-6b-hf          -                              -                                        qwen-7b-hf        -                              -                                        llama-2-7b-hf     -                              -
----------------------------------------------  ---------  ------  ----------------  -----------------------------  ---------------------------------------  ----------------  -----------------------------  ---------------------------------------  ----------------  -----------------------------  ---------------------------------------
-                                               -          -       accuracy - clean  accuracy - input contaminated  accuracy - input-and-label contaminated  accuracy - clean  accuracy - input contaminated  accuracy - input-and-label contaminated  accuracy - clean  accuracy - input contaminated  accuracy - input-and-label contaminated
ceval-computer_network                          9b9417     ppl     54.55             50.00                          83.33                                    54.55             50.00                          50.00                                    18.18             0.00                           66.67
ceval-operating_system                          b2b8cf     ppl     57.14             nan                            80.00                                    42.86             nan                            60.00                                    21.43             nan                            40.00
ceval-computer_architecture                     1bd275     ppl     85.71             100.00                         100.00                                   85.71             100.00                         58.33                                    14.29             0.00                           41.67
ceval-college_programming                       2d0833     ppl     63.64             0.00                           85.71                                    45.45             0.00                           71.43                                    31.82             0.00                           14.29
ceval-college_physics                           fb7e04     ppl     50.00             25.00                          66.67                                    16.67             0.00                           11.11                                    33.33             75.00                          22.22
ceval-college_chemistry                         916b7d     ppl     47.62             100.00                         50.00                                    57.14             100.00                         50.00                                    19.05             0.00                           50.00
ceval-advanced_mathematics                      5cad2a     ppl     42.11             nan                            nan                                      21.05             nan                            nan                                      21.05             nan                            nan
ceval-probability_and_statistics                a6b30e     ppl     27.78             nan                            nan                                      33.33             nan                            nan                                      27.78             nan                            nan
ceval-discrete_mathematics                      68be68     ppl     42.86             0.00                           0.00                                     28.57             0.00                           0.00                                     21.43             100.00                         0.00
ceval-electrical_engineer                       056c2e     ppl     44.44             75.00                          80.00                                    22.22             50.00                          46.67                                    33.33             25.00                          26.67
ceval-metrology_engineer                        4a757a     ppl     75.00             50.00                          78.57                                    50.00             100.00                         78.57                                    25.00             0.00                           50.00
ceval-high_school_mathematics                   a8ed21     ppl     33.33             nan                            nan                                      16.67             nan                            nan                                      33.33             nan                            nan
ceval-high_school_physics                       e1fc86     ppl     66.67             50.00                          80.00                                    75.00             50.00                          60.00                                    8.33              50.00                          20.00
ceval-high_school_chemistry                     9021c6     ppl     56.25             nan                            100.00                                   75.00             nan                            33.33                                    12.50             nan                            66.67
ceval-high_school_biology                       c7f5a1     ppl     44.44             nan                            90.00                                    55.56             nan                            90.00                                    11.11             nan                            30.00
ceval-middle_school_mathematics                 213989     ppl     33.33             100.00                         100.00                                   26.67             100.00                         100.00                                   6.67              100.00                         33.33
ceval-middle_school_biology                     ce0420     ppl     100.00            nan                            90.91                                    100.00            nan                            90.91                                    40.00             nan                            54.55
ceval-middle_school_physics                     78f3af     ppl     57.14             100.00                         90.91                                    71.43             100.00                         90.91                                    42.86             0.00                           54.55
ceval-middle_school_chemistry                   d071d2     ppl     91.67             nan                            100.00                                   91.67             nan                            100.00                                   8.33              nan                            25.00
ceval-veterinary_medicine                       cd3a07     ppl     61.54             nan                            90.00                                    46.15             nan                            70.00                                    46.15             nan                            30.00
ceval-college_economics                         a35346     ppl     57.89             75.00                          62.50                                    52.63             50.00                          43.75                                    5.26              25.00                          40.62
ceval-business_administration                   69dd6a     ppl     61.54             100.00                         77.78                                    53.85             0.00                           66.67                                    30.77             50.00                          50.00
ceval-marxism                                   283ce0     ppl     100.00            100.00                         87.50                                    80.00             100.00                         75.00                                    40.00             0.00                           50.00
ceval-mao_zedong_thought                        f38cd1     ppl     100.00            nan                            94.44                                    66.67             nan                            72.22                                    33.33             nan                            55.56
ceval-education_science                         fbd65c     ppl     81.82             100.00                         82.35                                    72.73             100.00                         76.47                                    27.27             100.00                         52.94
ceval-teacher_qualification                     c77f1f     ppl     94.44             100.00                         95.65                                    83.33             50.00                          95.65                                    61.11             0.00                           21.74
ceval-high_school_politics                      bbac37     ppl     92.86             50.00                          100.00                                   100.00            100.00                         100.00                                   50.00             50.00                          33.33
ceval-high_school_geography                     730a30     ppl     63.64             nan                            100.00                                   81.82             nan                            87.50                                    36.36             nan                            37.50
ceval-middle_school_politics                    15b2d7     ppl     90.00             nan                            0.00                                     90.00             nan                            100.00                                   50.00             nan                            0.00
ceval-middle_school_geography                   b00167     ppl     100.00            100.00                         100.00                                   66.67             100.00                         100.00                                   0.00              0.00                           25.00
ceval-modern_chinese_history                    5a04cd     ppl     75.00             nan                            86.67                                    75.00             nan                            86.67                                    62.50             nan                            13.33
ceval-ideological_and_moral_cultivation         0829ff     ppl     80.00             nan                            100.00                                   100.00            nan                            78.57                                    80.00             nan                            42.86
ceval-logic                                     c9c394     ppl     86.67             nan                            57.14                                    60.00             nan                            57.14                                    33.33             nan                            14.29
ceval-law                                       cbd3c5     ppl     53.33             66.67                          83.33                                    26.67             0.00                           50.00                                    33.33             33.33                          50.00
ceval-chinese_language_and_literature           716ab3     ppl     46.15             100.00                         66.67                                    38.46             0.00                           55.56                                    30.77             0.00                           33.33
ceval-art_studies                               476114     ppl     71.43             nan                            68.42                                    64.29             nan                            63.16                                    28.57             nan                            52.63
ceval-professional_tour_guide                   70f30f     ppl     100.00            100.00                         88.24                                    90.00             50.00                          64.71                                    50.00             50.00                          29.41
ceval-legal_professional                        f19cf5     ppl     71.43             50.00                          71.43                                    57.14             50.00                          42.86                                    35.71             50.00                          28.57
ceval-high_school_chinese                       931614     ppl     83.33             nan                            75.00                                    91.67             nan                            75.00                                    33.33             nan                            25.00
ceval-high_school_history                       4d6364     ppl     75.00             66.67                          100.00                                   83.33             100.00                         100.00                                   50.00             66.67                          40.00
ceval-middle_school_history                     7f6356     ppl     90.91             100.00                         100.00                                   100.00            100.00                         100.00                                   18.18             0.00                           33.33
ceval-civil_servant                             a5dcb8     ppl     68.42             80.00                          58.82                                    73.68             80.00                          41.18                                    26.32             20.00                          17.65
ceval-sports_science                            192553     ppl     100.00            100.00                         55.56                                    62.50             50.00                          44.44                                    62.50             50.00                          22.22
ceval-plant_protection                          f7ff86     ppl     66.67             100.00                         88.89                                    66.67             0.00                           77.78                                    41.67             0.00                           33.33
ceval-basic_medicine                            a95a09     ppl     77.78             nan                            90.00                                    77.78             nan                            60.00                                    11.11             nan                            30.00
ceval-clinical_medicine                         664b54     ppl     78.57             0.00                           100.00                                   71.43             0.00                           42.86                                    14.29             0.00                           42.86
ceval-urban_and_rural_planner                   fdae6f     ppl     78.57             100.00                         53.33                                    64.29             0.00                           73.33                                    42.86             0.00                           20.00
ceval-accountant                                d810a1     ppl     70.59             71.43                          84.00                                    52.94             42.86                          68.00                                    35.29             28.57                          12.00
ceval-fire_engineer                             bb924d     ppl     41.67             0.00                           83.33                                    41.67             0.00                           72.22                                    25.00             0.00                           38.89
ceval-environmental_impact_assessment_engineer  d59200     ppl     76.19             50.00                          75.00                                    57.14             0.00                           75.00                                    33.33             0.00                           25.00
ceval-tax_accountant                            9e16f2     ppl     64.52             nan                            77.78                                    35.48             nan                            61.11                                    22.58             nan                            22.22
ceval-physician                                 0e90d5     ppl     79.17             100.00                         70.83                                    62.50             100.00                         58.33                                    29.17             100.00                         29.17
ceval-humanities                                -          ppl     74.42             75.00                          82.14                                    67.44             50.00                          70.54                                    37.98             41.67                          33.93
ceval-stem                                      -          ppl     53.70             57.14                          85.61                                    47.41             52.38                          67.63                                    23.70             33.33                          36.69
ceval-social-science                            -          ppl     81.60             84.62                          83.09                                    76.00             61.54                          72.79                                    36.80             30.77                          41.18
ceval-other                                     -          ppl     72.31             73.91                          75.00                                    58.46             39.13                          61.88                                    30.77             21.74                          25.00
ceval-hard                                      -          ppl     44.35             37.50                          70.00                                    41.13             25.00                          30.00                                    21.77             62.50                          30.00
ceval                                           -          ppl     67.32             71.01                          81.17                                    58.97             49.28                          67.82                                    30.46             30.43                          33.82

This is an example of what the results look like with contamination analysis supported.

Comparing performance on the clean and dirty (contaminated) sets gives a direct impression of how much models rely on memorisation rather than exhibiting true generalisation capability.
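One way to make that comparison per subject is sketched below, using three yi-6b-hf rows copied from the table above. The `gap` helper is a hypothetical name, and treating `nan` as "no samples fell into that subset" is an assumption about how the table was produced.

```python
import math
from typing import Optional

# (clean, input contaminated, input-and-label contaminated) accuracy for yi-6b-hf.
rows = {
    "ceval-computer_network": (54.55, 50.00, 83.33),
    "ceval-operating_system": (57.14, math.nan, 80.00),
    "ceval-advanced_mathematics": (42.11, math.nan, math.nan),
}


def gap(clean: float, contaminated: float) -> Optional[float]:
    """Contaminated minus clean accuracy, or None when either subset has no score."""
    if math.isnan(clean) or math.isnan(contaminated):
        return None
    return round(contaminated - clean, 2)


report = {
    subject: {"input": gap(clean, inp), "input_and_label": gap(clean, both)}
    for subject, (clean, inp, both) in rows.items()
}

for subject, gaps in report.items():
    print(subject, gaps)
```

Returning `None` rather than propagating `nan` keeps missing subsets out of any later averaging instead of silently turning the whole average into `nan`.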

haileyschoelkopf commented 10 months ago

Closing because, as discussed on Discord, the best way to integrate with this tool would be for it to upload cleaned/contaminated copies of a dataset to the HF Hub, so that one can then run the task pointing to that dataset copy.

With our --use_cache feature this should not expand inference runtimes at all.