TIGER-AI-Lab / MMLU-Pro

The code and data for "MMLU-Pro: A More Robust and Challenging Multi-Task Language Understanding Benchmark" [NeurIPS 2024]

Regex pattern in extract_final function. #7

Closed chigkim closed 4 months ago

chigkim commented 4 months ago

I noticed that evaluate_from_local.py has been updated with an extract_final function.

    pattern = r"[A-J](?=[^A-J]*$)"
    match = re.search(pattern, text)

Wouldn't this regex pattern take the last capital letter A-J anywhere in a response as the answer? For example, if a response says "..... A perfect answer cannot be found.", it would extract A as the answer, because that's the last capital letter between A and J in the response. Isn't it highly likely that every response has at least one capital letter between A and J somewhere? When I tested a model with this regex at the end of the extraction chain, it never triggered the random fallback x = random.randint(0, len(each["options"]) - 1).
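
Here's a quick check of that failure mode on a made-up refusal sentence (just an illustration, not code from the repo):

import re

text = "The question seems flawed. A perfect answer cannot be found."
pattern = r"[A-J](?=[^A-J]*$)"
match = re.search(pattern, text)
print(match[0])  # prints "A" even though the model never picked an option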

chigkim commented 4 months ago

Just throwing out an idea for a regex for the final extraction: r'\b[A-J]\b(?!.*\b[A-J]\b)'. Like the current pattern, it matches only the last [A-J] (via the negative lookahead), but only when the letter stands on its own between word boundaries.

text = "The answer might be C. Alszo Answer could be (F) It might be option  J. Dog is not an answer for sure."
# The current final regex matches D from "Dog."
pattern = r"[A-J](?=[^A-J]*$)"
# The proposed regex matches J from "option  J"
pattern = r'\b[A-J]\b(?!.*\b[A-J]\b)'
m = re.search(pattern,text)
print(m[0])

One caveat: the proposed regex will still match "A" in a sentence that starts with "A", e.g. "A solution is...". It's not ideal, but I think it's better than matching any last [A-J] anywhere, regardless of whether it's part of a word like "Dog". Any thoughts?
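
A quick illustration of that edge case (made-up sentence, just to show the behavior):

import re

text = "A solution is to look at the full explanation instead."
pattern = r"\b[A-J]\b(?!.*\b[A-J]\b)"
print(re.search(pattern, text)[0])  # prints "A" from the sentence-initial "A"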

chigkim commented 4 months ago

Sorry for the comment spam... :) I noticed that the ".*" in the lookahead doesn't match newlines by default, so adding re.DOTALL seems to work nicely:

pattern = r"\b[A-J]\b(?!.*\b[A-J]\b)"
match = re.search(pattern, text, re.DOTALL)
if match:
    return match[0]
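
For anyone curious, here's a small made-up example of why the flag matters:

import re

text = "Option B seems wrong.\nThe answer is C."
pattern = r"\b[A-J]\b(?!.*\b[A-J]\b)"
print(re.search(pattern, text)[0])             # "B": without DOTALL the lookahead stops at the newline
print(re.search(pattern, text, re.DOTALL)[0])  # "C": with DOTALL the lookahead scans the whole response
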
wenhuchen commented 4 months ago

These are great suggestions. Have you checked how much the results actually differ? We tried a few combinations and the final performance didn't change at all. It would be great to see more broadly how much the regex can impact the final score.

chigkim commented 4 months ago

I'm running some benchmarks right now, but I'm only working with an M3 Max with 64GB. My compute is pretty limited, so I'm only testing quants. Also, most people on r/LocalLLaMA are interested in quants rather than full precision. I also wonder if that's why you don't see much difference, benchmarking full precision instead of something like q8? Anyhow, I'll report back in a couple of days. :)

wenhuchen commented 4 months ago

I created a separate file to benchmark the outcomes of different regexes.

Please refer to https://github.com/TIGER-AI-Lab/MMLU-Pro?tab=readme-ov-file#benchmarking-answer-extraction. People can fiddle with compute_accuracy.py to see the impact.

chigkim commented 4 months ago

Thanks for the script! That's smart! I totally ignored the fact that it saves the model's responses, so I just reran the whole thing three times, lol. I modified your compute_accuracy.py to calculate the total, and the difference between single vs. triple extraction looks like about 2.03%, which is not huge, but it could be relevant when comparing similar models. If I run the final extraction with the word-boundary regex I suggested, the difference is about 2.49%.

D:\code\python\MMLU-Pro>python compute_accuracy.py llama3-8b-instruct-q8_0
Level 1 regex========================================
llama3-8b-instruct-q8_0\biology_result.json 0.6178521617852162
Level 2 regex========================================
llama3-8b-instruct-q8_0\biology_result.json 0.6596931659693166
Difference: 0.04184100418410042
Level 1 regex========================================
llama3-8b-instruct-q8_0\business_result.json 0.4296577946768061
Level 2 regex========================================
llama3-8b-instruct-q8_0\business_result.json 0.4309252217997465
Difference: 0.0012674271229404233
Level 1 regex========================================
llama3-8b-instruct-q8_0\chemistry_result.json 0.3129973474801061
Level 2 regex========================================
llama3-8b-instruct-q8_0\chemistry_result.json 0.33244916003536695
Difference: 0.01945181255526085
Level 1 regex========================================
llama3-8b-instruct-q8_0\computer science_result.json 0.3804878048780488
Level 2 regex========================================
llama3-8b-instruct-q8_0\computer science_result.json 0.4
Difference: 0.019512195121951237
Level 1 regex========================================
llama3-8b-instruct-q8_0\economics_result.json 0.5071090047393365
Level 2 regex========================================
llama3-8b-instruct-q8_0\economics_result.json 0.5225118483412322
Difference: 0.0154028436018957
Level 1 regex========================================
llama3-8b-instruct-q8_0\engineering_result.json 0.32094943240454077
Level 2 regex========================================
llama3-8b-instruct-q8_0\engineering_result.json 0.3364293085655315
Difference: 0.015479876160990724
Level 1 regex========================================
llama3-8b-instruct-q8_0\health_result.json 0.4792176039119804
Level 2 regex========================================
llama3-8b-instruct-q8_0\health_result.json 0.5354523227383863
Difference: 0.05623471882640585
Level 1 regex========================================
llama3-8b-instruct-q8_0\history_result.json 0.3910761154855643
Level 2 regex========================================
llama3-8b-instruct-q8_0\history_result.json 0.4068241469816273
Difference: 0.015748031496062964
Level 1 regex========================================
llama3-8b-instruct-q8_0\law_result.json 0.26430517711171664
Level 2 regex========================================
llama3-8b-instruct-q8_0\law_result.json 0.2742960944595822
Difference: 0.009990917347865558
Level 1 regex========================================
llama3-8b-instruct-q8_0\math_result.json 0.3545521835677276
Level 2 regex========================================
llama3-8b-instruct-q8_0\math_result.json 0.37083641746854185
Difference: 0.016284233900814238
Level 1 regex========================================
llama3-8b-instruct-q8_0\other_result.json 0.44696969696969696
Level 2 regex========================================
llama3-8b-instruct-q8_0\other_result.json 0.46320346320346323
Difference: 0.016233766233766267
Level 1 regex========================================
llama3-8b-instruct-q8_0\philosophy_result.json 0.3867735470941884
Level 2 regex========================================
llama3-8b-instruct-q8_0\philosophy_result.json 0.4288577154308617
Difference: 0.04208416833667333
Level 1 regex========================================
llama3-8b-instruct-q8_0\physics_result.json 0.3371824480369515
Level 2 regex========================================
llama3-8b-instruct-q8_0\physics_result.json 0.3579676674364896
Difference: 0.02078521939953809
Level 1 regex========================================
llama3-8b-instruct-q8_0\psychology_result.json 0.6090225563909775
Level 2 regex========================================
llama3-8b-instruct-q8_0\psychology_result.json 0.6190476190476191
Difference: 0.010025062656641603
Total level 1 0.40495386917130743
Total level 2 0.4253179286842324
Total Difference: 0.02036405951292497
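
For what it's worth, the totals above are computed over all questions pooled together (micro-averaged) rather than by averaging the per-category accuracies. A rough sketch of that kind of aggregation is below; the two extraction functions just stand in for the script's level 1 / level 2 regexes, and the JSON field names (model_outputs, answer) are assumptions for illustration, not necessarily the exact schema compute_accuracy.py uses.

import glob
import json
import re

def extract_level1(text):
    # Strict layer: "answer is (X)"
    m = re.search(r"answer is \(?([ABCDEFGHIJ])\)?", text)
    return m.group(1) if m else None

def extract_level2(text):
    # Level 1 plus the word-boundary fallback discussed above
    ans = extract_level1(text)
    if ans:
        return ans
    m = re.search(r"\b[A-J]\b(?!.*\b[A-J]\b)", text, re.DOTALL)
    return m.group(0) if m else None

def total_accuracy(files, extract):
    correct, total = 0, 0
    for path in files:
        with open(path) as f:
            for entry in json.load(f):                   # assumed: a list of per-question records
                pred = extract(entry["model_outputs"])   # assumed field name
                correct += int(pred == entry["answer"])  # assumed field name
                total += 1
    return correct / total

files = glob.glob("llama3-8b-instruct-q8_0/*_result.json")
print("Total level 1:", total_accuracy(files, extract_level1))
print("Total level 2:", total_accuracy(files, extract_level2))
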
wenhuchen commented 4 months ago

Would you mind making a pull request to merge it into our scripts?

chigkim commented 4 months ago

Sure, done: https://github.com/TIGER-AI-Lab/MMLU-Pro/pull/8

chigkim commented 4 months ago

Out of curiosity, I ran some comparison tests on llama3-8b-instruct-q8_0. To isolate each parameter, I ran 4 different tests and then extracted the scores for the different regex patterns using compute_accuracy.py.

It looks like the single_chat format has a dramatic impact on the score. The answer-key extraction method with different regex patterns seems to have a smaller impact.

Single_chat packs everything, including the ICL examples and the question, into one user message, the way the old run_gpt4o.py does. Multi_chat splits the ICL examples into a multi-turn conversation.
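
To make the difference concrete, here's a simplified sketch of how the two formats might build the chat messages (placeholder contents, not the fork's exact code):

system_prompt = "The following are multiple-choice questions. Think step by step."  # placeholder
icl_examples = [("Example question 1 ...", "The answer is (B)."),
                ("Example question 2 ...", "The answer is (D).")]                   # placeholder ICL pairs
question = "Actual test question with options A-J ..."                             # placeholder

# single_chat: ICL examples and the question concatenated into one user message (like the old run_gpt4o.py)
icl_text = "\n\n".join(q + "\n" + a for q, a in icl_examples)
single_chat = [
    {"role": "system", "content": system_prompt},
    {"role": "user", "content": icl_text + "\n\n" + question},
]

# multi_chat: each ICL example becomes its own user/assistant turn before the real question
multi_chat = [{"role": "system", "content": system_prompt}]
for q, a in icl_examples:
    multi_chat += [{"role": "user", "content": q}, {"role": "assistant", "content": a}]
multi_chat.append({"role": "user", "content": question})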

System Prompts:

Settings (rows):

  1. System prompt 1, Temperature 0.0, Multi_chat
  2. System prompt 1, Temperature 0.1, Multi_chat
  3. System prompt 2, Temperature 0.0, Multi_chat
  4. System prompt 1, Temperature 0.0, Single_chat

Regex patterns (columns):

  1. Single layer: r"answer is \(?([ABCDEFGHIJ])\)?"
  2. Double layers including regex 1 and: r".*[aA]nswer:\s*\(?([A-J])\)?"
  3. Triple layers including regexes 1+2 and: r"[A-J](?=[^A-J]*$)"
  4. Triple layers including regexes 1+2 and: r"\b[A-J]\b(?!.*\b[A-J]\b)"

Settings       Regex 1   Regex 2   Regex 3   Regex 4
Latest         40.60     40.62     42.65     43.07
Temp 0.1       40.46     40.46     42.16     42.82
Prompt 2       40.62     40.61     42.75     43.06
Single_chat    21.01     21.02     21.49     21.89
wenhuchen commented 4 months ago

This is awesome! Thanks for benchmarking these! Would it be possible to write a separate Reddit post summarizing all of your findings? I can add a link on my GitHub repo to help people understand the differences.

chigkim commented 4 months ago

Done! https://www.reddit.com/r/LocalLLaMA/comments/1e4eyoi/mmlu_pro_how_different_parameters_and_regex/

Thank you @Wyyyb and @wenhuchen for helping me understand MMLU Pro better!

The main goal of my personal fork was to make it easy to benchmark the differences between various quantizations. I posted it out of excitement without much thought, but it appears to be the first benchmark tool other than a perplexity test that lets people on the sub use whatever inference tool they prefer and benchmark open-source models without much ML or coding background. Although I initially chose the wrong script to modify and faced some challenges, it turned out to be a valuable learning experience in the end!