Different Setup for Different Models?

chigkim commented 4 months ago

Hi,

I realized that different scripts have different setup for different models. Wouldn't this lead to inconsistent test results?

Sampling parameters:

GPT-4o: temperature=0.1 and top_p=1.0
Gemini: temperature=0.0 and top_p=0.95
Claude-3: temperature=0.0 and top_p=1.0

System prompt:

GPT-4o with OpenAI: You are an knowledge expert, you are supposed to answer the multi-choice question to derive your final answer as The answer is ....
GPT-4 with AzureOpenAI: The following are multiple choice questions (with answers) about {subject}. Think step by step and then output the answer in the format of "The answer is (X)" at the end.
Gemini: Finish your answer with Final Answer: (X) where X is the correct letter choice. If none or more than one of the options match, choose the one that is the closest.
vllm: The following are multiple choice questions (with answers) about {subject}. Think step by step and then finish your answer with "the answer is (X)" where X is the correct letter choice.
Claude-3: No system prompt

Regex to extract answers:

GPT-4o: single extraction, r"answer is $?([ABCDEFGHIJ])$?"
gemini: double extractions, r"(Answer:|answer is)\s*$?([ABCDEFGHIJ])$?", r' (:|is)\s*$?([ABCDEFGHIJ])$?\b'
vllm: double extractions, r"answer is $?([ABCDEFGHIJ])$?", r'.*[aA]nswer:\s*([A-J])'
GPT-4 with AzureOpenAI: triple extractions, r"answer is $?([A-J])$?", r'.*[aA]nswer:\s*([A-J])', r"[A-J](?=[^A-J]*$)"

Are these scripts on this repo the same scripts to produce the result on the MMLU Pro paper?

Wyyyb commented 4 months ago

Thank you for your question. First, regarding sampling parameters and system prompts, we recommend using the settings in evaluate_from_api.py and evaluate_from_local.py found in our git repository, as these are consistent with the results reported in our paper. For results related to closed-source models like GPT-4o, Claude-3, and Gemini, there are slight variations since they were not run concurrently by the same collaborator. However, we have conducted sampling tests and found that the impact on the results is minimal, not exceeding 1%. Our paper also highlights the robustness of MMLU-Pro, so we opted not to rerun everything from a cost-saving perspective. If anyone has completely rerun the tests using the evaluate_from_api.py settings, we welcome you to share your results with us.

Regarding your question about the answer extraction regex, it is indeed an important issue. For high-performing models like GPT-4o and Gemini, the impact is minimal, but for smaller-scale models, it can be more significant. We are planning to introduce answer extraction regexes with higher recall and will standardize and re-extract answers accordingly.

chigkim commented 4 months ago

Thanks for the response. Even evaluate_from_api.py and evaluate_from_local.py have different prompts and sampling parameters. Which one should we use to match the paper or MMLU Pro HF space?

evaluate_from_api.py: Prompt: The following are multiple choice questions (with answers) about {subject}. Think step by step and then output the answer in the format of \"The answer is (X)\" at the end.\n\n Sampling: temperature 0, top_p = 1

evaluate_from_local.py: Prompt: The following are multiple choice questions (with answers) about {subject}. Think step by step and then finish your answer with "the answer is (X)" where X is the correct letter choice. Sampling: temperature=0, no top_p is specified.

Wyyyb commented 4 months ago

To align with the paper and leaderboard, use evaluate_from_local.py for open-source models and evaluate_from_api.py for proprietary models.

chigkim commented 4 months ago

Thank you for the clarification! I'm hoping you could help me with one more question. The script evaluate_from_local.py only specifies temperature=0.0 without top_p whereas evaluate_from_api.py specifies both temperature=0.0 and top_p 1.0. It looks like evaluate_from_local.py uses VLLM, and VLLM sampling_params.py seems to use top_p=1.0 as default. Thus, if not specified, I assume top_p=1.0 will be used? Then, here is my question. Could you help me to understand why setting top_p to 1.0 is better in the context of MMLU Pro benchmark than using smaller value like 0.01 (or even 0.0) for more deterministic response? Again, thanks so much for your help!

Wyyyb commented 4 months ago

Smaller top_p values may restrict the model's response options primarily to the most probable choices, potentially excluding correct yet less obvious answers. Setting top_p=1.0 enables the model to explore a broader spectrum of potential responses, thereby reducing the chances of overlooking accurate but less likely outputs. Additionally, we use a small temperature value to help ensure the consistency and reproducibility of the results.

chigkim commented 4 months ago

Thanks for the explanation! :)

TIGER-AI-Lab / MMLU-Pro

Different Setup for Different Models? #5