TIGER-AI-Lab / MMLU-Pro

The code and data for "MMLU-Pro: A More Robust and Challenging Multi-Task Language Understanding Benchmark" [NeurIPS 2024]
Apache License 2.0

Chat template for instruct models for local eval #1

Closed gnalbandyan closed 5 months ago

gnalbandyan commented 6 months ago

Hi, thanks for open-sourcing the dataset. In evaluate_from_local.py the few-shot prompt is created as a single string and tokenized directly. But the HF tokenizer has a chat_template, as demonstrated in the LLama3-70B HF README, which lets us build the few-shot prompt as a multi-turn (user->assistant) chat. Is there a reason this is not used? Do you know the difference in the final metric with and without the chat template? Thanks.

Wyyyb commented 5 months ago

This is a great question. Some chat models are fine-tuned with a chat template during alignment, so the model sees interactions like:

User: I have an instruction now, here's an example: input A, output A
Assistant: OK
User: Here's another example, input B, output B
Assistant: OK
User: input C, what's the output of input C?

In our few-shot setup, we include the examples directly in the input text. We do this to keep consistency across chat models, base models, and instruct models that do not support multi-turn dialogue, which makes them easier to compare. However, we have not run experiments measuring the difference in final metrics with and without the chat template. We leave this to future work.
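To make the contrast concrete, here is a minimal sketch of the two prompting styles being discussed. The helper names and the `Q:`/`A:` formatting are illustrative assumptions, not the actual format used in evaluate_from_local.py; the message-list shape is the standard one accepted by `tokenizer.apply_chat_template` in HF transformers.

```python
# Hypothetical sketch contrasting the two few-shot prompt styles.
# The function names and Q:/A: formatting are made up for illustration;
# they do not reproduce the exact prompt format in evaluate_from_local.py.

def build_single_string_prompt(examples, question):
    """Concatenate few-shot examples into one plain string
    (the style currently used for local evaluation)."""
    parts = [f"Q: {inp}\nA: {out}" for inp, out in examples]
    parts.append(f"Q: {question}\nA:")
    return "\n\n".join(parts)

def build_chat_messages(examples, question):
    """Express the same few-shot examples as multi-turn chat messages,
    in the list-of-dicts shape that tokenizer.apply_chat_template expects."""
    messages = []
    for inp, out in examples:
        messages.append({"role": "user", "content": inp})
        messages.append({"role": "assistant", "content": out})
    messages.append({"role": "user", "content": question})
    return messages

examples = [("input A", "output A"), ("input B", "output B")]
prompt = build_single_string_prompt(examples, "input C")
messages = build_chat_messages(examples, "input C")
```

With a chat model, `messages` would then be rendered with `tokenizer.apply_chat_template(messages, add_generation_prompt=True)`; the single-string `prompt` is tokenized directly. Whether the two yield different benchmark scores is exactly the open question above.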