EleutherAI / lm-evaluation-harness

A framework for few-shot evaluation of language models.
https://www.eleuther.ai
MIT License

Support wrapping prompts with a given Chat Template #1098

Closed haileyschoelkopf closed 4 months ago

haileyschoelkopf commented 10 months ago

LM Evaluation Harness has historically been used and designed around evaluation of base language models in the few-shot and zero-shot settings.

Given the substantial interest in evaluating chat models using lm-eval, an important feature would be allowing one to take the prompt in a benchmark, wrap it with a given model's chat template, and evaluate the model on that formatted input.

There are a few considerations that will come up in how we implement this:

An adjacent feature to this would be considering the addition of more flexible / permissive answer extraction.

Chat templating is a fairly high-priority addition to the library based on user requests and feedback, but we want to make sure we add it in a way that doesn't hurt reproducibility, is easy to maintain, is usable by non-power-users, and is generally easy to work with.

baberabb commented 10 months ago

I thought HF did a great job in explaining how their chat templating works.
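
For reference, HF's chat templating is applied through the tokenizer's apply_chat_template method. A minimal sketch (the checkpoint name below is only an example; any model whose tokenizer config ships a chat_template works):

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("HuggingFaceH4/zephyr-7b-beta")

messages = [{"role": "user", "content": "What is the capital of France?"}]

# tokenize=False returns the formatted prompt string;
# add_generation_prompt=True appends the header that cues the assistant reply.
prompt = tokenizer.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)
print(prompt)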

haileyschoelkopf commented 10 months ago

My previous hangup with HF's chat templating support was that it was new and wasn't clear if people were going to use it. But I looked and quite a few notable finetunes seem to use the feature, so I think using HF's setup may be the way to go.

StellaAthena commented 10 months ago

My previous hangup with HF's chat templating support was that it was new and wasn't clear if people were going to use it. But I looked and quite a few notable finetunes seem to use the feature, so I think using HF's setup may be the way to go.

I concur

baberabb commented 10 months ago
  • How should we allow for users to specify a system prompt, if at all?

IMO we probably should allow users to pass system prompts. The Llama-2 paper provides a lot of detail about the different prompts they used (for human evaluation, though).

[Screenshot: system prompt examples from the Llama-2 paper]

The default behaviour of the HF tokenizer seems to be to apply the chat template to all segments and tokenize in one go. This question might be more relevant for multi-turn evaluations.

anjor commented 9 months ago

There is this effort to handle the different chat formats and conversions between them -- https://github.com/deployradiant/pychatml

lewtun commented 8 months ago

This feature would be great for benchmarks like IFEval, which expect a chat-formatted input!

Regarding chat templates in transformers models, these can be inferred from the chat_template attribute of the tokenizer config, so one possible way to do this would be to apply it directly if it exists, or to allow the user to override it by passing the Jinja string in a --chat_template arg.

For the system prompt, I suppose a --system_prompt arg could be added which then inserts a system role/content into the messages.

Finally, for few-shot I believe the recommended formatting is to provide examples as alternating user/assistant pairs like this:

[
    {"role": "user", "content": "fewshot_prompt_1"},
    {"role": "assistant", "content": "fewshot_answer_1"},
    {"role": "user", "content": "fewshot_prompt_2"},
    {"role": "assistant", "content": "fewshot_answer_2"},
    {"role": "user", "content": "final_prompt"}
]

Which in ChatML would produce something like

<|im_start|>user
fewshot_prompt_1<|im_end|>
<|im_start|>assistant
fewshot_answer_1<|im_end|>
<|im_start|>user
fewshot_prompt_2<|im_end|>
<|im_start|>assistant
fewshot_answer_2<|im_end|>
<|im_start|>user
final_prompt<|im_end|>
<|im_start|>assistant
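
A minimal sketch of how those pieces could fit together with the tokenizer; the system-prompt handling and the checkpoint name are illustrative assumptions, not the harness's actual implementation:

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("HuggingFaceH4/zephyr-7b-beta")  # illustrative

# e.g. taken from a --system_prompt arg; skipped entirely when not provided,
# since some chat templates reject a system role.
system_prompt = "You are a helpful assistant."

messages = [{"role": "system", "content": system_prompt}] if system_prompt else []
messages += [
    {"role": "user", "content": "fewshot_prompt_1"},
    {"role": "assistant", "content": "fewshot_answer_1"},
    {"role": "user", "content": "final_prompt"},
]

# add_generation_prompt=True leaves the final assistant header open
# (the trailing <|im_start|>assistant in the ChatML example above).
prompt = tokenizer.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)
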
daniel-furman commented 8 months ago

@lewtun see branch “add_chat_templating”. We are implementing with the HF tokenizer. One problem is figuring out how to pick out the few shot examples for different kinds of tests - at the moment the code just wraps the whole context as one user message, but I agree it would be ideal to have each few shot example as a new user/assistant entry in the list of messages. I’m trying this on my own fork in a hacky way just to see what the results are for specific tests. Good tip on the IFEval I will check that out! CC @haileyschoelkopf
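
Purely for illustration, one hacky way to split an already-concatenated few-shot context back into alternating turns; the separators below are assumptions that vary by task, and this is not the code on the branch:

def context_to_messages(context, example_sep="\n\n", answer_sep="\nAnswer:"):
    # Split a concatenated few-shot context into alternating chat turns.
    # example_sep / answer_sep are illustrative defaults; real tasks use
    # different delimiters, which is what makes a general solution hard.
    messages = []
    for example in context.split(example_sep):
        if not example.strip():
            continue
        question, _, answer = example.partition(answer_sep)
        messages.append({"role": "user", "content": question.strip()})
        if answer.strip():
            messages.append({"role": "assistant", "content": answer.strip()})
    return messages  # the last element is the unanswered final user turn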

daniel-furman commented 8 months ago

@haileyschoelkopf is IFEval in the harness yet (https://arxiv.org/pdf/2311.07911.pdf)? Not seeing it after running:

!lm-eval --tasks list

haileyschoelkopf commented 8 months ago

It is, but it requires installing the ifeval extra first!

(We should see about making a clearer listing for this case.)

lewtun commented 8 months ago

@lewtun see branch “add_chat_templating”. We are implementing with the HF tokenizer. One problem is figuring out how to pick out the few shot examples for different kinds of tests - at the moment the code just wraps the whole context as one user message, but I agree it would be ideal to have each few shot example as a new user/assistant entry in the list of messages. I’m trying this on my own fork in a hacky way just to see what the results are for specific tests. Good tip on the IFEval I will check that out! CC @haileyschoelkopf

Very nice! Feel free to ping me or @Rocketknight1 (the creator of the templating system in transformers) if you need any help 🤗

daniel-furman commented 8 months ago

@lewtun @haileyschoelkopf I'm currently running a test with Mixtral-8x7B-Instruct on IFEval with and without chat templating, will report back my results tomorrow morning.

First element before prompt formatting...
('Write a 300+ word summary of the wikipedia page "https://en.wikipedia.org/wiki/Raymond_III,_Count_of_Tripoli". Do not use any commas and highlight at least 3 sections that has titles in markdown format, for example *highlighted section part 1*, *highlighted section part 2*, *highlighted section part 3*.', {'until': [], 'do_sample': False, 'temperature': 0.0, 'max_gen_toks': 1280})
First element after prompt formatting...
('<s>[INST] Write a 300+ word summary of the wikipedia page "https://en.wikipedia.org/wiki/Raymond_III,_Count_of_Tripoli". Do not use any commas and highlight at least 3 sections that has titles in markdown format, for example *highlighted section part 1*, *highlighted section part 2*, *highlighted section part 3*. [/INST]', {'until': [], 'do_sample': False, 'temperature': 0.0, 'max_gen_toks': 1280})

daniel-furman commented 8 months ago

^ results for above are super promising!

W/o prompt formatting:

[results table: Tasks / Version / Filter / n-shot / ...]

W/ prompt formatting:

[results table: Tasks / Version / Filter / n-shot / ...]

!lm_eval --model hf \
    --model_args=pretrained=mistralai/Mixtral-8x7B-Instruct-v0.1,dtype="bfloat16",load_in_4bit=True,use_chat_template=True \
    --tasks ifeval \
    --batch_size 2 \
    --output_path /content/output/Mixtral-8x7B-Instruct-v0.1 \
    --log_samples \
    --device cuda:0 \
    --num_fewshot 0

Changing use_chat_template to True/False is the only change between runs!

haileyschoelkopf commented 8 months ago

That's fantastic to hear @daniel-furman !

I'm sorry for the delay in pushing this feature forward (and the fact that the system prompt in this branch is not yet configurable).

The major blocker to getting this merged is that, in my testing, scores were worse for models evaluated on tasks like arc_easy because the whitespace needs to change when the input is wrapped in a chat template. We need to handle this in a way that is intuitive for the user, without overly complicating the task config files. Additionally, we still need to decide how to configure tasks' prompts such that one can have a "trigger" like Let's think step by step. attached to the beginning of the final model chat turn.
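
To make the whitespace issue concrete: loglikelihood tasks score context + continuation, and targets conventionally carry a leading space; once the context ends in a chat template's special tokens, that convention no longer lines up. A rough illustration (the strings are made up):

# Base-model request, arc_easy-style: the target begins with a space.
context = "Question: Which gas do plants absorb from the air?\nAnswer:"
continuation = " carbon dioxide"
scored = context + continuation   # "...Answer: carbon dioxide"

# A chat-templated context ends with the template's own tokens instead, e.g.
# "... [/INST]" or "<|im_start|>assistant\n". Naively appending the same
# " carbon dioxide" now yields spacing the model never saw in training,
# which is enough to shift loglikelihoods and therefore accuracy.
templated_context = "<s>[INST] Which gas do plants absorb from the air? [/INST]"
scored_templated = templated_context + continuation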

daniel-furman commented 8 months ago

@haileyschoelkopf No worries! This IFEval result is the first positive swing I have seen in my testing.

A couple of thoughts:
a) Wouldn't we want that trigger at the end of each model chat turn?
b) I think the biggest blocker is getting the few-shots to act as new user/assistant messages as per @lewtun's suggestion above (perhaps this is captured in your update above, but just confirming / putting it into new language).
c) @lewtun's change in code review will fix the system prompt configurability.
d) I found one other small change necessary for the generate_until helper.

daniel-furman commented 8 months ago

@haileyschoelkopf @lewtun I went a little overboard testing IFEval this weekend...

[Image: table of IFEval results across models, with and without chat templating]

Experiments were executed with models in half precision (bfloat16), on a workstation equipped with 2x H100 80 GB SXM5 GPUs, using my fork of the lm-eval package at hash 0c0c314c0df4c10f35bf7c17dc80f745f8027e9b.

More details can be found here @ towardsdatascience.com/evaluations-with-chat-formats-7604067023c9

CC @Rocketknight1 @clefourrier

clefourrier commented 8 months ago

Btw, the user/assistant turns are something we added in lighteval here, so if you want to reuse this snippet in the harness feel free to do so :) (the finished logic is in this PR)

clefourrier commented 8 months ago

Super happy to see the feature coming to the harness! It's been highly requested and it will really be interesting to see how it changes rankings on the leaderboard :fire: (@daniel-furman 's analysis already seems to show it would shuffle things quite a bit :smiling_imp: )

baberabb commented 8 months ago

Btw, the user/assistant turns are something we added in lighteval here, so if you want to reuse this snippet in the harness feel free to do so :) (the finished logic is in this PR)

The alternation between "user" and "assistant" for the few-shot text and targets does seem more intuitive! It would require a clearer delineation between each example though, which many tasks don't have right now. We would also probably need to push the few-shot construction down to the model level rather than sending it through as one big blob.
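
One way that pushing few-shot construction down to the model level could look: the task supplies structured (question, target) pairs and the model-side code decides whether to join them into one string or emit chat turns. The names here are hypothetical, not the harness API:

def build_messages(fewshot_pairs, final_question, system_prompt=None):
    # fewshot_pairs: list of (question, target) tuples handed over by the task
    # layer instead of a single pre-joined context string.
    messages = [{"role": "system", "content": system_prompt}] if system_prompt else []
    for question, target in fewshot_pairs:
        messages.append({"role": "user", "content": question})
        messages.append({"role": "assistant", "content": target})
    messages.append({"role": "user", "content": final_question})
    return messages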

Rocketknight1 commented 8 months ago

Hey @daniel-furman that table is great! Can I use it in a tweet about chat templates?

daniel-furman commented 8 months ago

@Rocketknight1 you betcha, go for it! Just to note - this is the first eval where I am seeing a positive swing from chat templating. The Open LLM Leaderboard evals are showing a regression after applying templates - pending further development to pinpoint the cause of that result.

Rocketknight1 commented 8 months ago

That's surprising! I would expect that chat models should work much better in basically all cases when their correct template is used

daniel-furman commented 8 months ago

Those results are pending fixes to spacing issues on the continuation, the few-shot setup, and testing of more models. @haileyschoelkopf was seeing the same dips on her initial tests. There's a reason I picked IFEval :).

lewtun commented 8 months ago

Those results are pending fixes to spacing issues on the continuation, the few-shot setup, and testing of more models. @haileyschoelkopf was seeing the same dips on her initial tests. There's a reason I picked IFEval :).

@clefourrier do you also see performance drops when evaluating chat models with templates in lighteval?

clefourrier commented 7 months ago

@lewtun Yep! But I'll be able to do a more in-depth and comparable analysis once we integrate ifeval, to see if we observe the same thing.

monk1337 commented 6 months ago

@daniel-furman Awesome, is there a way I can define a custom template? I am using your forked lm_eval repo.

clefourrier commented 4 months ago

@haileyschoelkopf maybe we can close this discussion now that the feature has been added? :)

djstrong commented 3 months ago

@daniel-furman @haileyschoelkopf @KonradSzafer I have tested some models without a chat template, with a chat template, and with a chat template applied multi-turn. Results are published on the LB: https://huggingface.co/spaces/speakleash/open_pl_llm_leaderboard

Comparison sheet: https://docs.google.com/spreadsheets/d/1b5s0LuhAQLbtzexxasLBA47gQ-R3HOuIKFRfIgJflf0/edit?usp=sharing

There are average scores for multiple_choice and generate_until tasks for 0-shot and 5-shot.

For 5-shot, the scores mostly drop when chat templates are used. For 0-shot it is better, especially for multiple_choice tasks.