yafuly opened this issue 10 months ago
Hi! What version of the codebase are you using? We recently merged a fix in #1268 that affected all generation-based tasks with HF models, which might be related to what you're seeing.
I would not expect a score this low. However, Llama 1 and 2 use prompts that are mostly undisclosed for their evaluations, so it's tricky to reproduce them without knowing the setup.
To improve performance, you could try adding `description: "Answer these questions:\n\n"` and seeing what the performance is after that. I've observed that adding a description especially helps in the zero-shot setting, and helps somewhat less in the few-shot setting.
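For what it's worth, here is a rough illustration (not harness internals, just plain string assembly) of what the zero-shot request reads like once the description is prepended to the rendered `doc_to_text`:

```python
# Rough illustration only: how a zero-shot TriviaQA request reads once the
# `description` is prepended to the rendered doc_to_text template. The
# example question and the .format()-style placeholder are just for
# illustration; the harness uses Jinja templating internally.
description = "Answer these questions:\n\n"
doc = {"question": "What shape is farfalle pasta"}
doc_to_text = "Question: {question}?\nAnswer:"

prompt = description + doc_to_text.format(**doc)
print(prompt)
# Answer these questions:
#
# Question: What shape is farfalle pasta?
# Answer:
```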
Hi, I tested it on my side. For llama2-7b-0-shot, these are the results I got:
hf (pretrained=meta-llama/Llama-2-7b-hf), gen_kwargs: (None), limit: None, num_fewshot: None, batch_size: auto

| Tasks | Version | Filter | n-shot | Metric | Value |   | Stderr |
|---|---|---|---|---|---|---|---|
| triviaqa | 3 | remove_whitespace | 0 | exact_match | 0.5247 | ± | 0.0037 |
After adding the description `"Answer these questions:\n\n"`:
hf (pretrained=meta-llama/Llama-2-7b-hf), gen_kwargs: (None), limit: None, num_fewshot: None, batch_size: auto

| Tasks | Version | Filter | n-shot | Metric | Value |   | Stderr |
|---|---|---|---|---|---|---|---|
| triviaqa | 3 | remove_whitespace | 0 | exact_match | 0.584 | ± | 0.0037 |
Still a little lower than the 65.8 reported in the paper.
BTW, I think if you want to evaluate a chat model, you should provide the right template. For a base model, you can just use the default.
This is the YAML file:
```yaml
task: triviaqa
dataset_path: trivia_qa
dataset_name: rc.nocontext
output_type: generate_until
training_split: train
validation_split: validation
description: "Answer these questions:\n\n"
doc_to_text: "Question: {{question}}?\nAnswer:"
doc_to_target: "{{answer.aliases}}"
should_decontaminate: true
doc_to_decontamination_query: question
generation_kwargs:
  until:
    - "\n"
    - "."
    - ","
  do_sample: false
  temperature: 0.0
filter_list:
  - name: remove_whitespace
    filter:
      - function: remove_whitespace
      - function: take_first
target_delimiter: " "
metric_list:
  - metric: exact_match
    aggregation: mean
    higher_is_better: true
    ignore_case: true
    ignore_punctuation: true
metadata:
  version: 3.0
```
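For reference, the run above corresponds roughly to this Python-API call (a sketch; I'm assuming the v0.4-era `lm_eval.simple_evaluate` entry point, so double-check argument names against your installed version):

```python
# Sketch of the run above via the harness's Python API. Assumes a v0.4-style
# `lm_eval.simple_evaluate`; argument names may differ across versions.
import lm_eval

results = lm_eval.simple_evaluate(
    model="hf",
    model_args="pretrained=meta-llama/Llama-2-7b-hf",
    tasks=["triviaqa"],
    batch_size="auto",
)

# Per-task scores live under results["results"]; the exact metric key
# (e.g. "exact_match,remove_whitespace") depends on the configured filter name.
print(results["results"]["triviaqa"])
```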
Hi,
Thanks for your support @haileyschoelkopf @Hannibal046
Your empirical results are extremely helpful :) @Hannibal046
I believe the performance discrepancy may be attributed to two factors. Firstly, I mistakenly used an incorrect template that did not include system prompts for chat models. Secondly, it appears that the paper's results were obtained from a different subset of TriviaQA, the TriviaQA wiki split.
Apart from modifying the `doc_to_text` section of the YAML file, are there any other methods to integrate system prompts or customized prompts to support chat LLMs?
@yafuly There is work-in-progress support in this PR: https://github.com/EleutherAI/lm-evaluation-harness/pull/1287, which you're welcome to take a look at! (Currently the `system_prompt` CLI arg does not hook up properly, though...)
I hope to continue it very soon to get it added to the library, but I'm observing some decreased performance on the model I tested, so I believe more handling will be needed on our end to determine the right way to format chat-templated + system-prompted evaluations (with respect to whitespace, and to where each prompt component is placed in a chat).
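To make the whitespace point concrete, here is a small sketch (my own illustration, not harness code) of why it matters where the delimiter space lands once a chat template rearranges the prompt pieces:

```python
# Sketch of the whitespace concern: tokenizing "prompt + target" as one
# string vs. as two concatenated pieces can produce different token
# sequences, which is why delimiter/whitespace placement matters when a
# chat template re-arranges the prompt components.
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("meta-llama/Llama-2-7b-hf")

joint = tok.encode("Answer: Paris", add_special_tokens=False)
split = tok.encode("Answer:", add_special_tokens=False) + \
        tok.encode(" Paris", add_special_tokens=False)

print(joint)
print(split)  # these two token-ID sequences need not be identical
```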
Can one map the `description:` tag to a function that calls the model's tokenizer by using, e.g., `tokenizer.apply_chat_template`?
> Can one map the `description:` tag to a function that calls the model's tokenizer by using, e.g., `tokenizer.apply_chat_template`?

This should be the behavior of a HuggingFace LM. If that's not how the draft PR functions, it's either a bug or Hailey hasn't finished it yet.
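For illustration, that mapping could look roughly like this (a sketch using transformers' `apply_chat_template`, not what the draft PR actually does):

```python
# Sketch only (not the draft PR's implementation): wrap the task
# `description` as a system message and the rendered doc_to_text as a user
# message, then let the model's own chat template handle the formatting.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-2-7b-chat-hf")

messages = [
    {"role": "system", "content": "Answer these questions:"},
    {"role": "user", "content": "Question: Who wrote the novel Dracula?\nAnswer:"},
]
prompt = tokenizer.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)
print(prompt)  # [INST] <<SYS>> ... <</SYS>> ... [/INST] for Llama-2-chat
```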
I'm similarly trying to replicate the Llama 2 7B results on TriviaQA and hitting a gap between the scores from the eval harness vs. the Llama 2 paper in the 5-shot setting (eval harness: 64.08% vs. paper: 72.10%). Lmk if you have any ideas on where the gap might be coming from!
Update: my current best guess (credit to @yafuly for noting it) is that the gap is because Llama 2 reports results for the wiki eval set. I'll try evaluating with the rc.wikipedia.nocontext version tomorrow and see if that solves it.
Tried rc.wikipedia.nocontext with n-shot = 5, and it gave a +2% boost but there's still a 6% gap. Will continue to update as I search for the gap.
| Tasks | Version | Filter | n-shot | Metric |   | Value |   | Stderr |
|---|---|---|---|---|---|---|---|---|
| triviaqa | 3 | remove_whitespace | 5 | exact_match | ↑ | 0.6407 | ± | 0.0036 |
| triviaqa_wiki | 3 | remove_whitespace | 5 | exact_match | ↑ | 0.6627 | ± | 0.0053 |
Also tried modifying the filters to use the official repo's normalize-answer function (a rough sketch of that normalization follows the table below), which caused the scores to drop:
| Tasks | Version | Filter | n-shot | Metric |   | Value |   | Stderr |
|---|---|---|---|---|---|---|---|---|
| triviaqa_wiki | 3 | normalize_triviaqa_preds | 5 | exact_match | ↑ | 0.6234 | ± | 0.0054 |
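That normalization is roughly along these lines (an approximation of the TriviaQA-style `normalize_answer`; the official triviaqa repo is the reference, not this snippet):

```python
import re
import string


def normalize_answer(s: str) -> str:
    """Approximate TriviaQA-style answer normalization: lowercase, replace
    punctuation with spaces, drop articles, and collapse whitespace. See the
    official triviaqa repo for the exact reference implementation."""

    def lower(text: str) -> str:
        return text.lower()

    def remove_punc(text: str) -> str:
        return "".join(" " if ch in string.punctuation else ch for ch in text)

    def remove_articles(text: str) -> str:
        return re.sub(r"\b(a|an|the)\b", " ", text)

    def white_space_fix(text: str) -> str:
        return " ".join(text.split())

    return white_space_fix(remove_articles(remove_punc(lower(s))))


print(normalize_answer("The Beer Cans!"))  # -> "beer cans"
```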
I'm a bit stumped at the moment, so I opened an issue in the llama repo asking if they could share the prompt details for TriviaQA.
We got a helpful response from the llama team! They pointed me to these eval details datasets they released on huggingface for llama 3.1 that include the model predictions, prompt format, and other eval config details (like max tokens to generate).
Using this info, I created a new TriviaQA config (included below) to match the Llama 3.1 setup as closely as possible. The changes I made (all visible in the config below) were:
- switching the dataset to the `rc.wikipedia.nocontext` subset,
- using the `Q: ...?\nA:` prompt format,
- stopping generation only on `"\n"` and capping `max_gen_toks` at 24,
- and pinning the few-shot examples via `fewshot_config` with the `first_n` sampler.
With this updated config, I got an exact match score of 67.75% for Llama 2 7B. That's still lower than the published score of 72.1%, but it's closer.
Note that the TriviaQA config I made should only be used with `num_fewshot <= 5` because of the way I set up the few-shot examples.
```yaml
task: triviaqa_wiki
dataset_path: trivia_qa
dataset_name: rc.wikipedia.nocontext
output_type: generate_until
training_split: train
validation_split: validation
description: "Answer these questions:\n\n"
doc_to_text: "Q: {{question}}?\nA:"
doc_to_target: "{{answer.aliases}}"
should_decontaminate: true
doc_to_decontamination_query: question
generation_kwargs:
  until:
    - "\n"
  do_sample: false
  temperature: 0.0
  # copied from llama3
  max_gen_toks: 24
filter_list:
  - name: remove_whitespace
    filter:
      - function: remove_whitespace
      - function: take_first
target_delimiter: " "
fewshot_delimiter: "\n"
fewshot_config:
  sampler: first_n
  samples:
    - question: Who was President when the first Peanuts cartoon was published?
      answer:
        aliases:
          - Harry Truman
    - question: Which American-born Sinclair won the Nobel Prize for Literature in 1930?
      answer:
        aliases:
          - Sinclair Lewis
    - question: Where in England was Dame Judi Dench born?
      answer:
        aliases:
          - York
    - question: William Christensen of Madison, New Jersey, has claimed to have the world's biggest collection of what?
      answer:
        aliases:
          - Beer Cans
    - question: In which decade did Billboard magazine first publish and American hit chart?
      answer:
        aliases:
          - 30s
    - question: null
      answer:
        aliases:
          - null
metric_list:
  - metric: exact_match
    aggregation: mean
    higher_is_better: true
    ignore_case: true
    ignore_punctuation: true
metadata:
  version: 3.0
```
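As an aside, a quick way to sanity-check how the `ignore_case` / `ignore_punctuation` options in `metric_list` behave is the Hugging Face `evaluate` implementation of exact_match (which, as far as I understand, is what backs these flags; treat this as a sketch):

```python
# Sanity check of the exact_match options used in metric_list above, via the
# Hugging Face `evaluate` exact_match metric (assumed here to mirror how the
# harness applies ignore_case / ignore_punctuation).
import evaluate

exact_match = evaluate.load("exact_match")
score = exact_match.compute(
    predictions=["Harry Truman."],
    references=["harry truman"],
    ignore_case=True,
    ignore_punctuation=True,
)
print(score)  # {'exact_match': 1.0}
```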
@jasonkrone Have you tried comparing the results on a question-by-question basis? Especially if they share model predictions, that seems like a good way to debug further.
Writing out some examples, this is what I see:
```
Question: At what weight did Alan Minter win his boxing world title?? Answer: Middle-weight
Question: """Superbad. Superdad."" was the tagline for which 2010 film?"? Answer: Despicable Me (single)
Question: Which of the Earth's atmospheric layers reflects radio waves?? Answer: Ionospheric model
Question: What shape is farfalle pasta?? Answer: Bow-ties
Question: Who was the lead singer in the US rock and roll group The Teenagers, who died in February 1968, aged 25?? Answer: Frankie Lyman
...
```
The two question marks seem like a problem.
I believe the question mark should be removed from the `doc_to_text` in the YAML:

```yaml
doc_to_text: "Q: {{question}}\nA:"
```
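Alternatively, if some questions in the dataset lack a trailing question mark, a small `!function`-style `doc_to_text` helper could normalize both cases (hypothetical `utils.py`; this assumes the harness's `doc_to_text: !function utils.doc_to_text` mechanism):

```python
# Hypothetical utils.py, referenced from the YAML as
#   doc_to_text: !function utils.doc_to_text
# Strips any trailing "?" from the dataset question before re-adding exactly
# one, so the rendered prompt never ends in "??".
def doc_to_text(doc: dict) -> str:
    question = doc["question"].strip().rstrip("?")
    return f"Q: {question}?\nA:"
```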
Hi,
Thank you for your valuable contribution and impressive project. I evaluated the "llama-2-chat" model on the TriviaQA task and obtained very low performance:
The score from the original paper (Llama 2 report) is around 60+ in the zero-shot setting:
Is there something wrong with the evaluation script? Below is my task.yml file: