EleutherAI / lm-evaluation-harness

A framework for few-shot evaluation of language models.
https://www.eleuther.ai
MIT License

Low results on TriviaQA #1292

Open yafuly opened 7 months ago

yafuly commented 7 months ago

Hi,

Thank you for your valuable contribution and impressive project. I evaluated the "llama-2-chat" model on the TriviaQA task and obtained very low performance:

Tasks Version Filter n-shot Metric Value Stderr
triviaqa Yaml remove_whitespace 0 exact_match 0.0229 ± 0.0011

The score from the original paper (Llama 2 report) is around 60+ in the zero-shot setting:

[screenshot of the TriviaQA results table from the Llama 2 paper]

Is there something wrong with the evaluation script? Below is my task.yml file:

task: triviaqa
dataset_path: trivia_qa
dataset_name: rc.nocontext
output_type: generate_until
training_split: train
validation_split: validation
doc_to_text: "Question: {{question}}?\nAnswer:"
doc_to_target: "{{answer.aliases}}"
should_decontaminate: true
doc_to_decontamination_query: question
generation_kwargs:
  until:
    - "\n"
    - "."
    - ","
  do_sample: false
  temperature: 0.0
filter_list:
  - name: remove_whitespace
    filter:
      - function: remove_whitespace
      - function: take_first
target_delimiter: " "
metric_list:
  - metric: exact_match
    aggregation: mean
    higher_is_better: true
    ignore_case: true
    ignore_punctuation: true
metadata:
  version: 2.0
haileyschoelkopf commented 7 months ago

Hi! What version of the codebase are you using? We recently merged a fix in #1268 affecting all generation-based tasks with HF models, which might be causing this.

I would not expect a score this low. However, Llama 1 and 2 use evaluation prompts that are mostly undisclosed, so it's tricky to reproduce their numbers without knowing the setup.

To improve performance, you could try adding description: "Answer these questions:\n\n" and seeing what the score is after that. I've observed that adding a description helps especially in the zero-shot setting, and helps less so for few-shot.
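
To make the suggestion concrete, here is a rough sketch (not the harness's actual code) of what the zero-shot prompt ends up looking like once description is set in the task YAML; the exact concatenation is handled internally by the harness.

# Rough sketch, not the harness's internals: approximate zero-shot prompt
# once `description` is set in the task YAML.
description = "Answer these questions:\n\n"
doc_to_text = "Question: {question}?\nAnswer:"

def build_prompt(question: str) -> str:
    # The description is prepended ahead of the rendered doc_to_text template
    # (in zero-shot there are no few-shot examples in between).
    return description + doc_to_text.format(question=question)

print(build_prompt("Who wrote the novel Dracula"))
# Answer these questions:
#
# Question: Who wrote the novel Dracula?
# Answer: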

Hannibal046 commented 7 months ago
Hi, I tested it on my side. For llama2-7b, 0-shot, these are the results I got:

hf (pretrained=meta-llama/Llama-2-7b-hf), gen_kwargs: (None), limit: None, num_fewshot: None, batch_size: auto

Tasks Version Filter n-shot Metric Value Stderr
triviaqa 3 remove_whitespace 0 exact_match 0.5247 ± 0.0037

After adding the description "Answer these questions:\n\n":

hf (pretrained=meta-llama/Llama-2-7b-hf), gen_kwargs: (None), limit: None, num_fewshot: None, batch_size: auto

Tasks Version Filter n-shot Metric Value Stderr
triviaqa 3 remove_whitespace 0 exact_match 0.584 ± 0.0037

Still a little lower than the 65.8 reported in the paper.

BTW, I think if you want to evaluate a chat model, you should provide the right chat template. For a base model, you can just use the default.

Hannibal046 commented 7 months ago

This is the YAML file:

task: triviaqa
dataset_path: trivia_qa
dataset_name: rc.nocontext
output_type: generate_until
training_split: train
validation_split: validation
description: "Answer these questions:\n\n"
doc_to_text: "Question: {{question}}?\nAnswer:"
doc_to_target: "{{answer.aliases}}"
should_decontaminate: true
doc_to_decontamination_query: question
generation_kwargs:
  until:
    - "\n"
    - "."
    - ","
  do_sample: false
  temperature: 0.0
filter_list:
  - name: remove_whitespace
    filter:
      - function: remove_whitespace
      - function: take_first
target_delimiter: " "
metric_list:
  - metric: exact_match
    aggregation: mean
    higher_is_better: true
    ignore_case: true
    ignore_punctuation: true
metadata:
  version: 3.0
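
For completeness, a run like the ones above can also be launched from Python instead of the CLI. A minimal sketch, assuming lm_eval's simple_evaluate entry point in a recent v0.4.x release (argument names may differ slightly across versions):

# Minimal sketch, assuming lm_eval v0.4.x's Python entry point; the CLI
# invocations shown above are the equivalent way to run this.
import lm_eval

results = lm_eval.simple_evaluate(
    model="hf",
    model_args="pretrained=meta-llama/Llama-2-7b-hf",
    tasks=["triviaqa"],
    num_fewshot=0,
    batch_size="auto",
)
print(results["results"]["triviaqa"])  # metric dict, e.g. exact_match and stderr
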
yafuly commented 7 months ago

Hi,

Thanks for your support @haileyschoelkopf @Hannibal046

Your empirical results are extremely helpful :) @Hannibal046

I believe the performance discrepancy may be attributed to two factors. Firstly, I mistakenly used an incorrect template that did not include system prompts for chat models. Secondly, it appears that the paper's results were obtained from a different subset of TriviaQA, namely TriviaQA wiki.

Apart from modifying the doc_to_text field of the YAML file, are there any other ways to integrate system prompts or customized prompts to support chat LLMs?

haileyschoelkopf commented 7 months ago

@yafuly There is work-in-progress support in this PR: https://github.com/EleutherAI/lm-evaluation-harness/pull/1287, which you're welcome to take a look at! (Currently the system_prompt CLI arg is not hooked up properly, though.)

I hope to continue it very soon and get it added to the library, but I'm observing some decreased performance on the model I tested, so I believe more handling will be needed on our end to determine the right way to format chat-templated + system-prompted evaluations (with respect to whitespace, and to where each prompt component is placed in the chat).

pminervini commented 7 months ago

Can one map the "description:" tag to a function that calls the model's tokenizer, e.g. "tokenizer.apply_chat_template"?

StellaAthena commented 6 months ago

Can one map the "description:" tag to a function that calls the model's tokenizer, e.g. "tokenizer.apply_chat_template"?

This should be the behavior of a HuggingFace LM. If that's not how the draft PR functions, it's either a bug or Hailey hasn't finished it yet.
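
For illustration only, here is a minimal sketch of what applying a model's chat template to a TriviaQA-style prompt looks like with the transformers API mentioned above; this is not how the draft PR is wired up, and where the description/system prompt should live is exactly the open question.

# Illustrative sketch of tokenizer.apply_chat_template, outside the harness.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-2-7b-chat-hf")

messages = [
    # Putting the task description in the system turn is one option;
    # whether that matches the paper's setup is unknown.
    {"role": "system", "content": "Answer these questions:"},
    {"role": "user", "content": "Question: Who wrote the novel Dracula?\nAnswer:"},
]

prompt = tokenizer.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)
print(prompt)
# e.g. "<s>[INST] <<SYS>>\nAnswer these questions:\n<</SYS>>\n\nQuestion: ... [/INST]"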

jasonkrone commented 1 month ago

I'm similarly trying to replicate the Llama 2 7B results on TriviaQA and hitting a gap between the scores from the eval harness vs. the Llama 2 paper in the 5-shot setting (eval harness: 64.08% vs. paper: 72.10%). Lmk if you have any ideas on where the gap might be coming from!

Update: my current best guess (credit to @yafuly for noting it) is that the gap is because Llama 2 reports results for the wiki eval set. I'll try evaluating with the rc.wikipedia.nocontext version tomorrow and see if that solves it.

jasonkrone commented 1 month ago

Tried rc.wikipedia.nocontext with n-shot = 5, and it gave a +2% boost, but there's still a 6% gap. Will continue to update as I track down the source of the gap.

Tasks Version Filter n-shot Metric Value Stderr
triviaqa 3 remove_whitespace 5 exact_match 0.6407 ± 0.0036
triviaqa_wiki 3 remove_whitespace 5 exact_match 0.6627 ± 0.0053
jasonkrone commented 1 month ago

Also tried modifying the filters to use the official repo's normalize-answer function, which caused the scores to drop:

Tasks Version Filter n-shot Metric Value Stderr
triviaqa_wiki 3 normalize_triviaqa_preds 5 exact_match 0.6234 ± 0.0054

I'm a bit stumped at the moment, so I opened an issue in the llama repo asking if they could share the prompt details for TriviaQA.
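
For reference, the normalization used in TriviaQA-style evaluation is roughly the SQuAD-style pipeline below; this is a sketch from memory, not a verbatim copy of the official repo's function.

# Sketch of SQuAD/TriviaQA-style answer normalization: lowercase, strip
# punctuation and articles, collapse whitespace. Not a verbatim copy of the
# official TriviaQA evaluation script.
import re
import string

def normalize_answer(s: str) -> str:
    def lower(text):
        return text.lower()

    def remove_punc(text):
        return "".join(ch for ch in text if ch not in set(string.punctuation))

    def remove_articles(text):
        return re.sub(r"\b(a|an|the)\b", " ", text)

    def white_space_fix(text):
        return " ".join(text.split())

    return white_space_fix(remove_articles(remove_punc(lower(s))))

print(normalize_answer("The Beer Cans!"))  # -> "beer cans"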

jasonkrone commented 3 weeks ago

We got a helpful response from the Llama team! They pointed me to the eval-details datasets they released on Hugging Face for Llama 3.1, which include the model predictions, prompt format, and other eval config details (like the max number of tokens to generate).

Using this info, I created a new TriviaQA config (included below) to match the Llama 3.1 setup as closely as possible. The changes I made were:

- switched the dataset to the rc.wikipedia.nocontext subset
- changed doc_to_text to the "Q: {{question}}?\nA:" format
- stopped generation only on "\n" and capped generation length with max_gen_toks: 24 (copied from the Llama 3.1 eval details)
- pinned the first five few-shot examples via fewshot_config with sampler: first_n, and set fewshot_delimiter to "\n"

With this updated config, I got an exact match score of 67.75% for llama2 7b. That's still lower than the published score of 72.1% but it's closer.

Note: the TriviaQA config I made should only be used with num_fewshot <= 5 because of the way I set up the few-shot examples.

Config

task: triviaqa_wiki
dataset_path: trivia_qa
dataset_name: rc.wikipedia.nocontext
output_type: generate_until
training_split: train
validation_split: validation
description: "Answer these questions:\n\n"
doc_to_text: "Q: {{question}}?\nA:"
doc_to_target: "{{answer.aliases}}"
should_decontaminate: true
doc_to_decontamination_query: question
generation_kwargs:
  until:
    - "\n"
  do_sample: false
  temperature: 0.0
  # copied from llama3
  max_gen_toks: 24
filter_list:
  - name: remove_whitespace
    filter:
      - function: remove_whitespace
      - function: take_first
target_delimiter: " "
fewshot_delimiter: "\n"
fewshot_config:
  sampler: first_n
  samples:
    - question: Who was President when the first Peanuts cartoon was published?
      answer:
        aliases:
          - Harry Truman
    - question: Which American-born Sinclair won the Nobel Prize for Literature in 1930?
      answer:
        aliases:
          - Sinclair Lewis
    - question: Where in England was Dame Judi Dench born?
      answer:
        aliases:
          - York
    - question: William Christensen of Madison, New Jersey, has claimed to have the world's biggest collection of what?
      answer:
        aliases:
          - Beer Cans
    - question: In which decade did Billboard magazine first publish and American hit chart?
      answer:
        aliases:
          - 30s
    - question: null
      answer:
        aliases:
          - null
metric_list:
  - metric: exact_match
    aggregation: mean
    higher_is_better: true
    ignore_case: true
    ignore_punctuation: true
metadata:
  version: 3.0
StellaAthena commented 3 weeks ago

@jasonkrone Have you tried comparing the results on a question-by-question basis? Especially if they share model predictions, that seems like a good way to debug further.
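
A comparison along those lines can be as simple as joining the two prediction sets on the question text. A sketch, assuming both sides have been exported to JSONL; the file names and the "question"/"prediction" field names are hypothetical and need to be adapted to the harness's per-sample output (e.g. from --log_samples) and to Meta's eval-details files.

# Sketch of a question-by-question diff between two prediction dumps.
# File paths and the "question"/"prediction" field names are hypothetical.
import json

def load_preds(path: str) -> dict:
    preds = {}
    with open(path) as f:
        for line in f:
            row = json.loads(line)
            preds[row["question"].strip().lower()] = row["prediction"]
    return preds

harness = load_preds("harness_triviaqa_samples.jsonl")
llama = load_preds("llama_eval_details_triviaqa.jsonl")

for q in sorted(harness.keys() & llama.keys()):
    if harness[q] != llama[q]:
        print(f"Q: {q}")
        print(f"  harness: {harness[q]}")
        print(f"  llama:   {llama[q]}")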