EleutherAI / lm-evaluation-harness

A framework for few-shot evaluation of language models.
https://www.eleuther.ai
MIT License

always get acc, acc_norm, perplexity = 1 on triviaqa task based on llama2 model #1239

Open · learner-crapy opened this issue 8 months ago

learner-crapy commented 8 months ago

I use the following command to run the triviaqa task:

lm_eval --model hf \
    --model_args pretrained=../llama/models_hf/7B \
    --tasks triviaqa \
    --num_fewshot 1 \
    --device cuda:2 \
    --batch_size 8

I just get acc_norm = 1, and it is the same when I use the acc or perplexity metric.

haileyschoelkopf commented 8 months ago

Hi! Could you provide the YAML file and codebase commit you are using to evaluate triviaqa?

This output seems quite strange given that triviaqa uses none of these metrics in its config.

I can't seem to replicate your result when I run triviaqa locally on gpt2.
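
The exact replication command isn't shown in this thread; presumably it is the original invocation with the model swapped for gpt2, along the lines of:

lm_eval --model hf \
    --model_args pretrained=gpt2 \
    --tasks triviaqa \
    --num_fewshot 1 \
    --batch_size 8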

learner-crapy commented 8 months ago

Hi, thank you for your response. With the exact_match metric, I get a value of 0.07.

Here is the YAML I used; I only changed the metric.

task: triviaqa
dataset_path: trivia_qa
dataset_name: rc.nocontext
output_type: generate_until
training_split: train
validation_split: validation
doc_to_text: "Question: {{question}}?\nAnswer:"
doc_to_target: "{{answer.aliases}}"
should_decontaminate: true
doc_to_decontamination_query: question
generation_kwargs:
  until:
    - "\n"
    - "."
    - ","
  do_sample: false
  temperature: 0.0
filter_list:
  - name: remove_whitespace
    filter:
      - function: remove_whitespace
      - function: take_first
target_delimiter: " "
metric_list:
  - metric: acc
    aggregation: mean
    higher_is_better: true
    ignore_case: true
    ignore_punctuation: true
metadata:
  version: 2.0

For the code, I cloned the repository with the following command and changed nothing except the YAML file above.

git clone https://github.com/EleutherAI/lm-evaluation-harness.git

haileyschoelkopf commented 8 months ago

Those metrics you used are currently only supported for loglikelihood or multiple_choice output_type tasks. In your case that could typically be addressed by setting output_type: loglikelihood, but it is complicated by triviaqa's use of multiple gold-standard answers.
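
For reference, the stock triviaqa config keeps exact_match as the metric for this generate_until task; a sketch of that metric_list (assuming the rest of the YAML above is left unchanged):

metric_list:
  - metric: exact_match
    aggregation: mean
    higher_is_better: true
    ignore_case: true
    ignore_punctuation: true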

We'll make sure that running this errors out explicitly in the future to avoid confusion.

We are also working on making metrics easier to understand and to add to the library in general!

learner-crapy commented 8 months ago

Thanks a lot.