EleutherAI / lm-evaluation-harness

A framework for few-shot evaluation of language models.
https://www.eleuther.ai
MIT License

Invalid modification to task YAML file #2076

Open duanhx1037 opened 3 months ago

duanhx1037 commented 3 months ago

I am running the llama2 model on the wikitext dataset. I just wanted to try some other metrics, so I modified the default YAML file (lm-evaluation-harness/lm_eval/tasks/wikitext/wikitext.yaml) as follows, simply deleting perplexity and adding acc.

task: wikitext
dataset_path: EleutherAI/wikitext_document_level
dataset_name: wikitext-2-raw-v1
output_type: loglikelihood_rolling
training_split: train
validation_split: validation
test_split: test
doc_to_text: ""
doc_to_target: !function preprocess_wikitext.wikitext_detokenizer
process_results: !function preprocess_wikitext.process_results
should_decontaminate: true
doc_to_decontamination_query: "{{page}}"
metric_list:
  - metric: acc
    aggregation: mean
    higher_is_better: true
metadata:
  version: 2.0
dataset_kwargs:
  trust_remote_code: true
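
For reference, the metric_list I deleted declared the perplexity-family metrics; if I remember the stock file correctly, it looked roughly like this (exact fields may differ):

metric_list:
  - metric: word_perplexity
  - metric: byte_perplexity
  - metric: bits_per_byte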

But when I run the evaluation again, I get an error like this:

Traceback (most recent call last):
  File "/home/xxx/environment/miniconda/miniconda3/envs/hf-transformers/bin/lm_eval", line 8, in <module>
    sys.exit(cli_evaluate())
  File "/data/xxx/lm-evaluation-harness/lm_eval/__main__.py", line 385, in cli_evaluate
    results = evaluator.simple_evaluate(
  File "/data/xxx/lm-evaluation-harness/lm_eval/utils.py", line 395, in _wrapper
    return fn(*args, **kwargs)
  File "/data/xxx/lm-evaluation-harness/lm_eval/evaluator.py", line 308, in simple_evaluate
    results = evaluate(
  File "/data/xxx/lm-evaluation-harness/lm_eval/utils.py", line 395, in _wrapper
    return fn(*args, **kwargs)
  File "/data/xxx/lm-evaluation-harness/lm_eval/evaluator.py", line 579, in evaluate
    task_output.calculate_aggregate_metric(bootstrap_iters=bootstrap_iters)
  File "/data/xxx/lm-evaluation-harness/lm_eval/evaluator_utils.py", line 96, in calculate_aggregate_metric
    agg_fn = self.task.aggregation()[metric]
KeyError: 'word_perplexity'

Running script:

lm_eval --model hf \
    --model_args pretrained=/data/xxx/cache/models--baffo32--decapoda-research-llama-7B-hf/snapshots/aa18b48a1330572a6dd5f5d5619ed19838ca285c \
    --tasks wikitext \
    --device cuda:1 \
    --batch_size 8

It seems the modification to the wikitext YAML file does not take effect.

haileyschoelkopf commented 3 months ago

Hi! What's going on here is that you're trying to use a metric, acc, that isn't set up to work with the loglikelihood_rolling output type. It's on our todo list to make the metric requirements clearer and to prevent silent failures in cases like this.

When you insert acc, what do you expect to be computed? Token-level accuracy as a percentage?
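
For context on where the KeyError comes from: the traceback suggests the task's process_results hook still returns perplexity keys such as word_perplexity, while the edited metric_list no longer declares an aggregation for them. Below is a minimal sketch of what such a hook typically looks like for a loglikelihood_rolling task, assuming the document text lives in the page field used by doc_to_decontamination_query; the real preprocess_wikitext.process_results may differ:

import re

def process_results(doc, results):
    # loglikelihood_rolling yields a single summed log-likelihood
    # over the whole document.
    (loglikelihood,) = results
    words = len(re.split(r"\s+", doc["page"]))
    num_bytes = len(doc["page"].encode("utf-8"))
    # Every key returned here must have a matching entry in metric_list;
    # a key without one triggers the KeyError seen in the traceback.
    return {
        "word_perplexity": (loglikelihood, words),
        "byte_perplexity": (loglikelihood, num_bytes),
        "bits_per_byte": (loglikelihood, num_bytes),
    }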

duanhx1037 commented 3 months ago

> Hi! What's going on here is that you're trying to use a metric, acc, that isn't set up to work with the loglikelihood_rolling output type. It's on our todo list to make the metric requirements clearer and to prevent silent failures in cases like this.
>
> When you insert acc, what do you expect to be computed? Token-level accuracy as a percentage?

Yes, token-level accuracy is what I had in mind; how would that be computed? It doesn't seem easy to modify this file correctly: at a minimum, output_type and the doc_to_* fields have to be set consistently with each other.
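
A minimal sketch of the general recipe for a custom metric under loglikelihood_rolling, assuming the process_results(doc, results) hook shown above and that metric names in metric_list must match the keys it returns. The name avg_token_logprob is hypothetical, and note that a true token-level accuracy would need per-token greedy-match information that loglikelihood_rolling does not return:

def process_results(doc, results):
    # loglikelihood_rolling returns only one summed log-likelihood per
    # document -- no per-token greedy matches -- so real token-level
    # accuracy would require a different output_type.
    (loglikelihood,) = results
    num_words = max(len(doc["page"].split()), 1)  # crude length proxy
    # Hypothetical metric: mean log-probability per word. The key must
    # also appear in metric_list for aggregation to find it.
    return {"avg_token_logprob": loglikelihood / num_words}

with a matching metric_list entry such as:

metric_list:
  - metric: avg_token_logprob
    aggregation: mean
    higher_is_better: true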