EleutherAI / lm-evaluation-harness

A framework for few-shot evaluation of language models.

[big-refactor] hellaswag benchmark: doc has type int #800

Closed. Wehzie closed this issue 1 year ago.

Wehzie commented 1 year ago

Dear maintainers, I am unable to run the hellaswag benchmark on the big-refactor branch. The model being loaded is vicuna-7b-v1.3 in float16.

Update: the same problem occurs with arc_challenge.

python main.py --model hf-auto --model_args pretrained=/path/to/vicuna-7B-base --tasks hellaswag
2023-08-22:17:31:49,013 INFO     [utils.py:160] NumExpr defaulting to 8 threads.
2023-08-22:17:31:51,390 INFO     [instantiator.py:21] Created a temporary directory at /tmp/tmp3ijo0mel
2023-08-22:17:31:51,391 INFO     [instantiator.py:76] Writing /tmp/tmp3ijo0mel/_remote_module_non_scriptable.py
2023-08-22:17:31:51,826 WARNING  [__init__.py:95] Failed to load config in
                                 /home/user/desktop/llm-benchmarking/lm-evaluation-harness/lm_eval/tasks/storycloze/storycloze_2018.yaml
                                 Config will not be added to registry
                                 Error: task named 'storycloze_2016' conflicts with existing registered task!
2023-08-22:17:31:51,919 WARNING  [metric.py:13] PERSPECTIVE_API_KEY is not set. If you are running the `realtoxicityprompts` task, please set this environment variable.
2023-08-22:17:31:52,097 WARNING  [__init__.py:95] Failed to load config in
                                 /home/user/desktop/llm-benchmarking/lm-evaluation-harness/lm_eval/tasks/super_glue/cb/t5-prompt.yaml
                                 Config will not be added to registry
                                 Error: [Errno 2] No such file or directory: '/home/user/desktop/llm-benchmarking/lm-evaluation-harness/lm_eval/tasks/super_glue/cb/t5_utils.py'
2023-08-22:17:31:52,229 WARNING  [__init__.py:55] Failed to load benchmark in
                                 /home/user/desktop/llm-benchmarking/lm-evaluation-harness/lm_eval/benchmarks/t0_eval.yaml
                                 Benchmark will not be added to registry
                                 Error: No module named 'promptsource'
2023-08-22:17:31:52,248 INFO     [main.py:153] Selected Tasks: ['hellaswag']
2023-08-22:17:31:52,264 INFO     [huggingface.py:113] Using device 'cuda'
Loading checkpoint shards: 100%|████████████████████████████████████████████████████████████████████████████████████████| 2/2 [00:05<00:00,  2.98s/it]
You are using the legacy behaviour of the <class 'transformers.models.llama.tokenization_llama.LlamaTokenizer'>. This means that tokens that come after special tokens will not be properly handled. We recommend you to read the related pull request available at https://github.com/huggingface/transformers/pull/24565
DEBUG-1: {'ind': 24, 'activity_label': 'Roof shingle removal', 'ctx_a': 'A man is sitting on a roof.', 'ctx_b': 'he', 'ctx': 'A man is sitting on a roof. he', 'endings': ['is using wrap to wrap a pair of skis.', 'is ripping level tiles off.', "is holding a rubik's cube.", 'starts pulling up roofing on a roof.'], 'source_id': 'activitynet~v_-JhWjGDPHMY', 'split': 'val', 'split_type': 'indomain', 'label': '3', 'query': 'Roof shingle removal: A man is sitting on a roof. He', 'choices': ['is using wrap to wrap a pair of skis.', 'is ripping level tiles off.', "is holding a rubik's cube.", 'starts pulling up roofing on a roof.'], 'gold': 3}
<class 'dict'>
DEBUG-2: {{choices}}
<class 'str'>
DEBUG-1: 3
<class 'int'>
DEBUG-2: {{choices}}
<class 'str'>
DEBUG-3
{{choices}}
3

The logs above were obtained by adding debug prints to doc_to_choice() in lm_eval/api/task.py as follows.

    def doc_to_choice(self, doc: Any) -> List[str]:

        print(f"DEBUG-1: {doc}")
        print(type(doc))

        if self.prompt is not None:
            doc_to_choice = self.prompt
        elif self._config.doc_to_choice is None:
            eval_logger.error("doc_to_choice was called but not set in config")
        else:
            doc_to_choice = self._config.doc_to_choice

        print(f"DEBUG-2: {doc_to_choice}")
        print(type(doc_to_choice))

        if type(doc_to_choice) == str:
            return ast.literal_eval(utils.apply_template(doc_to_choice, doc))
        elif type(doc_to_choice) == list:
            return doc_to_choice
        elif type(doc_to_choice) == dict:
            return list(doc_to_choice.values())
        elif callable(doc_to_choice):
            return doc_to_choice(doc)
        elif hasattr(doc_to_choice, "get_answer_choices_list"):
            return doc_to_choice.get_answer_choices_list(doc)
        else:
            raise TypeError
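
For hellaswag, doc_to_choice in the task config is the template string "{{choices}}" (see DEBUG-2), so the str branch above applies: the template is rendered against the doc and the result is parsed back into a list with ast.literal_eval. A minimal sketch of that path with a doc dict like the one in DEBUG-1 (my own illustration, not harness code):

import ast
from jinja2 import Environment

env = Environment()
doc = {"choices": ["is ripping level tiles off.",
                   "starts pulling up roofing on a roof."]}

# Jinja2 stringifies the list when rendering "{{choices}}" ...
rendered = env.from_string("{{choices}}").render(**doc)
# ... and ast.literal_eval turns that string back into a Python list.
print(ast.literal_eval(rendered))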

The error originates from apply_template() in lm_eval/utils.py, where I added some debugging output as follows.

def apply_template(template: str, doc: dict) -> str:
    if isinstance(doc, dict):
        rtemplate = env.from_string(template)
        return rtemplate.render(**doc)
    else:
        print("DEBUG-3")
        print(template)
        print(doc)
        exit()
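
The underlying failure is the ** unpacking in rtemplate.render(**doc): once doc is the int label rather than a dict, the call raises before any rendering happens. A minimal standalone reproduction (my own sketch, not harness code):

from jinja2 import Environment

env = Environment()
template = env.from_string("{{choices}}")

doc = 3  # the int label that reaches apply_template instead of a doc dict
# Raises: TypeError: ... argument after ** must be a mapping, not int
template.render(**doc)
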
ZZR0 commented 1 year ago

Same issue!

This is caused by the following code in lm_eval/api/task.py. Here test_target is an int; I believe the correct code would be `test_target = [self.doc_to_choice(test_doc)[test_target]]`. This looks like a typo.

        if type(test_target) is list:
            self.multiple_target = len(test_target)
        else:
            if (type(test_target) is int) and (test_choice is not None):
                # BUG: the int `test_target` is passed to doc_to_choice instead of the document `test_doc`
                test_target = [self.doc_to_choice(test_target)[test_target]]
            else:
                test_target = [test_target]

        if test_choice is not None:
            check_choices = test_choice
        else:
            check_choices = test_target
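
For reference, a sketch of the corrected branch as suggested above; test_doc is assumed to be the document variable already in scope at this point in task.py:

            if (type(test_target) is int) and (test_choice is not None):
                # Pass the document (not the int label) to doc_to_choice,
                # then index the returned choices with the gold label.
                test_target = [self.doc_to_choice(test_doc)[test_target]]
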
jphme commented 1 year ago

Same here; I always get `jinja2.environment.Template.render() argument after ** must be a mapping, not int` for many benchmarks (arc_challenge, hellaswag, sciq) on big-refactor.

haileyschoelkopf commented 1 year ago

@lintangsutawika does one of your recent PRs fix this issue, or do we still need to push a fix for this?

lintangsutawika commented 1 year ago

Oh no, really sorry I missed this issue. And yes, I remember hitting this same issue and patching a fix.

https://github.com/EleutherAI/lm-evaluation-harness/blob/8f448eed3f9a0cf107ac36fc81d66007cf0a7d8b/lm_eval/api/task.py#L661-L672

haileyschoelkopf commented 1 year ago

No worries! I was just wondering if #819 fixed this. I think we're also encountering an issue where HellaSwag fails to download properly from GitHub's CDN as of today/yesterday.

fattorib commented 1 year ago

Just confirming #819 fixes this! Tested with:

python -m lm_eval \
    --model hf \
    --model_args pretrained=EleutherAI/pythia-160m,dtype="float" \
    --tasks hellaswag,arc_challenge \
    --device cuda:0 \
    --batch_size 8

on the most recent commit of big-refactor; evaluations run and finish successfully.