EleutherAI / lm-evaluation-harness

A framework for few-shot evaluation of language models.
https://www.eleuther.ai
MIT License
5.95k stars 1.58k forks

Multiple issues Encountered During Tasks Verification #1885

Open zhabuye opened 2 months ago

zhabuye commented 2 months ago

I am currently verifying all tasks under lm-evaluation-harness and will raise the issues I encounter one by one in this thread. Thank you for looking into them! @haileyschoelkopf @lintangsutawika Below are the command templates I executed:

lm_eval --model hf --model_args pretrained=Qwen1.5-0.5B --tasks task_name --device cuda:0 --batch_size 16
accelerate launch -m lm_eval --model hf --model_args pretrained=Qwen1.5-0.5B --tasks task_name --batch_size 16
lm_eval --model hf --model_args pretrained=Qwen1.5-0.5B,parallelize=True --tasks task_name --batch_size 16
zhabuye commented 2 months ago

group: basque-glue task: bhtc_v2, bec, vaxx_stance, qnlieu, wiceu, epec_korref_bin issue: The tasks bec and epec_korref_bin both fail with Tasks not found. After reviewing the YAML files I found bec2016eu and epec_koref_bin instead, so I would like to ask whether the README description is mistaken. In addition, the group and all of the tasks above hit the following error:

File "/lm-evaluation-harness/lm_eval/evaluator_utils.py", line 80, in from_taskdict
    n_shot = task_config.get("metadata", {}).get("num_fewshot", 0)
AttributeError: 'list' object has no attribute 'get'

Done in #1913.

zhabuye commented 2 months ago

group: belebele task: belebele_kac_Latn issue: Single-GPU evaluation passed, but multi-GPU evaluation failed with requests.exceptions.HTTPError: 502 Server Error: Bad Gateway for url: https://huggingface.co/api/datasets/facebook/belebele/paths-info/75b399394a9803252cfec289d103de462763db7c. The following tasks likewise download and evaluate normally on a single GPU but fail during multi-GPU evaluation, mostly with network-related errors or a ValueError:

codexglue_code2text
french_bench
gpt3_translation_benchmarks
triviaqa
xnli
translation
generate_until
openllm
zhabuye commented 2 months ago

group: pile_10k, gpqa issue: During multi-GPU evaluation, only one or two of the eight cards show high utilization for these two tasks, which leads to an out-of-memory error even when the batch size for pile_10k is set to 1.

zhabuye commented 2 months ago

group: scrolls task: scrolls_contractnli, scrolls_govreport, scrolls_narrativeqa, scrolls_qasper, scrolls_qmsum, scrolls_quality, scrolls_summscreenfd issue: Both single-GPU and multi-GPU evaluation report ValueError: A random.Random generator argument must be provided to rnd. Additionally, evaluating with scrolls shows Tasks not found, i.e. the group name cannot be used to run the subtasks together.

zhabuye commented 2 months ago

task: t0_eval issue: Both single-GPU and multi-GPU evaluation report the error:

File "/lm-evaluation-harness/lm_eval/tasks/__init__.py", line 233, in _load_individual_task_or_group
    group_name = name_or_config["group"]
KeyError: 'group'
zhabuye commented 2 months ago

task: flan_held_out issue: Both single-GPU and multi-GPU evaluation report the error:

  File "/lm-evaluation-harness/lm_eval/filters/__init__.py", line 21, in build_filter_ensemble
    f = partial(get_filter(function), **kwargs)
TypeError: the first argument must be callable
zhabuye commented 2 months ago

task: csatqa issue:

datasets.exceptions.DatasetNotFoundError: Dataset 'EleutherAI/csatqa' doesn't exist on the Hub or cannot be accessed. If the dataset is private or gated, make sure to log in with `huggingface-cli login` or visit the dataset page at https://huggingface.co/datasets/EleutherAI/csatqa  to ask for access.

It seems the dataset name on Hugging Face has changed; please check.

zhabuye commented 2 months ago

task: bigbench_generate_until issue: Both single-GPU and multi-GPU evaluation report the error:

  File "/data2/miniconda3/envs/ZYB/lib/python3.8/site-packages/datasets/arrow_reader.py", line 255, in read
    raise ValueError(msg)
ValueError: Instruction "train" corresponds to no data!
zhabuye commented 2 months ago

task: bigbench_multiple_choice issue: Both single-GPU and multi-GPU evaluation report the error:

  File "/data2/miniconda3/envs/ZYB/lib/python3.8/site-packages/jinja2/environment.py", line 1301, in render
    self.environment.handle_exception()
  File "/data2/miniconda3/envs/ZYB/lib/python3.8/site-packages/jinja2/environment.py", line 936, in handle_exception
    raise rewrite_traceback_stack(source=source)
  File "<template>", line 1, in top-level template code
ValueError: "English: You don't want to push the button lightly, but rather punch it hard." is not in list
zhabuye commented 2 months ago

task: ifeval issue: Both single-GPU and multi-GPU evaluation report the error:

File "/home/ZYB/project/lm-evaluation-harness/lm_eval/tasks/ifeval/utils.py", line 9, in <module>
    class InputExample:
  File "/home/ZYB/project/lm-evaluation-harness/lm_eval/tasks/ifeval/utils.py", line 11, in InputExample
    instruction_id_list: list[str]
TypeError: 'type' object is not subscriptable

It passes on Python 3.9.
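For context, a minimal sketch of the incompatibility: on Python 3.8, subscripting the built-in list in an evaluated annotation raises exactly this TypeError, while deferring annotation evaluation with the __future__ import sidesteps it. The class and field names below mirror the ifeval traceback; the example value is illustrative only.

```python
from __future__ import annotations  # PEP 563: annotations stay strings
from dataclasses import dataclass


@dataclass
class InputExample:
    # On Python 3.8, a bare `list[str]` here would raise
    # TypeError: 'type' object is not subscriptable; with the future
    # import it is never evaluated, so 3.8 accepts it as well.
    instruction_id_list: list[str]


ex = InputExample(instruction_id_list=["keywords:existence"])
print(ex.instruction_id_list)
```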

zhabuye commented 2 months ago

task: pile issue: Both single-GPU and multi-GPU evaluation report the error:

File "/home/anaconda3/envs/ZYB/lib/python3.8/site-packages/datasets/builder.py", line 374, in __init__
    self.config, self.config_id = self._create_builder_config(
  File "/home/anaconda3/envs/ZYB/lib/python3.8/site-packages/datasets/builder.py", line 599, in _create_builder_config
    raise ValueError(
ValueError: BuilderConfig 'pile_openwebtext2' not found. Available: ['all', 'enron_emails', 'europarl', 'free_law', 'hacker_news', 'nih_exporter', 'pubmed', 'pubmed_central', 'ubuntu_irc', 'uspto', 'github']
zhabuye commented 2 months ago

task: storycloze issue: Both single-GPU and multi-GPU evaluation report the error:

File "/home/anaconda3/envs/ZYB/lib/python3.8/site-packages/datasets/builder.py", line 374, in __init__
    self.config, self.config_id = self._create_builder_config(
  File "/home/anaconda3/envs/ZYB/lib/python3.8/site-packages/datasets/builder.py", line 612, in _create_builder_config
    builder_config = self.BUILDER_CONFIG_CLASS(**config_kwargs)
  File "<string>", line 8, in __init__
  File "/home/anaconda3/envs/ZYB/lib/python3.8/site-packages/datasets/builder.py", line 131, in __post_init__
    if invalid_char in self.name:
TypeError: argument of type 'int' is not iterable
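One plausible cause (an assumption on my part, not confirmed against the harness source): YAML parses an unquoted value like 2016 as an int, so if a task config passes it as the dataset config name, datasets receives an int where BuilderConfig.name should be a str, and its `invalid_char in self.name` membership test fails exactly as above. A minimal sketch:

```python
# Hypothetical values for illustration: what `name: 2016` versus
# `name: "2016"` would yield after YAML parsing.
name_from_unquoted_yaml = 2016    # int
name_from_quoted_yaml = "2016"    # str

# datasets' check `if invalid_char in self.name:` works on a str...
print("/" in name_from_quoted_yaml)

# ...but raises on an int, matching the traceback.
try:
    "/" in name_from_unquoted_yaml
except TypeError as err:
    print(err)  # argument of type 'int' is not iterable
```

If this is the cause, quoting the value in the task YAML so it parses as a string would be the fix.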
zhabuye commented 2 months ago

task: wmt-t5-prompt issue: Both single-GPU and multi-GPU evaluation report the error:

Traceback (most recent call last):
  File "/home/ZYB/project/lm-evaluation-harness/lm_eval/api/task.py", line 1369, in process_results
    result_score = self._metric_fn_list[metric](
TypeError: 'NoneType' object is not callable

(The same traceback is printed by each worker process.)
Some-random commented 1 month ago

I have the same problem with storycloze

haileyschoelkopf commented 1 month ago

Hi @zhabuye , thanks very much for going through these!

I'll try to respond or address these one by one ASAP.

minaremeli commented 1 month ago

group: basque-glue task: bhtc_v2, bec, vaxx_stance, qnlieu, wiceu, epec_korref_bin issue: The tasks bec and epec_korref_bin both have the issue of Tasks not found, after reviewing the YAML file, I found bec2016eu and epec_koref_bin, I would like to ask if there is a mistake in the README description? The above group and tasks have both encountered the issue:

File "/lm-evaluation-harness/lm_eval/evaluator_utils.py", line 80, in from_taskdict
    n_shot = task_config.get("metadata", {}).get("num_fewshot", 0)
AttributeError: 'list' object has no attribute 'get'

A minor change in the YAML files resolves the AttributeError for me: remove the hyphen (-) in front of the version key.

metadata:
  version: 1.0
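To illustrate why the hyphen matters, here are plain Python stand-ins for the two parsed YAML spellings (no YAML library or harness code involved): `- version: 1.0` makes metadata a one-element list, while `version: 1.0` makes it a dict, and the lookup in evaluator_utils.py needs a dict.

```python
# Parsed equivalents of the two YAML spellings:
#   metadata:              metadata:
#     - version: 1.0         version: 1.0
with_hyphen = {"metadata": [{"version": 1.0}]}     # list value
without_hyphen = {"metadata": {"version": 1.0}}    # dict value

# The call at evaluator_utils.py:80 expects a dict:
print(without_hyphen.get("metadata", {}).get("num_fewshot", 0))  # 0

try:
    with_hyphen.get("metadata", {}).get("num_fewshot", 0)
except AttributeError as err:
    print(err)  # 'list' object has no attribute 'get'
```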
zhabuye commented 1 month ago

group: basque-glue task: bhtc_v2, bec, vaxx_stance, qnlieu, wiceu, epec_korref_bin issue: The tasks bec and epec_korref_bin both have the issue of Tasks not found, after reviewing the YAML file, I found bec2016eu and epec_koref_bin, I would like to ask if there is a mistake in the README description? The above group and tasks have both encountered the issue:

File "/lm-evaluation-harness/lm_eval/evaluator_utils.py", line 80, in from_taskdict
    n_shot = task_config.get("metadata", {}).get("num_fewshot", 0)
AttributeError: 'list' object has no attribute 'get'

A minor change in the YAML files resolves the AttributeError for me. Remove the hyphen (-) in front of the version.

metadata:
  version: 1.0

Thank you very much for the proposal; it really did solve the problem. I will submit a pull request accordingly.

zhabuye commented 1 month ago

When using the --tasks bbh option, the results appear to include only the bbh_cot_fewshot group. Were the other three groups omitted?

LSinev commented 1 month ago

importing List from the typing module can avoid this error

Maybe fewer code changes are needed with the addition of from __future__ import annotations in the import sections?

https://github.com/asottile/pyupgrade — maybe this can help check for missed compatibility fixes
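For completeness, the typing.List spelling mentioned above already works on Python 3.8 without the future import. The field name is copied from the ifeval traceback; the rest is a sketch.

```python
from dataclasses import dataclass
from typing import List  # typing.List is subscriptable on Python 3.8


@dataclass
class InputExample:
    # List[str] evaluates fine at class-creation time, unlike list[str],
    # which only became subscriptable in Python 3.9.
    instruction_id_list: List[str]


print(InputExample(instruction_id_list=["a", "b"]).instruction_id_list)
```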

AshD commented 1 month ago

Having the exact same issue ...

task: storycloze issue: Both Single GPU evaluation and multi-GPU evaluation are reporting the error:

File "/home/anaconda3/envs/ZYB/lib/python3.8/site-packages/datasets/builder.py", line 374, in __init__
    self.config, self.config_id = self._create_builder_config(
  File "/home/anaconda3/envs/ZYB/lib/python3.8/site-packages/datasets/builder.py", line 612, in _create_builder_config
    builder_config = self.BUILDER_CONFIG_CLASS(**config_kwargs)
  File "<string>", line 8, in __init__
  File "/home/anaconda3/envs/ZYB/lib/python3.8/site-packages/datasets/builder.py", line 131, in __post_init__
    if invalid_char in self.name:
TypeError: argument of type 'int' is not iterable
chen1yunan commented 3 weeks ago

Having the exact same issue ...

task: storycloze issue: Both Single GPU evaluation and multi-GPU evaluation are reporting the error:

File "/home/anaconda3/envs/ZYB/lib/python3.8/site-packages/datasets/builder.py", line 374, in __init__
    self.config, self.config_id = self._create_builder_config(
  File "/home/anaconda3/envs/ZYB/lib/python3.8/site-packages/datasets/builder.py", line 612, in _create_builder_config
    builder_config = self.BUILDER_CONFIG_CLASS(**config_kwargs)
  File "<string>", line 8, in __init__
  File "/home/anaconda3/envs/ZYB/lib/python3.8/site-packages/datasets/builder.py", line 131, in __post_init__
    if invalid_char in self.name:
TypeError: argument of type 'int' is not iterable

Same problem for me; it still doesn't work.