bigcode-project / bigcode-evaluation-harness

A framework for the evaluation of autoregressive code generation language models.
Apache License 2.0

error: list index out of range, when testing in multi-gpu? #105

Closed: wwngh1233 closed this issue 3 weeks ago

wwngh1233 commented 1 year ago

bigcode-evaluation-harness/lm_eval/utils.py:388 in complete_code

      385     if not INFILL_MODE:
      386         gen_code = gen_code[len(prefix):]
      387     if postprocess:
    ❱ 388         code_gens[sample].append(
      389             task.postprocess_generation(gen_code, int(sample))

loubnabnl commented 1 year ago

Can you provide the execution command you used, the output of accelerate env and full stack trace?

wwngh1233 commented 1 year ago

I put all the tasks into one script:

    accelerate launch main.py \
        --model $bath_path$Model \
        --tasks $python_tasks_easy,$math_tasks_greedy,$python_tasks_hard \
        --max_length_generation 512 \
        --temperature 1.0 \
        --do_sample False \
        --top_k 1 \
        --n_samples 1 \
        --batch_size 1 \
        --precision fp16 \
        --allow_code_execution \
        --save_generations \
        --save_generations_path results/$Model/$save_prefix.json

wwngh1233 commented 1 year ago

    python_tasks_easy="humaneval,mbpp"
    python_tasks_medium="ds1000-numpy-completion,ds1000-pandas-completion,ds1000-scipy-completion,ds1000-matplotlib-completion,ds1000-sklearn-completion,ds1000-pytorch-completion"
    # instruct-humaneval, instruct-humaneval-nocontext, ds1000-tensorflow-completion, ds1000-all-completion

    math_tasks_greedy="pal-gsm8k-greedy,pal-gsmhard-greedy"
    math_tasks_majority_voting="pal-gsm8k-majority_voting,pal-gsmhard-majority_voting"

    python_tasks_hard="apps-introductory,apps-interview,apps-competition"

loubnabnl commented 1 year ago

Can you make sure it runs properly for a single task, and then add tasks incrementally to find which one is causing the issue?

dlvp commented 12 months ago

I think I am experiencing a similar issue.

When running the following command on a single node with multiple GPUs (8):

    accelerate launch main.py \
        --model bigcode/santacoder \
        --task multiple-py,mbpp \
        --n_samples 1 \
        --batch_size 1 \
        --max_length_generation 50 \
        --temperature 0.2 \
        --trust_remote_code \
        --generation_only \
        --save_generations \
        --save_references

I get the following error:

Traceback (most recent call last):
  File "/home/bigcode-evaluation-harness/main.py", line 277, in <module>
    main()
  File "/home/bigcode-evaluation-harness/main.py", line 249, in main
    generations, references = evaluator.generate_text(task)
  File "/home/bigcode-evaluation-harness/lm_eval/evaluator.py", line 45, in generate_text
    generations = parallel_generations(
  File "/home/bigcode-evaluation-harness/lm_eval/generation.py", line 104, in parallel_generations
    generations = complete_code(
  File "/home/bigcode-evaluation-harness/lm_eval/utils.py", line 273, in complete_code
    code_gens[sample].append(
IndexError: list index out of range
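
For what it's worth, the failing line indexes a list of per-problem buckets, so any sample id at or beyond the allocated length reproduces the error at exactly this line. A minimal sketch of that failure mode (hypothetical values, not the harness code):

    # complete_code appends each generation into one list per problem,
    # indexed by a sample id; an id >= len(code_gens) raises IndexError.
    n_problems = 2                               # hypothetical: buckets allocated for one task
    code_gens = [[] for _ in range(n_problems)]  # one bucket per problem
    for sample, gen_code in [(0, "x"), (1, "y"), (2, "z")]:
        code_gens[sample].append(gen_code)       # sample 2 -> IndexError: list index out of range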

Other values of n_samples and batch_size make no difference (--max_length_generation 50 is just for speed in this example), and other task combinations give the same error. Running more than one task seems to be the trigger for me.

There is no error if I run only a single task.

loubnabnl commented 11 months ago

It seems there's an issue with the processes accessing different tasks simultaneously, and save_generations_path also needs to be separate for each task. Until this gets fixed, I suggest you evaluate a single task at a time and use a bash loop to go over multiple tasks instead of doing it inside the harness, since it was intended to run tasks sequentially anyway:

    tasks=(multiple-py multiple-java mbpp)

    for task in "${tasks[@]}"; do
        echo "Running task $task"
        accelerate launch main.py \
            --model bigcode/santacoder \
            --task $task \
            --n_samples 1 \
            --batch_size 1 \
            --max_length_generation 50 \
            --temperature 0.2 \
            --trust_remote_code \
            --generation_only \
            --save_generations \
            --save_generations_path generations_$task.json
    done
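
If you need everything in one file afterwards, you can merge the per-task outputs. A minimal sketch in Python, assuming the generations_$task.json files written by the loop above:

    import json

    tasks = ["multiple-py", "multiple-java", "mbpp"]
    merged = {}
    for task in tasks:
        # each file holds the generations saved for that task
        with open(f"generations_{task}.json") as f:
            merged[task] = json.load(f)

    with open("generations_all.json", "w") as f:
        json.dump(merged, f, indent=2)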