EleutherAI / lm-evaluation-harness

A framework for few-shot evaluation of language models.
https://www.eleuther.ai
MIT License

Accuracy gap between single GPU and multiple GPUs #1751

Open · HsuWanTing opened 6 months ago

HsuWanTing commented 6 months ago

I'm using lm-eval v0.4.2 to evaluate Llama 7B on the Open LLM Leaderboard benchmarks. I found accuracy gaps between a single GPU and multiple GPUs (using data parallelism), as shown below.

| Setup | Average | ARC-c | HellaSwag | MMLU | TruthfulQA | WinoGrande | GSM8K |
|---|---|---|---|---|---|---|---|
| 4 GPUs (batch size 4) | 46.58 | 50.85 | 78.13 | 35.14 | 34.08 | 71.82 | 9.48 |
| 4 GPUs (batch size 1) | 46.61 | 50.85 | 78.12 | 35.17 | 34.08 | 71.90 | 9.55 |
| 1 GPU (batch size 4) | 46.37 | 50.43 | 77.82 | 35.14 | 34.08 | 71.74 | 9.02 |
| 1 GPU (batch size 1) | 46.42 | 50.43 | 77.83 | 35.17 | 34.08 | 71.74 | 9.25 |

The single-GPU runs got overall lower accuracies: ARC-c, HellaSwag, and GSM8K drop by 0.3–0.5 points. I thought data parallelism only speeds up the evaluation. Where did the difference come from?

Below is the command line I used for ARC-c; I use CUDA_VISIBLE_DEVICES to control the number of GPUs.

```bash
accelerate launch --main_process_port $PORT -m lm_eval \
    --model hf \
    --model_args pretrained=huggyllama/llama-7b \
    --tasks arc_challenge \
    --num_fewshot 25 \
    --output_path $output_path \
    --batch_size $batch_size
```

LSinev commented 6 months ago

Thank you for your efforts! Great table with results to compare!

> Where did the difference come from?

Please check other issues/discussions about speed, batching, and multi-GPU usage for ideas, for example (but not limited to) https://github.com/EleutherAI/lm-evaluation-harness/issues/1625 and https://github.com/EleutherAI/lm-evaluation-harness/issues/704#issuecomment-1670189773.
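
The common thread in those discussions is that batch composition changes results slightly. With data parallelism, each process batches its own shard of the requests, so batch boundaries (and thus padding and kernel choices) differ from a single-GPU run. A toy sketch of that effect (hypothetical round-robin sharding, not the harness's actual splitting logic):

```python
# Toy sketch (hypothetical round-robin sharding, NOT the harness's actual
# splitting logic): sharding requests across processes changes which
# examples end up in the same batch.
requests = list(range(10))  # stand-ins for 10 evaluation requests

def batches(seq, bs):
    """Split a sequence into consecutive batches of size bs."""
    return [seq[i:i + bs] for i in range(0, len(seq), bs)]

print("1 GPU  :", batches(requests, 4))
for rank in range(4):  # pretend 4 data-parallel processes
    print(f"rank {rank}:", batches(requests[rank::4], 4))
```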

HsuWanTing commented 6 months ago

Thanks @LSinev for the quick reply. I've checked the issues you linked and also searched for others myself. Most of them focus on differences between batch sizes, which are usually very small. I understand that batching changes the order of operations, so the log-likelihoods can differ slightly; that small difference is acceptable to me.
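
For instance, here is a minimal sketch of the numerical effect I mean (not the harness's code; the exact discrepancy depends on hardware and backend, and may even be exactly zero on some):

```python
import torch

# Minimal demonstration (not the harness's code): the same rows pushed
# through a matmul as one batch vs. one-by-one can differ in the last
# float bits, because different batch sizes may dispatch to different
# kernels with different reduction orders.
torch.manual_seed(0)
W = torch.randn(4096, 4096)
x = torch.randn(4, 4096)

batched = x @ W                               # one matmul over the batch
single = torch.stack([row @ W for row in x])  # per-example matmuls
print((batched - single).abs().max())         # often small but nonzero
```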

However, I didn't find any issue about differences between different numbers of GPUs. In my case, the accuracy drops by 0.3–0.5 points when using a single GPU, which I think is quite a large drop. Is this also an expected result?

LSinev commented 6 months ago

> Is this also an expected result?

No idea. According to the results in your table, it is also a task-dependent issue. You may want to research this further by diving deep into the code and adding logging. Maybe even use not-yet-merged PRs like https://github.com/EleutherAI/lm-evaluation-harness/pull/1731 to check the consistency of tokenization (maybe special tokens are added incorrectly somewhere), the propagation of seeds, the splitting (and restoring of the original order) of batches, and so on.
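
For example, if you rerun both configurations with --log_samples, you could diff the per-sample outputs directly. A rough sketch, assuming the dumps are JSON Lines keyed by doc_id with a resps field; the file paths here are placeholders, and the exact schema may differ in your version:

```python
import json

def load_samples(path):
    """Load a per-sample dump (assumed JSON Lines) keyed by doc_id."""
    with open(path) as f:
        return {s["doc_id"]: s for s in map(json.loads, f)}

# Placeholder paths for dumps produced with --output_path/--log_samples.
a = load_samples("out_1gpu/samples_arc_challenge.jsonl")
b = load_samples("out_4gpu/samples_arc_challenge.jsonl")

# Report documents whose raw model responses (log-likelihoods) differ.
for doc_id in sorted(a.keys() & b.keys()):
    if a[doc_id]["resps"] != b[doc_id]["resps"]:
        print(doc_id, a[doc_id]["resps"], b[doc_id]["resps"])
```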

If batch size alone already makes a difference, I suppose multiple GPUs may introduce even more.