EleutherAI / lm-evaluation-harness

A framework for few-shot evaluation of language models.
https://www.eleuther.ai
MIT License

Performance drops significantly when the batch_size increases #2498

Open · yushengsu-thu opened 2 weeks ago

yushengsu-thu commented 2 weeks ago

Hello, I'm using the latest version of lm_eval (v0.4.3) and I've found a weird phenomenon with Llama-3.2-3B. The following is my script:

BATCH_SIZE=256

torchrun --nproc-per-node=8 --no-python lm_eval \
    --model_args pretrained=meta-llama/Llama-3.2-3B \
    --tasks gsm8k_cot \
    --batch_size $BATCH_SIZE
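For reference, the harness also accepts --batch_size auto, which automatically picks the largest batch size that fits in memory; a minimal variant of the command above, with the other flags unchanged:

    # Variant: let lm_eval choose the batch size itself ("auto" is a
    # documented value for --batch_size in lm-evaluation-harness).
    torchrun --nproc-per-node=8 --no-python lm_eval \
        --model_args pretrained=meta-llama/Llama-3.2-3B \
        --tasks gsm8k_cot \
        --batch_size auto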

[llama-3.2-3B] Nodes=1, GPUs=8

batch_size=1
|  Tasks  |Version|     Filter     |n-shot|  Metric   |   |Value |   |Stderr|
|---------|------:|----------------|-----:|-----------|---|-----:|---|-----:|
|gsm8k_cot|      3|flexible-extract|     8|exact_match|↑  |0.2987|±  |0.0126|
|         |       |strict-match    |     8|exact_match|↑  |0.2835|±  |0.0124|

batch_size=32
|  Tasks  |Version|     Filter     |n-shot|  Metric   |   |Value |   |Stderr|
|---------|------:|----------------|-----:|-----------|---|-----:|---|-----:|
|gsm8k_cot|      3|flexible-extract|     8|exact_match|↑  |0.2790|±  |0.0124|
|         |       |strict-match    |     8|exact_match|↑  |0.2616|±  |0.0121|

batch_size=128
|  Tasks  |Version|     Filter     |n-shot|  Metric   |   |Value |   |Stderr|
|---------|------:|----------------|-----:|-----------|---|-----:|---|-----:|
|gsm8k_cot|      3|flexible-extract|     8|exact_match|↑  |0.1304|±  |0.0093|
|         |       |strict-match    |     8|exact_match|↑  |0.1221|±  |0.0090|

batch_size=256
|  Tasks  |Version|     Filter     |n-shot|  Metric   |   |Value |   |Stderr|
|---------|------:|----------------|-----:|-----------|---|-----:|---|-----:|
|gsm8k_cot|      3|flexible-extract|     8|exact_match|↑  |0.0409|±  |0.0055|
|         |       |strict-match    |     8|exact_match|↑  |0.0364|±  |0.0052|

Package versions: accelerate 1.0.1
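A single-GPU control run with the same flags would help separate the effect of large batches from the 8-way data parallelism; a minimal sketch, assuming the default hf backend:

    # Single-GPU control run: same model, task, and batch size, no torchrun.
    lm_eval \
        --model hf \
        --model_args pretrained=meta-llama/Llama-3.2-3B \
        --tasks gsm8k_cot \
        --batch_size 256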

baberabb commented 2 weeks ago

Hi! I can't reproduce this (1 GPU). Are you using the latest transformers?
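One quick way to list the versions in question, assuming a standard pip install (package names as published on PyPI):

    # Print the installed versions of the packages relevant to this report.
    pip show transformers accelerate lm_eval | grep -E "^(Name|Version)"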

hf (pretrained=meta-llama/Llama-3.2-3B), gen_kwargs: (None), limit: None, num_fewshot: None, batch_size: 256

|  Tasks  |Version|     Filter     |n-shot|  Metric   |   |Value |   |Stderr|
|---------|------:|----------------|-----:|-----------|---|-----:|---|-----:|
|gsm8k_cot|      3|flexible-extract|     8|exact_match|↑  |0.2980|±  |0.0126|
|         |       |strict-match    |     8|exact_match|↑  |0.2828|±  |0.0124|