EleutherAI / lm-evaluation-harness

A framework for few-shot evaluation of language models.
https://www.eleuther.ai
MIT License

Performance drops significantly when the batch_size increases #2498

Open · yushengsu-thu opened 2 weeks ago

yushengsu-thu commented 2 weeks ago

Hello, I'm using the latest version of lm_eval (v0.4.3) and I've found a weird phenomenon with Llama-3.2-3B. The following is my script:

BATCH_SIZE=256

torchrun --nproc-per-node=8 --no-python lm_eval \
    --model_args pretrained=meta-llama/Llama-3.2-3B \
    --tasks gsm8k_cot \
    --batch_size $BATCH_SIZE
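For reference, the harness also accepts --batch_size auto, which automatically picks the largest batch size that fits in memory; a minimal variant of the command above, with the other flags unchanged:

    # Variant: let lm_eval choose the batch size itself ("auto" is a
    # documented value for --batch_size in lm-evaluation-harness).
    torchrun --nproc-per-node=8 --no-python lm_eval \
        --model_args pretrained=meta-llama/Llama-3.2-3B \
        --tasks gsm8k_cot \
        --batch_size auto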

[llama-3.2-3B] Nodes=1, GPUs=8

batch_size=1
|  Tasks  |Version|     Filter     |n-shot|  Metric   |   |Value |   |Stderr|
|---------|------:|----------------|-----:|-----------|---|-----:|---|-----:|
|gsm8k_cot|      3|flexible-extract|     8|exact_match|↑  |0.2987|±  |0.0126|
|         |       |strict-match    |     8|exact_match|↑  |0.2835|±  |0.0124|

batch_size=32
|  Tasks  |Version|     Filter     |n-shot|  Metric   |   |Value |   |Stderr|
|---------|------:|----------------|-----:|-----------|---|-----:|---|-----:|
|gsm8k_cot|      3|flexible-extract|     8|exact_match|↑  |0.2790|±  |0.0124|
|         |       |strict-match    |     8|exact_match|↑  |0.2616|±  |0.0121|

batch_size=128
|  Tasks  |Version|     Filter     |n-shot|  Metric   |   |Value |   |Stderr|
|---------|------:|----------------|-----:|-----------|---|-----:|---|-----:|
|gsm8k_cot|      3|flexible-extract|     8|exact_match|↑  |0.1304|±  |0.0093|
|         |       |strict-match    |     8|exact_match|↑  |0.1221|±  |0.0090|

batch_size=256
|  Tasks  |Version|     Filter     |n-shot|  Metric   |   |Value |   |Stderr|
|---------|------:|----------------|-----:|-----------|---|-----:|---|-----:|
|gsm8k_cot|      3|flexible-extract|     8|exact_match|↑  |0.0409|±  |0.0055|
|         |       |strict-match    |     8|exact_match|↑  |0.0364|±  |0.0052|

Package versions: accelerate 1.0.1
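A single-GPU control run with the same flags would help separate the effect of large batches from the 8-way data parallelism; a minimal sketch, assuming the default hf backend:

    # Single-GPU control run: same model, task, and batch size, no torchrun.
    lm_eval \
        --model hf \
        --model_args pretrained=meta-llama/Llama-3.2-3B \
        --tasks gsm8k_cot \
        --batch_size 256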

baberabb commented 2 weeks ago

Hi! I can't reproduce this (1 GPU). Are you using the latest transformers?
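One quick way to list the versions in question, assuming a standard pip install (package names as published on PyPI):

    # Print the installed versions of the packages relevant to this report.
    pip show transformers accelerate lm_eval | grep -E "^(Name|Version)"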

hf (pretrained=meta-llama/Llama-3.2-3B), gen_kwargs: (None), limit: None, num_fewshot: None, batch_size: 256

|  Tasks  |Version|     Filter     |n-shot|  Metric   |   |Value |   |Stderr|
|---------|------:|----------------|-----:|-----------|---|-----:|---|-----:|
|gsm8k_cot|      3|flexible-extract|     8|exact_match|↑  |0.2980|±  |0.0126|
|         |       |strict-match    |     8|exact_match|↑  |0.2828|±  |0.0124|