bigcode-project / bigcode-evaluation-harness

A framework for the evaluation of autoregressive code generation language models.
Apache License 2.0
702 stars 180 forks

Out-of-memory of multi-gpu evaluation #191

Closed yifan-bao closed 1 week ago

yifan-bao commented 5 months ago

I configured accelerate config correctly, but I get an out-of-memory error. When I examine GPU usage, I can see that all 4 processes are using the first GPU. I'm sure my model fits on one card: single-card evaluation runs correctly. The following is my accelerate config:

compute_environment: LOCAL_MACHINE
debug: false
distributed_type: MULTI_GPU
downcast_bf16: 'no'
gpu_ids: all
machine_rank: 0
main_training_function: main
mixed_precision: 'no'
num_machines: 1
num_processes: 4
rdzv_backend: static
same_network: true
tpu_env: []
tpu_use_cluster: false
tpu_use_sudo: false
use_cpu: false
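As a sanity check on the symptom above (all four processes on GPU 0): accelerate launch exports LOCAL_RANK to every spawned worker, and each worker should place its model replica on that GPU. A hard-coded .to("cuda") or a fixed device anywhere in the pipeline would send every rank to cuda:0. A minimal sketch of the expected mapping (illustrative only, no GPU required):

```python
import os

def expected_device() -> str:
    # accelerate launch sets LOCAL_RANK per worker; the worker should
    # target that GPU rather than a hard-coded cuda:0.
    rank = int(os.environ.get("LOCAL_RANK", "0"))
    return f"cuda:{rank}"

# Simulating the four workers from num_processes: 4 — each should
# resolve to a distinct device, cuda:0 through cuda:3.
for r in range(4):
    os.environ["LOCAL_RANK"] = str(r)
    print(expected_device())
```

If nvidia-smi shows all four processes on GPU 0 instead, the rank-to-device mapping is being bypassed somewhere.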

The following is my running script:

accelerate launch main.py \
  --model meta-llama/Llama-2-7b-hf \
  --tasks mbpp \
  --max_length_generation 650 \
  --temperature 0.8 \
  --do_sample True \
  --n_samples 10 \
  --batch_size 10 \
  --allow_code_execution \
  --save_generations \
  --generation_only \
  --save_generations_path generations_llama-7b.json
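For context on how much memory this command asks for beyond the weights, the KV cache alone can be estimated. This is an illustrative back-of-envelope sketch; the layer count and hidden size are Llama-2-7B's published configuration (32 layers, hidden size 4096, standard multi-head attention), not values stated in this thread:

```python
def kv_cache_gib(batch_size: int, seq_len: int, n_layers: int,
                 hidden_size: int, bytes_per_elem: int) -> float:
    # Keys + values (factor 2), one entry per layer, token, and channel.
    # Llama-2-7B uses standard multi-head attention, so the KV width
    # equals the full hidden size.
    return (2 * n_layers * batch_size * seq_len * hidden_size
            * bytes_per_elem) / 1024**3

# batch_size 10, max_length_generation 650, fp16 cache elements:
print(f"{kv_cache_gib(10, 650, 32, 4096, 2):.2f} GiB")  # ~3.17 GiB
```

Roughly 3 GiB of cache per process on top of the weights, multiplied by however many processes end up sharing one GPU.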
loubnabnl commented 5 months ago

It seems accelerate is correctly configured to use 4 GPUs. Maybe the model fits on one GPU, but there isn't enough memory left for a batch size of 10? Can you try lowering it and using mixed precision via --precision bf16 or --precision fp16?
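A rough back-of-envelope for the precision suggestion (weight-only footprint; this sketch ignores activations, KV cache, and framework overhead):

```python
def weight_gib(n_params: float, bytes_per_param: int) -> float:
    # Memory for the model parameters alone.
    return n_params * bytes_per_param / 1024**3

n = 7e9  # Llama-2-7B parameter count
print(f"fp32: {weight_gib(n, 4):.1f} GiB")       # ~26.1 GiB
print(f"bf16/fp16: {weight_gib(n, 2):.1f} GiB")  # ~13.0 GiB
```

Halving the precision halves the weight footprint, which matters twice over here: per process, and again if several processes end up stacked on the same GPU.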