Open xiangchen-zhao opened 1 year ago
Hello, we have tested deepspeed==0.9.0 and it throws the same error you are seeing. We suspect it is a compatibility bug between the latest accelerate and deepspeed.
You could use deepspeed==0.8.3 with accelerate==0.18.0; that combination runs successfully on our GPU cluster and should run on yours as well.
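If it helps, pinning those exact versions is just (assuming a pip-based environment; adjust for conda as needed):

pip install deepspeed==0.8.3 accelerate==0.18.0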
Thanks, it works
same error
same error
I tried this version, but it didn't work for me.
In my case, initializing TrainingArguments() before from_pretrained() caused an error, but swapping the order eliminated it (see the sketch below).
accelerate 0.23.0, deepspeed 0.11.0
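A minimal sketch of that ordering with the plain transformers API (the model name, output dir, and ds_config.json path are placeholders, not taken from this repo):

from transformers import AutoModelForCausalLM, TrainingArguments

# Load the model first ...
model = AutoModelForCausalLM.from_pretrained("gpt2")

# ... and only then build TrainingArguments pointing at the DeepSpeed config.
training_args = TrainingArguments(
    output_dir="output",
    deepspeed="ds_config.json",
)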
As for my case, I found that I had forgotten to set the relevant parameters in my deepspeed_config to "auto".
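For reference, a sketch of the batch-related entries in a DeepSpeed JSON config with everything delegated to "auto" (the ZeRO section is only illustrative; your actual config will differ):

{
  "train_batch_size": "auto",
  "train_micro_batch_size_per_gpu": "auto",
  "gradient_accumulation_steps": "auto",
  "zero_optimization": {
    "stage": 3,
    "offload_optimizer": { "device": "cpu" }
  }
}

The "auto" values are filled in from the training arguments by the Hugging Face / accelerate integration rather than by DeepSpeed itself.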
I am trying to run the scripts you provide in "Huggingface Accelerate Integration of Deepspeed":
accelerate launch --config_file configs/accelerate/dist_8gpus_zero3_offload.yaml main_accelerate.py --cfg configs/internimage_t_1k_224.yaml --data-path /mnt/lustre/share/images --batch-size 128 --accumulation-steps 4 --output output_zero3_offload
However, I got an AssertionError: "Check batch related parameters. train_batch_size is not equal to micro_batch_per_gpu * gradient_acc_step * world_size 4096 != 128 * 4 * 1". I found this error is caused by an exception in the deepspeed package (deepspeed/runtime/config.py, line 691).
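For reference, that assertion checks that train_batch_size == micro_batch_per_gpu * gradient_accumulation_steps * world_size. With --batch-size 128, --accumulation-steps 4, and 8 GPUs the expected product is 128 * 4 * 8 = 4096, which matches the configured train_batch_size, so the reported 128 * 4 * 1 suggests DeepSpeed is seeing a world size of 1 instead of 8, i.e. the launch settings and the batch-related config values are not being reconciled.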
I'm wondering if it's a version issue. Could you give the versions of accelerate and deepspeed you used? Thanks!
My environment: CUDA 11.3, torch 1.11.0+cu113, Python 3.7.16, accelerate 0.18.0, deepspeed 0.9.0, Ubuntu 18.04, 8 x NVIDIA A10G, NVIDIA-SMI 510.47.03