Open xiangchen-zhao opened 1 year ago
Hello, we have tested deepspeed==0.9.0 and it throws the same error you are seeing. We suspect it is a compatibility bug between the latest accelerate and deepspeed.
You could use deepspeed==0.8.3 with accelerate==0.18.0; that combination runs successfully on our GPU cluster and should run on yours as well.
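If it helps, pinning those exact versions is just (assuming a pip-based environment; adjust for conda as needed):

pip install deepspeed==0.8.3 accelerate==0.18.0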
Thanks, it works
same error
same error
I tried this version, but it didn't work for me.
In my case, initializing TrainingArguments() before from_pretrained() caused an error, but swapping the order eliminated it (see the sketch below).
accelerate 0.23.0, deepspeed 0.11.0
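A minimal sketch of that ordering with the plain transformers API (the model name, output dir, and ds_config.json path are placeholders, not taken from this repo):

from transformers import AutoModelForCausalLM, TrainingArguments

# Load the model first ...
model = AutoModelForCausalLM.from_pretrained("gpt2")

# ... and only then build TrainingArguments pointing at the DeepSpeed config.
training_args = TrainingArguments(
    output_dir="output",
    deepspeed="ds_config.json",
)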
As for my case, I found that I had forgotten to set the relevant parameters in my deepspeed_config to "auto".
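For reference, a sketch of the batch-related entries in a DeepSpeed JSON config with everything delegated to "auto" (the ZeRO section is only illustrative; your actual config will differ):

{
  "train_batch_size": "auto",
  "train_micro_batch_size_per_gpu": "auto",
  "gradient_accumulation_steps": "auto",
  "zero_optimization": {
    "stage": 3,
    "offload_optimizer": { "device": "cpu" }
  }
}

The "auto" values are filled in from the training arguments by the Hugging Face / accelerate integration rather than by DeepSpeed itself.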
I am trying to run the scripts you provide in "Huggingface Accelerate Integration of Deepspeed":
accelerate launch --config_file configs/accelerate/dist_8gpus_zero3_offload.yaml main_accelerate.py --cfg configs/internimage_t_1k_224.yaml --data-path /mnt/lustre/share/images --batch-size 128 --accumulation-steps 4 --output output_zero3_offload
However, I got an AssertionError: "Check batch related parameters. train_batch_size is not equal to micro_batch_per_gpu * gradient_acc_step * world_size 4096 != 128 * 4 * 1". I found this error is caused by an exception in the deepspeed package (deepspeed/runtime/config.py, line 691).
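For reference, that assertion checks that train_batch_size == micro_batch_per_gpu * gradient_accumulation_steps * world_size. With --batch-size 128, --accumulation-steps 4, and 8 GPUs the expected product is 128 * 4 * 8 = 4096, which matches the configured train_batch_size, so the reported 128 * 4 * 1 suggests DeepSpeed is seeing a world size of 1 instead of 8, i.e. the launch settings and the batch-related config values are not being reconciled.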
I'm wondering if it's a version issue. Could you give the versions of accelerate and deepspeed you used? Thanks!
My environment: CUDA 11.3, torch 1.11.0+cu113, Python 3.7.16, accelerate 0.18.0, deepspeed 0.9.0, Ubuntu 18.04, 8 x NVIDIA A10G, NVIDIA-SMI 510.47.03