I am trying to train using a multi-GPU setup with DDP (launching with `accelerate launch`), but I am noticing that the loss values are significantly different from a single-GPU setup with the same effective batch size.
I have attached the eval/loss curves below.
In purple is a single-GPU run with `per_device_train_batch_size=16`.
In blue is a multi-GPU run with 8 GPUs and `per_device_train_batch_size=2` (only trained for a few steps).
All other hyperparameters are the same.
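For reference, here is a minimal sketch of the two configurations being compared (the `TrainingArguments` below are placeholders; my real script sets more options, and the output directories are made up):

```python
from transformers import TrainingArguments

# Run in purple: single GPU, effective batch size = 1 * 16 = 16
args_single_gpu = TrainingArguments(
    output_dir="out-single-gpu",      # placeholder path
    per_device_train_batch_size=16,
)

# Run in blue: 8 GPUs via DDP (accelerate launch --num_processes 8),
# effective batch size = 8 * 2 = 16, i.e. the same as above
args_multi_gpu = TrainingArguments(
    output_dir="out-multi-gpu",       # placeholder path
    per_device_train_batch_size=2,
)
```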
I am wondering why the loss values in the multi-GPU run (blue) seem to be much smaller than in the single-GPU run (purple)? Any suggestions are much appreciated!