I am trying to train using a multi-GPU setup with DDP (launching with `accelerate launch`), but I am noticing that the loss values are significantly different from a single-GPU setup with the same effective batch size.
I have attached the eval/loss curves below.
In purple is a single-GPU run with `per_device_train_batch_size=16`.
In blue is a multi-GPU run with 8 GPUs and `per_device_train_batch_size=2` (only trained for a few steps).
All other hyperparameters are the same.
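For reference, here is a minimal sketch of the two configurations being compared (the `TrainingArguments` below are placeholders; my real script sets more options, and the output directories are made up):

```python
from transformers import TrainingArguments

# Run in purple: single GPU, effective batch size = 1 * 16 = 16
args_single_gpu = TrainingArguments(
    output_dir="out-single-gpu",      # placeholder path
    per_device_train_batch_size=16,
)

# Run in blue: 8 GPUs via DDP (accelerate launch --num_processes 8),
# effective batch size = 8 * 2 = 16, i.e. the same as above
args_multi_gpu = TrainingArguments(
    output_dir="out-multi-gpu",       # placeholder path
    per_device_train_batch_size=2,
)
```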
I am wondering why the loss values in the multi-GPU run (blue) seem to be much smaller than in the single-GPU run (purple)? Any suggestions are much appreciated!