artidoro / qlora

QLoRA: Efficient Finetuning of Quantized LLMs
https://arxiv.org/abs/2305.14314
MIT License

Multi-GPU Training Giving Different Loss #276

Open nikhil-ghosh-berkeley opened 10 months ago

nikhil-ghosh-berkeley commented 10 months ago

I am trying to train using a multi-GPU setup with DDP (launching with accelerate launch), but I am noticing that the loss values are significantly different from a single-GPU setup with the same effective batch size.

I have attached the eval/loss curves below.

  1. In purple is a single-GPU run with per_device_train_batch_size=16
  2. In blue is a multi-GPU run with 8 GPUs and per_device_train_batch_size=2 (only trained for a few steps)

All other hyperparameters are the same.

[Screenshot (2023-11-16): eval/loss curves for the two runs]

Why do the loss values in (2) seem to be so much smaller than in (1)? Any suggestions are much appreciated!
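For reference, both runs have the same effective batch size (8 GPUs × per_device_train_batch_size=2 = 16, matching the single-GPU run at 16), so I'd expect the difference to come from how the loss is reduced and logged across processes rather than from the optimization itself. Below is a minimal, self-contained sketch (not code from this repo; the script name and the synthetic loss values are made up) showing how the loss a single rank logs under DDP can differ from the cross-rank mean unless it is explicitly all-reduced:

```python
# Minimal sketch: under DDP each process computes a loss on its own shard, so
# the value one rank logs can differ from the cross-rank mean that a single-GPU
# run over the full batch would report.
# Launch with e.g. `torchrun --nproc_per_node=8 check_loss_reduction.py`
# (script name and loss values are hypothetical).
import torch
import torch.distributed as dist


def main():
    backend = "nccl" if torch.cuda.is_available() else "gloo"
    dist.init_process_group(backend)
    rank = dist.get_rank()
    world_size = dist.get_world_size()

    if torch.cuda.is_available():
        device = torch.device(f"cuda:{rank % torch.cuda.device_count()}")
        torch.cuda.set_device(device)
    else:
        device = torch.device("cpu")

    # Pretend each rank evaluated its shard and got a slightly different loss.
    local_loss = torch.tensor(2.0 + 0.1 * rank, device=device)

    # Without an explicit reduction, each rank would log its own `local_loss`.
    # Summing across ranks and dividing by world_size recovers the mean loss
    # over the full effective batch, assuming equal shard sizes.
    global_loss = local_loss.clone()
    dist.all_reduce(global_loss, op=dist.ReduceOp.SUM)
    global_loss /= world_size

    if rank == 0:
        print(f"loss logged by rank 0 alone : {local_loss.item():.4f}")
        print(f"mean loss across {world_size} ranks: {global_loss.item():.4f}")

    dist.destroy_process_group()


if __name__ == "__main__":
    main()
```

One thing worth checking is whether the two curves compare the same quantity: Trainer-style eval loops usually gather losses across processes, but custom logging that only reads rank 0's local loss can diverge from the single-GPU curve.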

giaosudau commented 10 months ago

Hey @nikhil-ghosh-berkeley
Can you share the full command you use to launch qlora with multiple GPUs? I am training on 4 A100-40GB GPUs but got OOM.