artidoro / qlora

QLoRA: Efficient Finetuning of Quantized LLMs
https://arxiv.org/abs/2305.14314
MIT License
10.04k stars · 821 forks

multi gpu uneven VRAM utilization #240

Open ehartford opened 1 year ago

ehartford commented 1 year ago

Hello, when I train with multiple GPUs like this:

WORLD_SIZE=8 CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 torchrun --nproc_per_node=8 qlora.py \

Then I get uneven VRAM utilization:

[screenshot 2023-08-08_11-27-38: uneven VRAM utilization across the 8 GPUs]

This means I have to use a smaller batch size than I otherwise could, which makes my training run take about 30% longer than it should.

I don't have this problem when doing full-weight (non-QLoRA) multi-GPU training with accelerate or deepspeed.
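To put numbers on the imbalance, you can log per-GPU memory from inside the training process. A minimal sketch, assuming PyTorch is installed; `vram_report` is a hypothetical helper, not part of qlora.py:

```python
# Sketch: report allocated/reserved VRAM per visible GPU so uneven
# utilization shows up in the logs (assumption: PyTorch installed).
import torch

def vram_report():
    """Return a list of (device, allocated_GB, reserved_GB) tuples,
    one per visible CUDA device (empty list if CUDA is unavailable)."""
    stats = []
    for i in range(torch.cuda.device_count()):
        alloc = torch.cuda.memory_allocated(i) / 1024**3
        reserved = torch.cuda.memory_reserved(i) / 1024**3
        stats.append((f"cuda:{i}", round(alloc, 2), round(reserved, 2)))
    return stats

if __name__ == "__main__":
    for dev, alloc, res in vram_report():
        print(f"{dev}: allocated {alloc} GB, reserved {res} GB")
```

Calling this once per training step (on rank 0, gathering from all ranks) would show whether one rank is consistently holding more memory than the others.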

nickmitchko commented 1 year ago

What model / other parameters are you using with torchrun? I personally try to stay away from torchrun and use accelerate instead.

I'm having good success using this fork: https://github.com/ChrisHayduk/qlora-multi-gpu/

The GPUs show roughly even VRAM usage: 40.97 GB vs 40.95 GB.
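For reference, a roughly equivalent launch via accelerate might look like the following (a sketch only; it assumes accelerate is installed and that qlora.py takes the same script arguments as in the torchrun command above — `<model>` is a placeholder):

```shell
# Sketch: launch the same script with accelerate instead of torchrun.
# --num_processes 8 mirrors --nproc_per_node=8; arguments after
# qlora.py are passed through to the script unchanged.
accelerate launch --multi_gpu --num_processes 8 qlora.py \
    --model_name_or_path <model> \
    ...
```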
