artidoro / qlora

QLoRA: Efficient Finetuning of Quantized LLMs
https://arxiv.org/abs/2305.14314
MIT License
10.04k stars · 821 forks

multi gpu uneven VRAM utilization #240

Open ehartford opened 1 year ago

ehartford commented 1 year ago

Hello, when I train with multiple GPUs like this:

WORLD_SIZE=8 CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 torchrun --nproc_per_node=8 qlora.py \

Then I get uneven VRAM utilization:

[screenshot 2023-08-08_11-27-38: uneven VRAM utilization across the 8 GPUs]

This means I have to use a smaller batch size than I otherwise could, which makes my training run take about 30% longer than it should.

I don't have this problem when doing full-weight (non-QLoRA) multi-GPU training with accelerate or deepspeed.
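To put numbers on the imbalance, you can log per-GPU memory from inside the training process. A minimal sketch, assuming PyTorch is installed; `vram_report` is a hypothetical helper, not part of qlora.py:

```python
# Sketch: report allocated/reserved VRAM per visible GPU so uneven
# utilization shows up in the logs (assumption: PyTorch installed).
import torch

def vram_report():
    """Return a list of (device, allocated_GB, reserved_GB) tuples,
    one per visible CUDA device (empty list if CUDA is unavailable)."""
    stats = []
    for i in range(torch.cuda.device_count()):
        alloc = torch.cuda.memory_allocated(i) / 1024**3
        reserved = torch.cuda.memory_reserved(i) / 1024**3
        stats.append((f"cuda:{i}", round(alloc, 2), round(reserved, 2)))
    return stats

if __name__ == "__main__":
    for dev, alloc, res in vram_report():
        print(f"{dev}: allocated {alloc} GB, reserved {res} GB")
```

Calling this once per training step (on rank 0, gathering from all ranks) would show whether one rank is consistently holding more memory than the others.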

nickmitchko commented 1 year ago

What model / other parameters are you using with torchrun? I personally try to stay away from torchrun and use accelerate instead.

I'm having good success using this fork: https://github.com/ChrisHayduk/qlora-multi-gpu/

The GPUs show roughly even VRAM usage: 40.97 GB vs 40.95 GB.
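For reference, a roughly equivalent launch via accelerate might look like the following (a sketch only; it assumes accelerate is installed and that qlora.py takes the same script arguments as in the torchrun command above — `<model>` is a placeholder):

```shell
# Sketch: launch the same script with accelerate instead of torchrun.
# --num_processes 8 mirrors --nproc_per_node=8; arguments after
# qlora.py are passed through to the script unchanged.
accelerate launch --multi_gpu --num_processes 8 qlora.py \
    --model_name_or_path <model> \
    ...
```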
