artidoro / qlora

QLoRA: Efficient Finetuning of Quantized LLMs
https://arxiv.org/abs/2305.14314
MIT License

uneven distribution of GPU workload #262


liatamax commented 10 months ago

Hello,

Thanks so much for providing this resource so that we can all leverage the latest developments in AI on different platforms.

I was able to use your example to run a job that fine-tunes Llama-2 7B on an old server with 8 NVIDIA GTX 1080 GPUs, which is estimated to take 33 hours to finish. However, I noticed that not all of the GPUs are fully utilized, as you can see in the NVTOP screenshot below. Is there any configuration I can use to speed up the job?

```
python qlora/qlora.py \
    --model_name_or_path llama-2-7b-HF/ --use_auth \
    --output_dir llama-2-guanaco-7b \
    --logging_steps 10 --save_strategy steps --data_seed 42 \
    --save_steps 500 --save_total_limit 40 \
    --evaluation_strategy steps --eval_dataset_size 1024 --max_eval_samples 1000 \
    --per_device_eval_batch_size 1 --max_new_tokens 32 \
    --dataloader_num_workers 1 --group_by_length --logging_strategy steps \
    --remove_unused_columns False --do_train --do_eval \
    --lora_r 64 --lora_alpha 16 --lora_modules all \
    --double_quant --quant_type nf4 --fp16 --bits 4 \
    --warmup_ratio 0.03 --lr_scheduler_type constant --gradient_checkpointing \
    --dataset oasst1 --source_max_len 16 --target_max_len 512 \
    --per_device_train_batch_size 1 --gradient_accumulation_steps 16 \
    --max_steps 1875 --eval_steps 187 --learning_rate 0.0002 \
    --adam_beta2 0.999 --max_grad_norm 0.3 --lora_dropout 0.1 \
    --weight_decay 0.0 --seed 0 --max_memory_MB 10000
```

[NVTOP screenshot showing that the GPUs are not all fully utilized]
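
For context, a plausible reason for the pattern in the screenshot: when qlora.py runs as a single process on a multi-GPU machine, the quantized model is typically sharded across all visible GPUs with `device_map="auto"` and a per-GPU memory budget (the `--max_memory_MB` flag above). That is naive model parallelism, so only the GPU holding the currently executing layers is busy while the others wait. A minimal sketch of that loading pattern, assuming the usual transformers/bitsandbytes API (model path and values taken from the command above, not copied from qlora.py itself):

```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

# One process, eight GPUs: the 4-bit model is split layer-wise across all of them.
max_memory_mb = 10000  # mirrors --max_memory_MB 10000
n_gpus = torch.cuda.device_count()
max_memory = {i: f"{max_memory_mb}MB" for i in range(n_gpus)}

model = AutoModelForCausalLM.from_pretrained(
    "llama-2-7b-HF/",
    device_map="auto",      # shards layers over all visible GPUs
    max_memory=max_memory,  # per-GPU memory budget
    quantization_config=BitsAndBytesConfig(
        load_in_4bit=True,                     # --bits 4
        bnb_4bit_use_double_quant=True,        # --double_quant
        bnb_4bit_quant_type="nf4",             # --quant_type nf4
        bnb_4bit_compute_dtype=torch.float16,  # --fp16
    ),
)
# With this layout the forward/backward pass walks through the GPU shards one at a
# time, which produces the "one busy GPU, several idle" picture seen in NVTOP.
```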

ichsan2895 commented 10 months ago

Please check this post; hopefully it solves the problem: Multi-gpu training example?
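
For readers landing here later: the linked thread is about switching from the single-process, model-sharded setup to a data-parallel launch with one training process per GPU, so every GPU holds its own 4-bit copy of the model and does useful work in parallel. A rough sketch of that idea, assuming a torchrun/accelerate-style launcher sets LOCAL_RANK and that the training script is adapted accordingly (this is not the exact code from the linked issue):

```python
import os
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

# One training process per GPU, e.g. launched with:
#   torchrun --nproc_per_node=8 qlora/qlora.py ...
local_rank = int(os.environ.get("LOCAL_RANK", 0))
torch.cuda.set_device(local_rank)

model = AutoModelForCausalLM.from_pretrained(
    "llama-2-7b-HF/",
    device_map={"": local_rank},  # pin the whole model to this process's GPU
    quantization_config=BitsAndBytesConfig(
        load_in_4bit=True,
        bnb_4bit_use_double_quant=True,
        bnb_4bit_quant_type="nf4",
        bnb_4bit_compute_dtype=torch.float16,
    ),
)
# The HF Trainer (or accelerate) then wraps each per-process model in DDP, so all
# eight GPUs compute gradients simultaneously instead of taking turns.
```

Whether a full 4-bit copy of the model plus activations fits in each GPU's memory on this hardware is a separate question and would need to be checked.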