artidoro / qlora

QLoRA: Efficient Finetuning of Quantized LLMs
https://arxiv.org/abs/2305.14314
MIT License

uneven distribution of GPU workload #262


liatamax commented 10 months ago

Hello,

Thanks so much for providing this resource so that we can all leverage the latest developments in AI on different platforms.

I was able to use your example to run a job that fine-tunes Llama-2 7B on an old server with 8 NVIDIA GTX 1080 GPUs, which is estimated to take 33 hours to finish. However, I noticed that not all of the GPUs are fully utilized, as you can see in the NVTOP screenshot below. Is there any configuration I can use to speed up the job?

```
python qlora/qlora.py \
    --model_name_or_path llama-2-7b-HF/ --use_auth \
    --output_dir llama-2-guanaco-7b \
    --logging_steps 10 --save_strategy steps --data_seed 42 \
    --save_steps 500 --save_total_limit 40 \
    --evaluation_strategy steps --eval_dataset_size 1024 --max_eval_samples 1000 \
    --per_device_eval_batch_size 1 --max_new_tokens 32 \
    --dataloader_num_workers 1 --group_by_length --logging_strategy steps \
    --remove_unused_columns False --do_train --do_eval \
    --lora_r 64 --lora_alpha 16 --lora_modules all \
    --double_quant --quant_type nf4 --fp16 --bits 4 \
    --warmup_ratio 0.03 --lr_scheduler_type constant --gradient_checkpointing \
    --dataset oasst1 --source_max_len 16 --target_max_len 512 \
    --per_device_train_batch_size 1 --gradient_accumulation_steps 16 \
    --max_steps 1875 --eval_steps 187 --learning_rate 0.0002 \
    --adam_beta2 0.999 --max_grad_norm 0.3 --lora_dropout 0.1 \
    --weight_decay 0.0 --seed 0 --max_memory_MB 10000
```

[NVTOP screenshot showing that the GPUs are not all fully utilized]
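
For context, a plausible reason for the pattern in the screenshot: when qlora.py runs as a single process on a multi-GPU machine, the quantized model is typically sharded across all visible GPUs with `device_map="auto"` and a per-GPU memory budget (the `--max_memory_MB` flag above). That is naive model parallelism, so only the GPU holding the currently executing layers is busy while the others wait. A minimal sketch of that loading pattern, assuming the usual transformers/bitsandbytes API (model path and values taken from the command above, not copied from qlora.py itself):

```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

# One process, eight GPUs: the 4-bit model is split layer-wise across all of them.
max_memory_mb = 10000  # mirrors --max_memory_MB 10000
n_gpus = torch.cuda.device_count()
max_memory = {i: f"{max_memory_mb}MB" for i in range(n_gpus)}

model = AutoModelForCausalLM.from_pretrained(
    "llama-2-7b-HF/",
    device_map="auto",      # shards layers over all visible GPUs
    max_memory=max_memory,  # per-GPU memory budget
    quantization_config=BitsAndBytesConfig(
        load_in_4bit=True,                     # --bits 4
        bnb_4bit_use_double_quant=True,        # --double_quant
        bnb_4bit_quant_type="nf4",             # --quant_type nf4
        bnb_4bit_compute_dtype=torch.float16,  # --fp16
    ),
)
# With this layout the forward/backward pass walks through the GPU shards one at a
# time, which produces the "one busy GPU, several idle" picture seen in NVTOP.
```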

ichsan2895 commented 10 months ago

Please check this post; hopefully it solves the problem: Multi-gpu training example?
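
For readers landing here later: the linked thread is about switching from the single-process, model-sharded setup to a data-parallel launch with one training process per GPU, so every GPU holds its own 4-bit copy of the model and does useful work in parallel. A rough sketch of that idea, assuming a torchrun/accelerate-style launcher sets LOCAL_RANK and that the training script is adapted accordingly (this is not the exact code from the linked issue):

```python
import os
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

# One training process per GPU, e.g. launched with:
#   torchrun --nproc_per_node=8 qlora/qlora.py ...
local_rank = int(os.environ.get("LOCAL_RANK", 0))
torch.cuda.set_device(local_rank)

model = AutoModelForCausalLM.from_pretrained(
    "llama-2-7b-HF/",
    device_map={"": local_rank},  # pin the whole model to this process's GPU
    quantization_config=BitsAndBytesConfig(
        load_in_4bit=True,
        bnb_4bit_use_double_quant=True,
        bnb_4bit_quant_type="nf4",
        bnb_4bit_compute_dtype=torch.float16,
    ),
)
# The HF Trainer (or accelerate) then wraps each per-process model in DDP, so all
# eight GPUs compute gradients simultaneously instead of taking turns.
```

Whether a full 4-bit copy of the model plus activations fits in each GPU's memory on this hardware is a separate question and would need to be checked.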