AnswerDotAI / fsdp_qlora

Training LLMs with QLoRA + FSDP

Fine tuning only runs on CPU #29

Open diabeticpilot opened 7 months ago

diabeticpilot commented 7 months ago

Hello,

I am running this on a few 2x 4090 cloud instances on Vast to test and benchmark. Most machines work without issue, but on certain machines I have noticed that the GPUs are never used and the fine-tuning runs on the CPU only. Llama 2 70B gets 15-18s/it on most instances; on the ones where the GPUs are not used, it is 800s/it.

nvidia-smi is showing no active processes and 0% on both GPUs. Any idea on how to troubleshoot or fix this issue?

Here is how I am running it and all the settings:

```bash
export CUDA_VISIBLE_DEVICES=1,0

python train.py \
    --model_name meta-llama/Llama-2-70b-hf \
    --batch_size 2 \
    --context_length 2048 \
    --precision bf16 \
    --train_type qlora \
    --use_gradient_checkpointing true \
    --use_cpu_offload true \
    --dataset alpaca \
    --reentrant_checkpointing true
```

Performance: [42:45<2887:27:12, 803.50s/it]

nvidia-smi:

```
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.129.03             Driver Version: 535.129.03   CUDA Version: 12.2     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  NVIDIA GeForce RTX 4090        On  | 00000000:41:00.0 Off |                  Off |
| 30%   29C    P8              20W / 450W |  10717MiB / 24564MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
|   1  NVIDIA GeForce RTX 4090        On  | 00000000:61:00.0 Off |                  Off |
| 30%   30C    P8              24W / 450W |  11015MiB / 24564MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+

+---------------------------------------------------------------------------------------+
| Processes:                                                                             |
|  GPU   GI   CI        PID   Type   Process name                            GPU Memory |
|        ID   ID                                                             Usage      |
|=======================================================================================|
+---------------------------------------------------------------------------------------+
```
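A quick sanity check (a rough sketch, assuming a standard PyTorch + CUDA setup in the same shell the training is launched from) is to confirm that the process can see the GPUs at all before digging into the training script:

```bash
# List the physical GPUs the driver reports
nvidia-smi -L

# Show what the environment currently restricts CUDA to (empty means all GPUs are visible)
echo "CUDA_VISIBLE_DEVICES=$CUDA_VISIBLE_DEVICES"

# Ask PyTorch what it can see; if this prints "False 0", the run can only use the CPU
python -c "import torch; print(torch.cuda.is_available(), torch.cuda.device_count())"
```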

johnowhitaker commented 7 months ago

I think on some shared machines `export CUDA_VISIBLE_DEVICES=1,0` might reference cards other than the ones you're assigned. (Don't quote me on this, but I think I just hit a similar issue.) Removing that and running the training script in a new shell where `CUDA_VISIBLE_DEVICES` isn't defined worked in my case.
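One minimal way to test that theory (assuming a bash shell; the train.py flags are the same as in the first post) is to compare what the variable points at with what the instance actually assigns, or simply clear it for a single run:

```bash
# What does the variable currently point at, and what GPUs does the driver actually report?
echo "CUDA_VISIBLE_DEVICES=$CUDA_VISIBLE_DEVICES"
nvidia-smi -L

# Clear it for this shell so CUDA enumerates whatever GPUs the instance actually assigns
unset CUDA_VISIBLE_DEVICES

# Or clear it for one command only, leaving the rest of the shell untouched
# (remaining train.py flags as in the original command above)
env -u CUDA_VISIBLE_DEVICES python train.py --model_name meta-llama/Llama-2-70b-hf --train_type qlora
```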

js-2024 commented 6 months ago

I'm having the same issue on Linux Mint with 7x3090. The behavior is almost identical to what diabeticpilot described above, right down to the GPUs loading up a little under 12G of VRAM each, then going dormant and the CPU going to max. CPU RAM was allocated around 128GB.

zhksh commented 6 months ago

> I'm having the same issue on Linux Mint with 7x3090. The behavior is almost identical to what diabeticpilot described above, right down to the GPUs loading up a little under 12G of VRAM each, then going dormant and the CPU going to max. CPU RAM was allocated around 128GB.

Same here: alternating usage of GPU (4x3090) and CPU (24 cores maxed out) training llama-3-8b, ~45s/it. No idea what's going on, but the loss is logged right after CPU usage drops and the GPU takes over; it feels like inference is done on the CPU and backprop on the GPU.

zhksh commented 6 months ago

Ok, sorry: `--use_cpu_offload false` helps. I assumed "false" was the default.
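For reference, the invocation from the first post with just that flag flipped would look roughly like this (everything else unchanged):

```bash
python train.py \
    --model_name meta-llama/Llama-2-70b-hf \
    --batch_size 2 \
    --context_length 2048 \
    --precision bf16 \
    --train_type qlora \
    --use_gradient_checkpointing true \
    --use_cpu_offload false \
    --dataset alpaca \
    --reentrant_checkpointing true
```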