AnswerDotAI / fsdp_qlora

Training LLMs with QLoRA + FSDP

Fine tuning only runs on CPU #29

Open diabeticpilot opened 7 months ago

diabeticpilot commented 7 months ago

Hello,

I am running this on a few 2x 4090 cloud instances on Vast to test and benchmark. Most machines work without issue, but on certain machines I have noticed that the GPUs are never used and the fine-tuning runs on the CPU only. Llama 2 70B gets 15-18s/it on most instances; on the ones where the GPUs are not used, it is 800s/it.

nvidia-smi is showing no active processes and 0% on both GPUs. Any idea on how to troubleshoot or fix this issue?

Here is how I am running it and all the settings:

```bash
export CUDA_VISIBLE_DEVICES=1,0

python train.py \
    --model_name meta-llama/Llama-2-70b-hf \
    --batch_size 2 \
    --context_length 2048 \
    --precision bf16 \
    --train_type qlora \
    --use_gradient_checkpointing true \
    --use_cpu_offload true \
    --dataset alpaca \
    --reentrant_checkpointing true
```

Performance: [42:45<2887:27:12, 803.50s/it]

nvidia-smi:

```
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.129.03             Driver Version: 535.129.03   CUDA Version: 12.2     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  NVIDIA GeForce RTX 4090        On  | 00000000:41:00.0 Off |                  Off |
| 30%   29C    P8              20W / 450W |  10717MiB / 24564MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
|   1  NVIDIA GeForce RTX 4090        On  | 00000000:61:00.0 Off |                  Off |
| 30%   30C    P8              24W / 450W |  11015MiB / 24564MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+

+---------------------------------------------------------------------------------------+
| Processes:                                                                             |
|  GPU   GI   CI        PID   Type   Process name                            GPU Memory |
|        ID   ID                                                             Usage      |
|=======================================================================================|
+---------------------------------------------------------------------------------------+
```
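A quick sanity check (a rough sketch, assuming a standard PyTorch + CUDA setup in the same shell the training is launched from) is to confirm that the process can see the GPUs at all before digging into the training script:

```bash
# List the physical GPUs the driver reports
nvidia-smi -L

# Show what the environment currently restricts CUDA to (empty means all GPUs are visible)
echo "CUDA_VISIBLE_DEVICES=$CUDA_VISIBLE_DEVICES"

# Ask PyTorch what it can see; if this prints "False 0", the run can only use the CPU
python -c "import torch; print(torch.cuda.is_available(), torch.cuda.device_count())"
```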

johnowhitaker commented 7 months ago

I think on some shared machines `export CUDA_VISIBLE_DEVICES=1,0` might reference cards other than the ones you're assigned. (Don't quote me on this, but I think I just hit a similar issue.) Removing that and running the training script in a new shell where `CUDA_VISIBLE_DEVICES` isn't defined worked in my case.
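One minimal way to test that theory (assuming a bash shell; the train.py flags are the same as in the first post) is to compare what the variable points at with what the instance actually assigns, or simply clear it for a single run:

```bash
# What does the variable currently point at, and what GPUs does the driver actually report?
echo "CUDA_VISIBLE_DEVICES=$CUDA_VISIBLE_DEVICES"
nvidia-smi -L

# Clear it for this shell so CUDA enumerates whatever GPUs the instance actually assigns
unset CUDA_VISIBLE_DEVICES

# Or clear it for one command only, leaving the rest of the shell untouched
# (remaining train.py flags as in the original command above)
env -u CUDA_VISIBLE_DEVICES python train.py --model_name meta-llama/Llama-2-70b-hf --train_type qlora
```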

js-2024 commented 6 months ago

I'm having the same issue on Linux Mint with 7x3090. The behavior is almost identical to what diabeticpilot described above, right down to the GPUs loading up a little under 12G of VRAM each, then going dormant and the CPU going to max. CPU RAM was allocated around 128GB.

zhksh commented 6 months ago

> I'm having the same issue on Linux Mint with 7x3090. The behavior is almost identical to what diabeticpilot described above, right down to the GPUs loading up a little under 12G of VRAM each, then going dormant and the CPU going to max. CPU RAM was allocated around 128GB.

Same here: alternating usage of GPU (4x3090) and CPU (24 cores maxed out) training llama-3-8b, ~45s/it. No idea what's going on, but the loss is logged right after CPU usage drops and the GPU takes over; it feels like inference is done on the CPU and backprop on the GPU.

zhksh commented 6 months ago

Ok, sorry: `--use_cpu_offload false` helps. I assumed "false" was the default.
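For reference, the invocation from the first post with just that flag flipped would look roughly like this (everything else unchanged):

```bash
python train.py \
    --model_name meta-llama/Llama-2-70b-hf \
    --batch_size 2 \
    --context_length 2048 \
    --precision bf16 \
    --train_type qlora \
    --use_gradient_checkpointing true \
    --use_cpu_offload false \
    --dataset alpaca \
    --reentrant_checkpointing true
```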