AnswerDotAI / fsdp_qlora

Training LLMs with QLoRA + FSDP
Apache License 2.0

Dual GPU training instantly powers off my desktop #41

Closed by rationalism 6 months ago

rationalism commented 7 months ago

Running fine-tuning with these settings makes my desktop instantly power off as soon as training starts:

#!/bin/bash

export CUDA_VISIBLE_DEVICES=0,1

python train.py \
--model_name mistralai/Mixtral-8x7B-v0.1 \
--batch_size 1 \
--context_length 2048 \
--precision bf16 \
--train_type qlora \
--use_gradient_checkpointing true \
--use_cpu_offload false \
--dataset alpaca \
--reentrant_checkpointing true \
--log_to wandb \
--gradient_accumulation_steps 4 \
--lr_scheduler linear \
--verbose false \
--lora_rank 8 \
--no_sync true

I have 2x 4090 GPUs, Ubuntu 22.04, PyTorch 2.2.1, CUDA 12.1, bitsandbytes 0.43.0, transformers 4.39.2. I'm pretty sure it's not a power supply or thermal issue, since I can run matrix multiplication benchmarks on both GPUs at once, with both drawing 450 watts, and that works fine. Training with naive model parallelism in text-generation-webui also works fine.
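
For reference, a minimal sketch of the kind of dual-GPU matmul load test I mean (not the exact benchmark, just an illustration):

#!/bin/bash
# Sketch: run a large bf16 matmul loop on each GPU in parallel for ~60 seconds.
for gpu in 0 1; do
  CUDA_VISIBLE_DEVICES=$gpu python -c "
import time, torch
a = torch.randn(8192, 8192, dtype=torch.bfloat16, device='cuda')
b = torch.randn(8192, 8192, dtype=torch.bfloat16, device='cuda')
t0 = time.time()
while time.time() - t0 < 60:
    (a @ b).sum().item()  # .item() forces a device sync each iteration
print('GPU $gpu finished without errors')
" &
done
wait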

geronimi73 commented 7 months ago

> I'm pretty sure it's not a power supply or thermal issue, since I can run matrix multiplication benchmarks on both GPUs at once

To rule it out 100%, try a power cap, e.g. nvidia-smi -pl 250.
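
For example (the 250 W value is just illustrative, and setting the limit needs root):

# Sketch: cap both GPUs and confirm the new limit took effect.
sudo nvidia-smi -pm 1                                  # persistence mode so the setting sticks
sudo nvidia-smi -i 0 -pl 250                           # power limit for GPU 0, in watts
sudo nvidia-smi -i 1 -pl 250                           # power limit for GPU 1
nvidia-smi --query-gpu=index,power.limit --format=csv  # verify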

rationalism commented 7 months ago

@geronimi73 tried that, didn't work :(

geronimi73 commented 7 months ago

I still suspect this is a hardware issue. I had something very similar with a 3090 that made the machine reboot randomly during training. In my case the temperatures already hinted at a problem: the faulty GPU ran 5-10 °C hotter than the others.
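
One way to watch for this is to log per-GPU temperature and power draw while training runs, e.g. something like:

# Sketch: sample per-GPU temperature, power and utilization every 5 seconds.
nvidia-smi \
  --query-gpu=timestamp,index,temperature.gpu,power.draw,utilization.gpu \
  --format=csv -l 5 | tee gpu_temps.csv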

You could confirm that the code/repo is not the problem by renting a 2x4090 machine and running your exact code, or by swapping each of your GPUs at home one by one with another 3090/4090 and seeing if the issue persists.

gjmulder commented 6 months ago

Maybe PCIe or memory bus errors under load? The matrix multiplication smoke tests probably don't stress the PCIe bus as much as dual-GPU training does.
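
A couple of quick checks that might surface bus errors (exactly which counters are reported varies by driver version):

# Sketch: look for PCIe trouble under load.
nvidia-smi dmon -s et                 # per-GPU error counters and PCIe throughput
nvidia-smi -q | grep -i -A2 replay    # PCIe replay counters, if the driver reports them
sudo dmesg | grep -iE 'pcie|aer|xid'  # kernel-side AER / Xid messages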

rationalism commented 6 months ago

I suspect this is actually a power supply issue after all: the power supply is definitely rated for the load, but there are reports of this particular model behaving very weirdly, passing benchmarks but then randomly shutting down under some workloads and not others. I'm testing a new power supply later this week and will see if that fixes it.

https://forums.tomshardware.com/threads/pc-randomly-shutting-down-and-rebooting-a-few-seconds-later-in-some-games-ran-occt-stress-test-on-psu-with-no-issues.3790396/
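
In case it helps anyone else debugging this: one way to gather evidence is to log power draw once per second so the last readings before a shutdown end up on disk (this can still miss millisecond-scale transients that trip PSU protection, so it's only suggestive):

# Sketch: append per-GPU power readings to a file once per second.
nvidia-smi \
  --query-gpu=timestamp,index,power.draw,power.limit \
  --format=csv -l 1 >> power_log.csv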

rationalism commented 6 months ago

@geronimi73 yeah, I think it was the power supply. I replaced it with a new one from a different manufacturer and that seems to have fixed it. This was happening at less than a third of the rated load, and in a way that didn't show up in stress tests or other training software 😠

Closing this out!