Closed: rationalism closed this issue 6 months ago.

Running fine-tuning with these settings makes my desktop instantly power off as soon as training starts:

I have 2x 4090 GPUs, Ubuntu 22.04, PyTorch 2.2.1, CUDA 12.1, bitsandbytes 0.43.0, transformers 4.39.2. I'm pretty sure it's not a power supply or thermal issue, since I can run matrix multiplication benchmarks on both GPUs at once, with both of them at 450 watts, and that works fine. Training using naive model parallelism with text-generation-webui also works fine.
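For reference, a minimal sketch of the kind of concurrent two-GPU matmul stress test described above (matrix size, dtype, and duration are illustrative assumptions, not the exact benchmark that was run):

```python
# Minimal dual-GPU matmul stress sketch: one thread per visible GPU, each
# running large fp16 GEMMs for a fixed duration.
import threading
import time

import torch

def matmul_stress(device: str, size: int = 8192, seconds: float = 60.0) -> None:
    a = torch.randn(size, size, device=device, dtype=torch.float16)
    b = torch.randn(size, size, device=device, dtype=torch.float16)
    deadline = time.time() + seconds
    while time.time() < deadline:
        c = a @ b                       # large GEMM keeps the GPU near full load
        torch.cuda.synchronize(device)  # wait so the loop tracks wall-clock time
    print(f"{device}: done, last result mean {c.float().mean().item():.3f}")

if __name__ == "__main__":
    threads = [
        threading.Thread(target=matmul_stress, args=(f"cuda:{i}",))
        for i in range(torch.cuda.device_count())
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
```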
To exclude the power supply 100%, try a power cap, e.g. `nvidia-smi -pl 250`.
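A small sketch for applying and checking that cap on every GPU (the 250 W value is just the example above; assumes `nvidia-smi` is on PATH, and changing the limit normally requires root):

```python
# Sketch: apply the suggested power cap to each GPU and print the limit
# nvidia-smi reports back afterwards.
import subprocess

CAP_WATTS = 250  # example value from the comment above

def run(args):
    return subprocess.run(args, capture_output=True, text=True, check=True).stdout

gpu_indices = run(["nvidia-smi", "--query-gpu=index", "--format=csv,noheader"]).split()

for idx in gpu_indices:
    run(["nvidia-smi", "-i", idx, "-pl", str(CAP_WATTS)])

print(run(["nvidia-smi", "--query-gpu=index,power.limit", "--format=csv"]))
```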
@geronimi73 tried that, didn't work :(
I still suspect this is a hardware issue. Had something very similar with a 3090 that made the machine reboot randomly during training. In my case the temps already suggested that one of the GPUs had a problem: the faulty card ran 5-10 °C hotter than the others. A simple per-GPU temperature log (see the sketch after this comment) makes that kind of outlier easy to spot.

You could try to confirm that the code/repo is not the problem by renting a 2x4090 machine and running your exact code, or by swapping each of your GPUs at home one by one for another 3090/4090, and seeing whether the issue persists.
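A minimal per-GPU monitoring sketch along those lines, assuming the nvidia-ml-py package (imported as `pynvml`) is installed; run it in a second terminal while training:

```python
# Sketch: log per-GPU temperature and power every couple of seconds so a card
# that runs consistently hotter stands out.
import time

import pynvml

pynvml.nvmlInit()
handles = [
    pynvml.nvmlDeviceGetHandleByIndex(i)
    for i in range(pynvml.nvmlDeviceGetCount())
]

try:
    while True:
        readings = []
        for i, h in enumerate(handles):
            temp = pynvml.nvmlDeviceGetTemperature(h, pynvml.NVML_TEMPERATURE_GPU)
            watts = pynvml.nvmlDeviceGetPowerUsage(h) / 1000.0  # NVML reports milliwatts
            readings.append(f"GPU{i}: {temp}C {watts:.0f}W")
        print(" | ".join(readings))
        time.sleep(2)
except KeyboardInterrupt:
    pynvml.nvmlShutdown()
```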
Maybe PCIe or memory bus errors under load? The matrix multiplication smoke tests probably don't stress the PCIe bus as much as dual GPU training.
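A quick sketch of a test that adds cross-GPU PCIe traffic on top of the matmuls (sizes and iteration count are arbitrary; assumes two visible CUDA devices):

```python
# Sketch of a PCIe-heavy stress: cross-device copies between cuda:0 and cuda:1
# interleaved with matmuls on both cards, to mimic dual-GPU training traffic.
import torch

assert torch.cuda.device_count() >= 2, "needs two GPUs"

size = 4096
a0 = torch.randn(size, size, device="cuda:0")
a1 = torch.randn(size, size, device="cuda:1")

for step in range(1000):
    # cross-device copies push traffic over the PCIe link in both directions
    b1 = a0.to("cuda:1", non_blocking=True)
    b0 = a1.to("cuda:0", non_blocking=True)
    # keep both GPUs computing at the same time; scale down so values stay bounded
    a0 = (a0 @ b0) / size
    a1 = (a1 @ b1) / size
    if step % 100 == 0:
        torch.cuda.synchronize("cuda:0")
        torch.cuda.synchronize("cuda:1")
        print(f"step {step} ok")

torch.cuda.synchronize("cuda:0")
torch.cuda.synchronize("cuda:1")
print("done")
```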
Suspect this is actually a power supply issue after all. The unit is definitely rated for the load, but there are reports of this particular model behaving very weirdly: it passes benchmarks but then randomly shuts down under some workloads and not others. I'm testing a new power supply later this week and will see whether that fixes it.
@geronimi73 yeah, I think it was the power supply. I replaced it with a new one from a different manufacturer and that seems to have fixed it. This was happening at less than a third of the rated load, and in a way that didn't affect stress tests or other training software.
Closing this out!