AUTOMATIC1111 / stable-diffusion-webui

Stable Diffusion web UI
GNU Affero General Public License v3.0

[Bug]: Graphics overload causes the motherboard burned #11765

Open GenBill opened 1 year ago

GenBill commented 1 year ago

Is there an existing issue for this?

What happened?

While I was using a 4090 to generate large images, the computer shut down automatically due to overheating and then would not turn on again. Testing showed that the motherboard had burned out.

Steps to reproduce the problem

NO WAY TO BURN IT AGAIN

What should have happened?

Temperature detection should be added during operation and suspended if excessive temperature is detected. You can wait for a while to resume running, or simply stop the entire program.

Version or Commit where the problem happens

Version: v1.4.0+ (Latest Version)

What Python version are you running on ?

Python 3.10.x

What platforms do you use to access the UI ?

Linux

What device are you running WebUI on?

Nvidia GPUs (RTX 20 above)

Cross attention optimization

xformers

What browsers do you use to access the UI ?

Google Chrome

Command Line Arguments

only --xformers

List of extensions

No app extensions

Console logs

null

Additional information

No response

w-e-w commented 1 year ago

Use MSI Afterburner to undervolt, power limit, or thermal limit your GPU.

MadMinstrel commented 1 year ago

This is a responsibility of the BIOS and GPU firmware, never end-user software. You probably simply had a faulty motherboard. I suggest making use of your warranty and improving your cooling solution.

oliverban commented 1 year ago

Yes, I mean "Temperature detection should be added during operation and suspended if excessive temperature is detected. You can wait for a while to resume running, or simply stop the entire program." <--- This is entirely up to you and your setup and has nothing to do with A1111. Do you think games should monitor your GPU or the motherboard/OS? Yeah, I thought so.....

OT: Like someone else mentioned, the first thing to do when setting up a fresh PC is to install software for temperature control and monitoring. I use MSI Afterburner and HWMonitor. If your card is overheating, it might be a faulty BIOS, or you have a shitty case that is starved of air, or something else. Learn some PC building before running extremely computationally heavy stuff for hours on end on a graphics card that is made for gaming, not AI. I have 2x3090, and the upper card is always hotter (duh!), so I have undervolted it and tweaked the power limit, and I make sure an extra fan blows between the cards to give the upper card more air.

Devicetron commented 1 year ago

This isn't software related. It could be a fault in 4090s and their cables; Nvidia already investigated this last year: https://www.pcworld.com/article/1386084/nvidia-finally-responds-to-melting-rtx-4090-cable-controversy.html. Get better components, because your 4090 is very power demanding, and you should have very good cooling on any system.

GenBill commented 1 year ago

OK, maybe you can say "software doesn't cause hardware to burn", but this is a real case: when Amazon's MMORPG New World was first released, a lot of gamers' 3090s burned out due to abnormally high temperatures and very high transient power consumption while running the game.

Part of the cause lay in the NVIDIA graphics hardware, but Amazon also released an official patch for the game to keep the graphics card from being overloaded.

Perhaps the hardware is flawed, but good software logic can effectively prevent hardware failure.

GenBill commented 1 year ago

> Use MSI Afterburner to undervolt, power limit, or thermal limit your GPU.

I didn't overclock it. The machine's power limit and thermal limit are on by default.

GenBill commented 1 year ago

> This isn't software related. It could be a fault in 4090s and their cables; Nvidia already investigated this last year: https://www.pcworld.com/article/1386084/nvidia-finally-responds-to-melting-rtx-4090-cable-controversy.html. Get better components, because your 4090 is very power demanding, and you should have very good cooling on any system.

I'm using a 4090 laptop and the 4090 is soldered to the motherboard. It's unlikely that the solder joints are loose. (But it still may happen)

GenBill commented 1 year ago

I have a full warranty on my computer so I have nothing to lose other than some time spent. I just want to share this issue to protect everyone's machines.

Adding the temperature detection pause feature might not be that hard either. Just add something like the following to the forward step loop:

if hot and hot_sleep_on: sleep(1)

I can easily add it locally myself, but pushing it to the master branch will require your support.
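To make the idea concrete, here is a minimal Python sketch of such a pause, not the web UI's actual code and not the implementation of any existing extension. It assumes `nvidia-smi` is on the PATH, and the names `read_gpu_temperature` and `wait_until_cool` as well as the 80/70 °C thresholds are hypothetical choices for illustration. It reads only the GPU core temperature, so it would not notice a problem on the motherboard itself.

```python
import subprocess
import time
from typing import Callable


def read_gpu_temperature(gpu_index: int = 0) -> int:
    """Return the GPU core temperature in degrees Celsius, as reported by nvidia-smi."""
    output = subprocess.check_output(
        [
            "nvidia-smi",
            f"--id={gpu_index}",
            "--query-gpu=temperature.gpu",
            "--format=csv,noheader,nounits",
        ],
        text=True,
    )
    return int(output.strip().splitlines()[0])


def wait_until_cool(
    read_temperature: Callable[[], int] = read_gpu_temperature,
    pause_above: int = 80,      # hypothetical threshold: start pausing above this
    resume_below: int = 70,     # hypothetical threshold: resume at or below this
    poll_seconds: float = 1.0,
    max_wait_seconds: float = 300.0,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Block while the GPU is hot.

    Returns True if it is safe to continue, or False if the GPU did not
    cool down within max_wait_seconds (the caller may then stop the job).
    """
    if read_temperature() <= pause_above:
        return True  # not hot, continue immediately

    waited = 0.0
    while waited < max_wait_seconds:
        sleep(poll_seconds)
        waited += poll_seconds
        # Resume only once the temperature has dropped to the lower
        # threshold, so the loop does not flap around a single value.
        if read_temperature() <= resume_below:
            return True
    return False
```

The function would be called once per sampling step, and a `False` result would be the signal to stop the entire program. Where exactly to hook it into the forward step loop is left open here.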

highnrgappalachian commented 1 year ago

Reddit is that-a-way.

w-e-w commented 1 year ago

@GenBill here is what you asked for: https://github.com/w-e-w/stable-diffusion-webui-GPU-temperature-protection. I still say MSI Afterburner is the way to go if you can't improve the thermals physically. You're not the first one who has asked for such a feature, so I decided to slap together an extension to shut people up.

alphaomegarandomname commented 1 year ago

I have killed 2 motherboards so far. I doubt it is temperature related, as I keep watching the temperatures and they are perfect. It feels voltage related. It does take a few thousand images to do it. Maybe I'm just unlucky.