facebookresearch / detr

End-to-End Object Detection with Transformers
Apache License 2.0

[CRASH!] System crash/reboot on RTX 4090 #575

Open rajatmodi62 opened 1 year ago

rajatmodi62 commented 1 year ago

Hi all, I am making this post after banging my head over these past 2 weeks of hell; my small brain has exhausted all the things it can think of.

My GPU is an ASUS RTX 4090 OC edition (TUF Gaming). OS: Ubuntu 22.04. Processor: 2x Intel Xeon 4214R. Power supply: 1650 W, well above the 450 W the 4090 requires. RAM: 100 GB.

For some reason, running PyTorch code on it crashes the whole GPU and causes the machine to reboot. More specifically, training the official DETR and DINO-DETR repositories locally is enough to crash the machine.

Here are the debugging steps I have tried: [1] Ran a simple PyTorch stress test:

import torch
import torch.nn as nn
import torch.optim as optim

# Large random input to keep the GPU fully loaded.
x = torch.randn((1000000, 700)).cuda()
print("shape", x.shape)

class model(torch.nn.Module):

    def __init__(self):
        super(model, self).__init__()
        self.l = nn.Linear(700, 1000)
        self.l2 = nn.Linear(1000, 700)

    def forward(self, x):
        x = self.l(x)
        x = self.l2(x)
        return x

m = model().cuda()
loss = nn.MSELoss()
adam = optim.Adam(m.parameters(), lr=0.001)

i = 0
while True:
    print(i)
    i += 1
    adam.zero_grad()            # reset gradients so they do not accumulate across iterations
    out = m(x)
    l = loss(out, out + 1e-7)   # synthetic target; the value does not matter, we only want the backward pass to run
    l.backward()
    adam.step()

This achieves 100% GPU utilization and reaches peak temperatures. This DOES NOT crash.

[2] Ran gpu-burn and memtest for over 12 hours -> all passed, which should rule out power supply/RAM issues.
[3] However, running the official DETR/DINO-DETR training code causes frequent crashes; DINO-DETR can crash within about 40 training iterations.
[4] The machine works fine with cards from other generations, such as Quadro and Pascal.

Now, this has led me to believe that this is a software issue and not a hardware one, since simple benchmarking works.
But my expertise stops at knowing which precise call reproduces this issue, so I would be grateful if someone could please give some suggestions:
[1] Is it a machine issue?
[2] Is it an NVIDIA driver issue or a PyTorch issue? I have tried both the latest dev (430) and stable (425) drivers. I also tried CUDA 11, 11.7 and 12; all have the same issue.
[3] Or can it be a GPU issue?
[4] Are other people also facing such troubles?
[5] I don't think it is a temperature issue, since most of the time the crashes happen even when temperatures are in range.
@ptrblck and everyone, I will be grateful for your guidance. Can you think of anything that might be the cause? Something (some API) somewhere is doing something it shouldn't :frowning: , but I can't think what. Thanks, Rajat
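
As a reference point between the minimal linear test above and full DETR training, a heavier stress test with a mixed conv + transformer workload under AMP might get closer to reproducing the crash. This is only a sketch; all layer sizes, batch sizes and input resolutions below are illustrative assumptions, not taken from the DETR code:

import torch
import torch.nn as nn

device = torch.device("cuda")

# Illustrative conv "backbone" that downsamples by 8x.
backbone = nn.Sequential(
    nn.Conv2d(3, 64, kernel_size=7, stride=2, padding=3), nn.ReLU(),
    nn.Conv2d(64, 128, kernel_size=3, stride=2, padding=1), nn.ReLU(),
    nn.Conv2d(128, 256, kernel_size=3, stride=2, padding=1), nn.ReLU(),
).to(device)

# Small transformer encoder over the flattened feature map.
encoder_layer = nn.TransformerEncoderLayer(d_model=256, nhead=8, batch_first=True)
encoder = nn.TransformerEncoder(encoder_layer, num_layers=6).to(device)

params = list(backbone.parameters()) + list(encoder.parameters())
opt = torch.optim.AdamW(params, lr=1e-4)
scaler = torch.cuda.amp.GradScaler()

step = 0
while True:
    # Vary the input resolution so kernel choices and power draw fluctuate,
    # which is closer to detection training than a fixed-shape loop.
    size = 512 + 64 * (step % 4)
    images = torch.randn(2, 3, size, size, device=device)

    opt.zero_grad(set_to_none=True)
    with torch.cuda.amp.autocast():
        feats = backbone(images)                   # (B, 256, H/8, W/8)
        tokens = feats.flatten(2).transpose(1, 2)  # (B, H/8 * W/8, 256)
        out = encoder(tokens)
        l = out.float().pow(2).mean()              # dummy loss, only there to drive backward()
    scaler.scale(l).backward()
    scaler.step(opt)
    scaler.update()

    step += 1
    if step % 10 == 0:
        print(f"step {step}, size {size}, loss {l.item():.6f}")

If this crashes the machine while the plain linear loop does not, that would at least narrow the trigger down to mixed precision / transformer-style kernels rather than raw load.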

dataplayer12 commented 1 year ago

You can use the PyTorch NGC containers to make sure your software environment is properly set up. Also, while running the container, make sure that the container and driver are compatible and that CUDA is not running in compatibility mode. I had some issues too when I bought the newly released 3090 back in the day. I'm not sure if these will fix your problems, but they worked for me with a new GPU in 2020.
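
For what it's worth, a quick sanity check of what PyTorch actually sees, inside the container or in the local environment, could look like the snippet below. Nothing here is DETR-specific, it just prints version and device information; for a 4090 (compute capability 8.9) the arch list should include sm_89, otherwise the build/environment is a likely suspect:

import torch

print("torch version:", torch.__version__)
print("CUDA available:", torch.cuda.is_available())
print("CUDA (torch build):", torch.version.cuda)
print("cuDNN version:", torch.backends.cudnn.version())
print("device name:", torch.cuda.get_device_name(0))
print("compute capability:", torch.cuda.get_device_capability(0))
print("supported arch list:", torch.cuda.get_arch_list())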

ptrblck commented 1 year ago

for some reason, running pytorch codes on it crashes the whole gpu and causes the machine to reboot.

This sounds like a PSU issue and I would recommend replacing it (even if only temporarily) with another (larger) one.
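
One way to check the PSU hypothesis before swapping hardware is to log power draw and temperature to a file while the DETR training runs, so the last samples before a reboot are preserved. A rough sketch using nvidia-smi's query flags (the log file name is arbitrary):

import os
import subprocess
import time

QUERY = "power.draw,temperature.gpu,utilization.gpu"
LOGFILE = "gpu_power_log.csv"   # arbitrary name

with open(LOGFILE, "a") as f:
    while True:
        out = subprocess.run(
            ["nvidia-smi", f"--query-gpu={QUERY}", "--format=csv,noheader"],
            capture_output=True, text=True, check=True,
        )
        line = time.strftime("%H:%M:%S") + ", " + out.stdout.strip()
        print(line)
        f.write(line + "\n")
        f.flush()
        os.fsync(f.fileno())  # force the write to disk so the last samples survive a hard reboot
        time.sleep(0.5)

If the log shows power spiking toward the card's limit right before the reboot, that would support the PSU/transient-load explanation; if power and temperature look normal at the moment of the crash, a driver or software cause becomes more plausible.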