IDEA-Research / DINO

[ICLR 2023] Official implementation of the paper "DINO: DETR with Improved DeNoising Anchor Boxes for End-to-End Object Detection"
Apache License 2.0

[CRASH!] System crash/reboot on RTX 4090 #180

Open rajatmodi62 opened 1 year ago

rajatmodi62 commented 1 year ago

Hi all, I am making this post after banging my head against this for the past two weeks; I have exhausted everything I can think of.

My GPU is an ASUS TUF Gaming RTX 4090 OC edition. OS: Ubuntu 22.04. Processor: 2x Intel Xeon 4214R. Power supply: 1650 W, well above the 450 W the 4090 requires. RAM: 100 GB.

For some reason, running PyTorch code on it crashes the whole machine and causes a reboot. More specifically, training the official DETR and DINO-DETR repositories locally is enough to crash the machine.

Here is the debugging I have tried: [1] Ran a simple PyTorch stress script:

import torch
import torch.nn as nn
import torch.optim as optim

# large input so the GPU stays fully loaded
x = torch.randn((1000000, 700)).cuda()
print("shape", x.shape)

class Model(nn.Module):
    def __init__(self):
        super().__init__()
        self.l1 = nn.Linear(700, 1000)
        self.l2 = nn.Linear(1000, 700)

    def forward(self, x):
        x = self.l1(x)
        x = self.l2(x)
        return x

m = Model().cuda()
loss_fn = nn.MSELoss()
adam = optim.Adam(m.parameters(), lr=0.001)

# infinite forward + backward + optimizer-step loop to keep the GPU at 100%
i = 0
while True:
    print(i)
    i += 1
    out = m(x)
    l = loss_fn(out, (out + 1e-7).detach())  # detach the target so only `out` gets gradients
    adam.zero_grad()                          # clear accumulated gradients each iteration
    l.backward()
    adam.step()

This achieves 100% GPU utilization and reaches peak temperatures, and it DOES NOT crash.

[2] Ran gpu-burn and memtest for over 12 hours; everything passed, which should rule out power-supply/RAM problems.
[3] However, running the official DETR/DINO-DETR training code crashes the machine repeatedly; DINO-DETR can crash within as few as 40 training iterations (power/temperature logging sketch below).
[4] The machine works fine with other generations of cards such as Quadro and Pascal.
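To see what the card is doing at the moment of the crash, here is a rough sketch of the power/temperature logging I plan to run in a second terminal while training. This assumes the pynvml bindings (e.g. pip install nvidia-ml-py); the sampling interval and file name are just placeholders:

import os
import time
import pynvml

pynvml.nvmlInit()
handle = pynvml.nvmlDeviceGetHandleByIndex(0)  # GPU 0

with open("gpu_stats.log", "a") as f:
    while True:
        power_w = pynvml.nvmlDeviceGetPowerUsage(handle) / 1000.0  # NVML reports milliwatts
        temp_c = pynvml.nvmlDeviceGetTemperature(handle, pynvml.NVML_TEMPERATURE_GPU)
        f.write(f"{time.time():.3f}  {power_w:.1f} W  {temp_c} C\n")
        f.flush()
        os.fsync(f.fileno())  # force to disk so the last samples survive the reboot
        time.sleep(0.2)

After the machine comes back up, the tail of gpu_stats.log should show whether there was a power or temperature spike right before the reset.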

Now, this has led me to believe that this is a software issue and not a hardware one, since simple benchmarking works.
But my expertise stops short of knowing which precise call reproduces the issue (a sketch of how I plan to narrow that down is at the end of this post), so I would be grateful for any suggestions:
[1] Is it a problem with my machine?
[2] Is it an NVIDIA driver issue or a PyTorch issue? I have tried both the latest dev (430) and stable (425) drivers, as well as CUDA 11, 11.7, and 12; all show the same behaviour.
[3] Or could the GPU itself be faulty?
[4] Are other people also facing such troubles?
[5] I don't think it is a temperature issue, since the crashes mostly happen even when the temperature is in range.
@ptrblck and everyone, I will be grateful for your guidance. Can you think of anything that might be the cause? Something (some API) somewhere is doing something it shouldn't :frowning:, but I can't think of what. Thanks, Rajat
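For reference, here is the kind of instrumentation I plan to add around the training step to at least find out which phase is running when the machine dies. CUDA_LAUNCH_BLOCKING and torch.cuda.synchronize() are standard PyTorch/CUDA facilities; the step function below is only a hypothetical sketch of a DETR-style iteration (the real training loop in the repo is more involved), the point is just where the markers go:

# must be set before CUDA is initialised so kernel launches become synchronous
import os
os.environ["CUDA_LAUNCH_BLOCKING"] = "1"

import torch

def instrumented_step(model, criterion, optimizer, samples, targets, it, log):
    def mark(msg):
        torch.cuda.synchronize()              # wait for all queued kernels to finish
        log.write(f"iter {it}: {msg}\n")
        log.flush()
        os.fsync(log.fileno())                # force to disk so the marker survives a reboot

    mark("start")
    outputs = model(samples)
    mark("forward done")
    loss_dict = criterion(outputs, targets)
    total_loss = sum(loss_dict.values())      # simplified; the repo's loop weights the loss dict
    mark("loss done")
    optimizer.zero_grad()
    total_loss.backward()
    mark("backward done")
    optimizer.step()
    mark("step done")

The last marker in the log after the reboot should at least tell whether the crash happens during the forward pass, the backward pass, or the optimizer step.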

YejinHwang909 commented 1 year ago

@rajatmodi62 Hi!! Has this problem been solved?

rajatmodi62 commented 1 year ago

Hi... no, sadly it seems the latest RTX cards are prone to this kind of issue; several people I know are seeing the same thing across different machines.

foolLain commented 8 months ago

Hi! Have you solved this problem? My GPU server has four RTX 4090 cards, and it reboots as soon as I run some code. The server has an adequate PSU (2x 2000 W), so I also think it's a software issue, but I have no idea what causes it.