rajatmodi62 opened 1 year ago
You can use PyTorch NGC containers to make sure your software environment is set up properly. Also, while running the container, make sure that the container and driver are compatible and that CUDA is not running in compatibility mode. I had some issues too when I bought the newly released 3090 back in the day. I'm not sure these will fix your problem, but they worked for me with a new GPU in 2020.
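As a sketch, pulling and running a recent PyTorch NGC container might look like the following. The tag `23.10-py3` is only an example; check the NGC catalog for a release whose CUDA version matches your installed driver:

```shell
# Pull a PyTorch NGC container (tag is an example; pick one compatible
# with your driver from the NGC catalog) and run it with GPU access.
docker pull nvcr.io/nvidia/pytorch:23.10-py3
docker run --gpus all -it --rm \
    -v "$PWD":/workspace/code \
    nvcr.io/nvidia/pytorch:23.10-py3
```

Inside the container, `nvidia-smi` should report your driver, and PyTorch ships prebuilt with a matching CUDA toolkit, which rules out a broken local CUDA install.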
> for some reason, running pytorch codes on it crashes the whole gpu and causes the machine to reboot.
This sounds like a PSU issue and I would recommend replacing it (even if only temporarily) with another (larger) one.
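If you want evidence for or against the PSU theory before swapping hardware, one option (assuming `nvidia-smi` is on your PATH) is to log power draw and temperature while the training run is going; a transient power spike right before the reboot points at the supply:

```shell
# Sample power draw and temperature once per second during training;
# the log survives the reboot, so check its tail afterwards.
nvidia-smi --query-gpu=timestamp,power.draw,temperature.gpu \
    --format=csv -l 1 | tee power_log.csv
```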
Hi all, I am making this post after banging my head against this for the past two weeks of hell; my small brain has exhausted everything it can think of.
My setup:
- GPU: ASUS RTX 4090 TUF Gaming OC edition
- OS: Ubuntu 22.04
- CPU: 2x Intel Xeon 4214R
- PSU: 1650 W (well above the 450 W the 4090 requires)
- RAM: 100 GB
For some reason, running PyTorch code on it crashes the whole GPU and causes the machine to reboot. More specifically, training the official DETR and DINO-DETR repositories locally is enough to crash the machine.
Here are the debugging steps I have tried:
[1] Ran a simple PyTorch script that achieves 100% GPU utilization and reaches peak temps. This DOES NOT crash.
[2] Ran gpu-burn and memtest for over 12 hours -> all passed, which should rule out power supply/RAM issues.
[3] However, running the official DETR/DINO-DETR training code causes constant crashes; DINO-DETR can crash in as few as 40 training iterations.
[4] The machine works fine with other generations of cards like Quadro and Pascal.
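The kind of "simple PyTorch code" mentioned in [1] can be sketched like this (a minimal matmul stress loop, not the exact script used; `size` and `iters` are arbitrary choices):

```python
import torch

def stress(size=4096, iters=200):
    """Repeated large matmuls; on a GPU this pins utilization near 100%."""
    device = "cuda" if torch.cuda.is_available() else "cpu"
    x = torch.randn(size, size, device=device)
    for _ in range(iters):
        x = x @ x
        x = x / x.norm()  # renormalize so values stay finite
    if device == "cuda":
        torch.cuda.synchronize()  # surface any asynchronous CUDA error here
    return x.norm().item()

if __name__ == "__main__":
    print(stress())
```

That this loop runs cleanly at full load while DETR training crashes is what makes the failure look workload-specific (e.g. a particular kernel or memory-access pattern) rather than a simple thermal or power problem.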
This has led me to believe that it is a software issue and not a hardware one, since simple benchmarking works.
But my expertise stops short of knowing which precise call reproduces this issue, so I would be grateful if someone could give some suggestions:
[1] Is it an issue with my machine?
[2] Is it a driver issue or a PyTorch issue? I have tried both the latest dev (430) and stable (425) drivers, as well as CUDA 11, 11.7, and 12; all show the same issue.
[3] Or could it be a faulty GPU?
[4] Are other people also facing such trouble?
[5] I don't think it is a temperature issue, since most of the time the crashes happen even when the temperature is in range.
@ptrblck and everyone, I will be grateful for your guidance. Can you think of anything that might be the cause? Something (an API) somewhere is doing something it shouldn't :frowning: , but I can't figure out what. Thanks, rajat
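One quick sanity check worth adding to the list (just a diagnostic sketch, not a fix) is printing what the installed PyTorch build actually supports. The 4090 is compute capability 8.9, so the wheel's architecture list should cover `sm_89`; an older wheel that predates Ada can behave badly on this card:

```python
import torch

# Report the PyTorch build, the CUDA version it was built against,
# and (if a GPU is visible) the device name and compute capability.
print("torch:", torch.__version__)
print("built with CUDA:", torch.version.cuda)
if torch.cuda.is_available():
    print("device:", torch.cuda.get_device_name(0))
    print("capability:", torch.cuda.get_device_capability(0))
    print("arch list:", torch.cuda.get_arch_list())
```

Running this inside each environment you tried (CUDA 11 / 11.7 / 12) confirms the wheel, the toolkit, and the driver actually line up.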