Closed francis0407 closed 7 months ago
I have this issue on a Radeon VII when the card overheats. If the card shuts down from overheating, it can't be reset without cutting power to the card or doing a full system shutdown (a reboot is not enough). I have found that other software-related crashes often get reset and recovered from automatically, but the card reverts to its default settings when that happens.
Some things are only initialized during kernel boot, so the GPU reset doesn't always work. Sorry, I don't have more detail. I'd really love for GPU reset to work as well :)
I'm using an AMD GPU. Some of my code has bugs, and if I kill the process, the GPU becomes unavailable. The output of rocm-smi is:
If I use
rocm-smi --gpureset -d 0
to reset the GPU, the output is:
But the GPU is still not available unless I reboot the computer.
The rocm-smi documentation mentions:
Note that GPU reset will not always work, depending on the manner in which the GPU is hung.
If rocm-smi cannot reset the GPU, are there any other tools that can do that?
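For anyone scripting this, here is a rough Python sketch of the "reset, then check again" sequence described in this thread. It only wraps the `rocm-smi --gpureset -d <id>` command quoted above. The helper names (`build_reset_command`, `reset_and_check`, `run_subprocess`) are made up for illustration, and it assumes that rocm-smi exits non-zero on failure and that a plain `rocm-smi` call fails while the GPU is unusable. Neither assumption is confirmed here, so treat the result as a hint rather than proof.

```python
import subprocess
from typing import Callable, List, Sequence


def build_reset_command(device_id: int) -> List[str]:
    """Return the argv for the reset command quoted in this thread."""
    return ["rocm-smi", "--gpureset", "-d", str(device_id)]


def reset_and_check(device_id: int, run: Callable[[Sequence[str]], int]) -> str:
    """Attempt a reset, then probe the GPU again.

    `run` executes a command and returns its exit code.
    Returns "reset-failed", "still-unavailable", or "ok".
    """
    # Assumption: a non-zero exit code means the reset itself was rejected.
    if run(build_reset_command(device_id)) != 0:
        return "reset-failed"
    # A reset that reports success is not enough on its own (see the post
    # above), so probe again. Assumption: plain `rocm-smi` exits non-zero
    # while the GPU is still unusable.
    if run(["rocm-smi"]) != 0:
        return "still-unavailable"
    return "ok"


def run_subprocess(cmd: Sequence[str]) -> int:
    """Default runner: execute the command and return its exit code."""
    return subprocess.run(list(cmd), check=False).returncode
```

Calling `reset_and_check(0, run_subprocess)` on a machine with rocm-smi installed would try the reset on device 0. If it reports "still-unavailable", that matches the situation described above where only a reboot (or a full power-off for the overheating case) brings the card back.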