ROCm / rocm_smi_lib

ROCm SMI LIB
https://rocm.docs.amd.com/projects/rocm_smi_lib/en/latest/
MIT License

Is there any other way to reset the GPU besides rocm-smi? #85

Closed francis0407 closed 7 months ago

francis0407 commented 3 years ago

I'm using an AMD GPU. Some of my code has bugs, and if I kill the process, the GPU becomes unavailable. The output of rocm-smi is then:

======================= ROCm System Management Interface =======================
================================= Concise Info =================================
Expected integer value from monitor, but got ""
Expected integer value from monitor, but got ""
ERROR: 15 GPU[0]: power: Data (usually from reading a file) was not of the type that was expected   
================================================================================
================================================================================
Expected integer value from monitor, but got ""
ERROR: 15 GPU[0]:Data (usually from reading a file) was not of the type that was expected   
Expected integer value from monitor, but got ""
ERROR: 15 GPU[0]:Data (usually from reading a file) was not of the type that was expected   
GPU  Temp  AvgPwr  SCLK  MCLK  Fan   Perf     PwrCap       VRAM%  GPU%  
0    N/A   N/A     None  None  0.0%  unknown  Unsupported    0%   0%    
================================================================================
WARNING:         One or more commands failed
============================= End of ROCm SMI Log ==============================

If I use rocm-smi --gpureset -d 0 to reset the GPU, the output is:

======================= ROCm System Management Interface =======================
================================== Reset GPU ===================================
GPU[0]      : Successfully reset GPU 0
================================================================================

But the GPU is still not available unless I reboot the computer.
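The mismatch described above (the reset reports success, yet the GPU stays unusable) can be checked for automatically. The following is a minimal sketch, assuming `rocm-smi` is on the PATH and that its output resembles the logs pasted in this issue. The helper names and the list of failure markers are hypothetical, taken only from the strings shown above, and are not part of rocm-smi itself.

```python
import re
import subprocess

# Strings copied from the failing log in this issue; treating them as
# "unhealthy" markers is an assumption, not documented rocm-smi behavior.
FAILURE_MARKERS = (
    "ERROR:",
    "One or more commands failed",
    "Expected integer value from monitor",
)


def reset_reported_success(reset_output: str, device: int) -> bool:
    """True if the reset output contains the success line for this device."""
    return re.search(rf"Successfully reset GPU {device}\b", reset_output) is not None


def looks_healthy(concise_output: str) -> bool:
    """True if none of the known failure markers appear in the output."""
    return not any(marker in concise_output for marker in FAILURE_MARKERS)


def run_rocm_smi(*args: str) -> str:
    """Run rocm-smi with the given arguments and return stdout plus stderr."""
    result = subprocess.run(
        ["rocm-smi", *args], capture_output=True, text=True, check=False
    )
    return result.stdout + result.stderr


def reset_and_verify(device: int = 0) -> bool:
    """Reset the GPU, then confirm that a plain rocm-smi call reads cleanly."""
    reset_out = run_rocm_smi("--gpureset", "-d", str(device))
    if not reset_reported_success(reset_out, device):
        return False
    return looks_healthy(run_rocm_smi())
```

Calling `reset_and_verify(0)` would return `False` in the situation reported here, because the follow-up `rocm-smi` call still prints the error lines even though the reset claimed success.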

The rocm-smi documentation says: "Note that GPU reset will not always work, depending on the manner in which the GPU is hung."

If rocm-smi cannot reset the GPU, are there any other tools that can?
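One approach sometimes tried outside rocm-smi, though not mentioned or confirmed anywhere in this thread, is the generic Linux PCI sysfs interface: remove the device, then rescan the bus. The sketch below only builds and prints the writes by default. The PCI address is a hypothetical placeholder (find the real one with `lspci`), the helper names are made up for illustration, root is required to actually write, and there is no guarantee this recovers a hung GPU. It may also hang or destabilize the system if the driver is stuck.

```python
from pathlib import Path
from typing import List, Tuple

PCI_ROOT = Path("/sys/bus/pci")


def plan_remove_and_rescan(pci_address: str) -> List[Tuple[Path, str]]:
    """Return the sysfs writes for a PCI remove followed by a bus rescan."""
    return [
        (PCI_ROOT / "devices" / pci_address / "remove", "1"),
        (PCI_ROOT / "rescan", "1"),
    ]


def apply_plan(plan: List[Tuple[Path, str]], dry_run: bool = True) -> None:
    """Print each write; perform it only when dry_run is False (needs root)."""
    for path, value in plan:
        if dry_run:
            print(f"would write {value!r} to {path}")
        else:
            path.write_text(value)


if __name__ == "__main__":
    # "0000:03:00.0" is a placeholder address, not taken from this issue.
    apply_plan(plan_remove_and_rescan("0000:03:00.0"), dry_run=True)
```

Keeping the default as a dry run is deliberate: the writes are disruptive, so it seems safer to inspect them first and opt in explicitly.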

rakataprime commented 1 year ago

I have this issue on a Radeon VII when the card overheats. If the card shuts down from overheating, it can't be reset without cutting power to the card or doing a full system shutdown (a reboot is not enough). I have found that other, software-related crashes are often reset and recovered from automatically, but the card reverts to its default settings when that happens.

dmitrii-galantsev commented 7 months ago

Some things are only initialized during kernel boot, so the GPU reset doesn't always work. Sorry, I don't have more detail. I'd really love for GPU reset to work as well :)