ROCm / rocm_smi_lib

ROCm SMI LIB
https://rocm.docs.amd.com/projects/rocm_smi_lib/en/latest/
MIT License

rocm-smi fails during initialization if old AMD GPUs are present #82

Closed 0cc4m closed 9 months ago

0cc4m commented 3 years ago

I have been using the deprecated rocm-smi for a while now to monitor the status of my GPUs. I have a FirePro S10000 (Tahiti), which works with amdgpu but either does not provide some of the hardware interfaces expected of newer GPUs (for example voltages, clocks, power draw/cap and gpu_busy_percent) or exposes them at a different path. This caused the now-deprecated rocm-smi to show a warning about being unable to read gpu_busy_percent, but otherwise it worked.

This new rocm-smi version sadly straight-up fails to deal with this and errors out during initialization.

> /opt/rocm/bin/rocm-smi
rsmi_init() failed
Exception caught: rsmi_init.
ERROR:root:ROCm SMI returned 8 (the expected value is 0)

I have already narrowed this initialization problem down to an attempt to read /sys/class/hwmon/hwmon2/in0_label, which does not exist on the hwmon devices of the Tahiti GPUs. The read yields an empty string, so the program then tries to look up "" in kVoltSensorNameMap, which throws an exception (Map::at).
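To illustrate the kind of tolerant lookup that would avoid this failure, here is a minimal Python sketch (not the library's actual C++ code). The table `VOLT_SENSOR_NAMES` is a hypothetical stand-in for kVoltSensorNameMap, and its single entry is an assumption. The idea is to treat a missing or empty label file as "sensor not present" and to use a lookup that returns a default instead of throwing.

```python
import os
from typing import Optional

# Hypothetical stand-in for kVoltSensorNameMap; the real table's
# contents are an assumption made for this sketch.
VOLT_SENSOR_NAMES = {"vddgfx": "VOLT_TYPE_VDDGFX"}


def read_label(path: str) -> Optional[str]:
    """Return the stripped file contents, or None if missing or empty."""
    try:
        with open(path, "r") as f:
            text = f.read().strip()
    except OSError:
        # File does not exist or cannot be read (as on the Tahiti hwmon).
        return None
    return text or None


def volt_sensor_type(hwmon_dir: str, index: int = 0) -> Optional[str]:
    """Map in<index>_label to a sensor type, or None if unsupported."""
    label = read_label(os.path.join(hwmon_dir, f"in{index}_label"))
    if label is None:
        return None
    # dict.get returns None for unknown labels instead of raising.
    return VOLT_SENSOR_NAMES.get(label)
```

With this shape, a caller can skip the voltage sensor for a device that lacks it and continue initializing the remaining devices.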

Even without this issue, these GPUs don't provide a frequency table (as far as I know), which causes another exception:

» ./rocm-smi
======================= ROCm System Management Interface =======================
================================= Concise Info =================================
python3: [..]/src/rocm_smi_lib-rocm-4.1.0/src/rocm_smi.cc:895: rsmi_status_t get_frequencies(amd::smi::DevInfoTypes, uint32_t, rsmi_frequencies_t*, uint32_t*): Assertion `f->frequency[i-1] <= f->frequency[i]' failed.
[1]    69803 abort (core dumped)  ./rocm-smi
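The assertion above aborts the whole process when the frequency list is not in ascending order. As a rough Python sketch of a more forgiving approach (again, not the library's code), the parser below returns None for an absent, malformed, or unsorted table so the caller can report "not supported" for that device. The function name `parse_frequencies` is hypothetical, and the "index: valueMhz [*]" line format is an assumption about what the sysfs frequency files contain.

```python
import re
from typing import List, Optional, Tuple

# Assumed line format: "<index>: <value>Mhz", with "*" marking the current level.
_LINE = re.compile(r"^\s*(\d+):\s*(\d+)\s*mhz\s*(\*)?\s*$", re.IGNORECASE)


def parse_frequencies(text: str) -> Optional[Tuple[List[int], Optional[int]]]:
    """Return (frequencies in Hz, index of current level), or None if unusable."""
    freqs: List[int] = []
    current: Optional[int] = None
    for line in text.splitlines():
        if not line.strip():
            continue
        m = _LINE.match(line)
        if m is None:
            return None  # unrecognized line: treat the table as unsupported
        if m.group(3):
            current = len(freqs)
        freqs.append(int(m.group(2)) * 1_000_000)
    if not freqs:
        return None  # empty or missing table
    # Same ordering condition as the assertion, but reported instead of aborting.
    if any(a > b for a, b in zip(freqs, freqs[1:])):
        return None
    return freqs, current
```

For example, `parse_frequencies("0: 300Mhz\n1: 608Mhz *\n")` yields the two levels in Hz with index 1 marked as current, while an empty string yields None.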

I don't expect rocm-smi to support these old GPUs, but it would be good if it still worked when old GPUs are present. Let me know if you need more information.

Relevant part of lspci:

0a:00.0 PCI bridge: PLX Technology, Inc. PEX 8747 48-Lane, 5-Port PCI Express Gen 3 (8.0 GT/s) Switch (rev ba)
0b:08.0 PCI bridge: PLX Technology, Inc. PEX 8747 48-Lane, 5-Port PCI Express Gen 3 (8.0 GT/s) Switch (rev ba)
0b:10.0 PCI bridge: PLX Technology, Inc. PEX 8747 48-Lane, 5-Port PCI Express Gen 3 (8.0 GT/s) Switch (rev ba)
0c:00.0 VGA compatible controller: Advanced Micro Devices, Inc. [AMD/ATI] Tahiti PRO GL [FirePro Series]
0c:00.1 Audio device: Advanced Micro Devices, Inc. [AMD/ATI] Tahiti HDMI Audio [Radeon HD 7870 XT / 7950/7970]
0d:00.0 Display controller: Advanced Micro Devices, Inc. [AMD/ATI] Tahiti PRO GL [FirePro Series]
0e:00.0 PCI bridge: Advanced Micro Devices, Inc. [AMD/ATI] Navi 10 XL Upstream Port of PCI Express Switch (rev c1)
0f:00.0 PCI bridge: Advanced Micro Devices, Inc. [AMD/ATI] Navi 10 XL Downstream Port of PCI Express Switch
10:00.0 VGA compatible controller: Advanced Micro Devices, Inc. [AMD/ATI] Navi 21 [Radeon RX 6800/6800 XT / 6900 XT] (rev c1)
10:00.1 Audio device: Advanced Micro Devices, Inc. [AMD/ATI] Device ab28

Hardware monitor files of Tahiti:

» ls /sys/class/drm/card0/device/hwmon/hwmon2/
device       fan1_input   freq1_input  freq2_input  name   pwm1         pwm1_max  subsystem   temp1_crit_hyst  temp1_label
fan1_enable  fan1_target  freq1_label  freq2_label  power  pwm1_enable  pwm1_min  temp1_crit  temp1_input      uevent

Hardware monitor files of Navi21:

» ls /sys/class/drm/card2/device/hwmon/hwmon4
device       fan1_target  in0_input       power1_cap      pwm1_max         temp1_emergency  temp2_emergency  temp3_emergency
fan1_enable  freq1_input  in0_label       power1_cap_max  pwm1_min         temp1_input      temp2_input      temp3_input
fan1_input   freq1_label  name            power1_cap_min  subsystem        temp1_label      temp2_label      temp3_label
fan1_max     freq2_input  power           pwm1            temp1_crit       temp2_crit       temp3_crit       uevent
fan1_min     freq2_label  power1_average  pwm1_enable     temp1_crit_hyst  temp2_crit_hyst  temp3_crit_hyst
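
Comparing the two listings, the Tahiti hwmon directory lacks entries such as in0_label, in0_input and power1_average that exist for Navi21. A small Python sketch for checking this up front is shown here; `missing_attributes` is a hypothetical helper, not part of rocm-smi.

```python
import os
from typing import Iterable, List


def missing_attributes(hwmon_dir: str, expected: Iterable[str]) -> List[str]:
    """Return the expected hwmon file names that are absent from hwmon_dir."""
    try:
        present = set(os.listdir(hwmon_dir))
    except OSError:
        # Directory itself is missing or unreadable: everything is absent.
        return sorted(expected)
    return sorted(name for name in expected if name not in present)
```

Called with the Tahiti path above and a list such as `["in0_label", "power1_average", "temp1_input"]`, it would return the names that are not in the listing, which a tool could then skip rather than fail on.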

dmitrii-galantsev commented 9 months ago

Non-existent or empty files should now be handled properly. Please reopen if it's still an issue.